[DRBD-user] Server reboot on DRBD heavy load problem

J. Ryan Earl oss at jryanearl.us
Tue Sep 28 16:03:45 CEST 2010



>
> I also tried kernel 2.6.35.6 with DRBD 8.3.8
> and got this:
>
> [55921.451479] ------------[ cut here ]------------
> [55921.451506] kernel BUG at mm/slub.c:2834!
> [55921.451530] invalid opcode: 0000 [#1] SMP
> [55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
> [55921.451584] CPU 1
> [55921.451589] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport
> sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4
> xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs
> ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2
> loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev
> button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse
> dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd
> cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd
> megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal
> thermal_sys [last unloaded: drbd]
> [55921.451964]
> [55921.451984] Pid: 2995, comm: udevd Not tainted 2.6.35.6 #1
> 0NH278/PowerEdge 2950
> [55921.452027] RIP: 0010:[<ffffffff810df05d>]  [<ffffffff810df05d>]
> kfree+0x5b/0xc8
> [55921.452076] RSP: 0018:ffff88012aa61d58  EFLAGS: 00010246
> [55921.452102] RAX: 0200000000000400 RBX: ffff880100000001 RCX:
> 0000000000000002
> [55921.452131] RDX: ffffea0000000000 RSI: ffffea0003800000 RDI:
> ffff880100000001
> [55921.452160] RBP: ffff8800375d8f00 R08: 0000000000000000 R09:
> 0000000000000000
> [55921.452189] R10: ffff88012bce1070 R11: ffff8800375d8f00 R12:
> ffffffff810f061e
> [55921.452219] R13: 0000000018000040 R14: ffff88012c375cf0 R15:
> ffff88012bce1070
> [55921.452248] FS:  00007f7646a967a0(0000) GS:ffff880001a40000(0000)
> knlGS:0000000000000000
> [55921.452293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [55921.452319] CR2: 00007f7646a9c000 CR3: 000000012d245000 CR4:
> 00000000000006e0
> [55921.452349] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [55921.452378] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [55921.452407] Process udevd (pid: 2995, threadinfo ffff88012aa60000, task
> ffff880121f4d890)
> [55921.452451] Stack:
> [55921.452471]  0000000000000000 ffff8800375d8f00 ffff88012bce1070
> ffffffff810f061e
> [55921.452505] <0> ffff880108000080 000000002bce1070 ffff88012c3759d0
> ffff880100000001
> [55921.452556] <0> 0000029d0000029d ffff8800375d8fa0 ffff88012f8a4900
> ffff8800375d8f00
> [55921.452623] Call Trace:
> [55921.452647]  [<ffffffff810f061e>] ? vfs_rename+0x3d3/0x3e4
> [55921.452674]  [<ffffffff810f1c78>] ? sys_renameat+0x1aa/0x22b
> [55921.452702]  [<ffffffff810d13ab>] ? free_pages_and_swap_cache+0x53/0x6e
> [55921.452732]  [<ffffffff810c83fb>] ? tlb_finish_mmu+0x2a/0x33
> [55921.452759]  [<ffffffff810c8470>] ? remove_vma+0x6c/0x74
> [55921.452786]  [<ffffffff810c95d8>] ? do_munmap+0x307/0x329
> [55921.452814]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
> [55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00
> 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04
> <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
> [55921.453030] RIP  [<ffffffff810df05d>] kfree+0x5b/0xc8
> [55921.453057]  RSP <ffff88012aa61d58>
> [55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
> [55921.454368] JBD: Ignoring recovery information on journal
> [55921.461099] general protection fault: 0000 [#2] SMP
> [55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
> [55921.461338] CPU 1
> [55921.461385] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport
> sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4
> xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs
> ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2
> loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev
> button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse
> dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd
> cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd
> megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal
> thermal_sys [last unloaded: drbd]
> [55921.464840]
> [55921.464902] Pid: 9281, comm: mount.ocfs2 Tainted: G      D 2.6.35.6 #1
> 0NH278/PowerEdge 2950
> [55921.464990] RIP: 0010:[<ffffffff810dffaa>]  [<ffffffff810dffaa>]
> __kmalloc+0xd3/0x136
> [55921.465065] RSP: 0018:ffff880103e21ba8  EFLAGS: 00010006
> [55921.465065] RAX: 0000000000000000 RBX: 0800000000000000 RCX:
> ffffffffa0449421
> [55921.465065] RDX: 0000000000000000 RSI: ffff88012cfaf000 RDI:
> 0000000000000004
> [55921.465065] RBP: ffffffff81625520 R08: ffff880001a524d0 R09:
> 0000000000000000
> [55921.465065] R10: ffff88012cfaf260 R11: ffff88012ca24420 R12:
> 000000000000000a
> [55921.465065] R13: 00000000000080d0 R14: 00000000000080d0 R15:
> 0000000000000246
> [55921.465065] FS:  00007fee60afe720(0000) GS:ffff880001a40000(0000)
> knlGS:0000000000000000
> [55921.465065] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [55921.465065] CR2: 00007f764630ab8c CR3: 000000012eae3000 CR4:
> 00000000000006e0
> [55921.465065] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [55921.465065] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [55921.465065] Process mount.ocfs2 (pid: 9281, threadinfo ffff880103e20000,
> task ffff88012ca24420)
> [55921.465065] Stack:
> [55921.465065]  0000000000000000 ffffffffa0449421 ffff88012cfaf108
> ffff88012cfaf000
> [55921.465065] <0> ffff88012cfaf000 ffff88012cfaf000 ffff88012aa2e000
> ffff88012ca24420
> [55921.465065] <0> 0000000000000200 ffffffffa0449421 0000000000000000
> ffffffffa044ccec
> [55921.465065] Call Trace:
> [55921.465065]  [<ffffffffa0449421>] ?
> ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
> [55921.465065]  [<ffffffffa0449421>] ?
> ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
> [55921.465065]  [<ffffffffa044ccec>] ? ocfs2_journal_load+0x1d0/0x2b1
> [ocfs2]
> [55921.465065]  [<ffffffffa0473525>] ? ocfs2_fill_super+0x19a2/0x2101
> [ocfs2]
> [55921.465065]  [<ffffffff8118aa8f>] ? snprintf+0x36/0x3b
> [55921.465065]  [<ffffffff810e9f9e>] ? get_sb_bdev+0x137/0x19a
> [55921.465065]  [<ffffffffa0471b83>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
> [55921.465065]  [<ffffffff810e9675>] ? vfs_kern_mount+0xa6/0x196
> [55921.465065]  [<ffffffff810e97c4>] ? do_kern_mount+0x49/0xe7
> [55921.465065]  [<ffffffff810fdabb>] ? do_mount+0x75c/0x7d6
> [55921.465065]  [<ffffffff810d829a>] ? alloc_pages_current+0x9f/0xc2
> [55921.465065]  [<ffffffff810fdbbd>] ? sys_mount+0x88/0xc3
> [55921.465065]  [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
> [55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b
> 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18
> <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
> [55921.465065] RIP  [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
> [55921.465065]  RSP <ffff880103e21ba8>
> [55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---
> [55941.839304] o2net: accepted connection from node mail02.fxclub.org (num
> 1) at 192.168.1.2:7777
> [55946.003594] o2dlm: Node 1 joins domain E4B99C68B65449068DC403326917DC29
> [55946.003673] o2dlm: Nodes in domain E4B99C68B65449068DC403326917DC29: 0 1
>
>
> After a few minutes:
>
> Message from syslogd at mail01 at Sep 28 07:27:03 ...
>  kernel:[57519.645448] general protection fault: 0000 [#3] SMP
>
> Message from syslogd at mail01 at Sep 28 07:27:03 ...
>  kernel:[57519.645615] last sysfs file: /sys/module/drbd/parameters/cn_idx
>
> Message from syslogd at mail01 at Sep 28 07:27:03 ...
>  kernel:[57519.649409] Stack:
>
> Message from syslogd at mail01 at Sep 28 07:27:03 ...
>  kernel:[57519.649409] Call Trace:
>
> Message from syslogd at mail01 at Sep 28 07:27:03 ...
>  kernel:[57519.649409] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00
> 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48
> 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
>
> And then it halts.



Clearly an OCFS2 problem: the second trace faults in mount.ocfs2, in __kmalloc,
with ocfs2_compute_replay_slots and ocfs2_journal_load on the call stack.

Having built Oracle RAC clusters on both OCFS and OCFS2, I would say that OCFS2
just isn't stable enough for anything beyond light use.  There have been open
bugs outstanding for years with no development on them.  You should try GFS2,
which is production-ready.
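
As a rough illustration of how the quoted traces can be read, here is a minimal
Python sketch that counts the symbol references in an oops by owning module.
The function name frames_by_module and the "vmlinux" label for built-in symbols
are made up for this example; nothing here comes from DRBD, OCFS2, or any
kernel tooling.

```python
import re
from collections import Counter

# Matches one symbol reference such as
#   [<ffffffffa044ccec>] ? ocfs2_journal_load+0x1d0/0x2b1 [ocfs2]
# The trailing "[module]" is absent for symbols built into the kernel image.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)

def frames_by_module(oops_text):
    """Count [<addr>] symbol+off/len references per owning module.

    Built-in symbols are counted under the hypothetical label "vmlinux".
    """
    # Undo mail quoting ("> ") and line wrapping so wrapped frames match whole.
    flat = " ".join(line.lstrip("> ").strip() for line in oops_text.splitlines())
    counts = Counter()
    for _symbol, module in FRAME.findall(flat):
        counts[module or "vmlinux"] += 1
    return counts
```

Fed the second call trace above, this would report the ocfs2 frames separately
from the built-in ones.  A count like this is only a hint about where to look,
not proof of which module is at fault.
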
-jr