[DRBD-user] Kernel panic with CentOS 6.0, drbd, pacemaker

Lars Ellenberg lars.ellenberg at linbit.com
Wed Aug 24 16:17:21 CEST 2011



On Wed, Aug 24, 2011 at 11:27:56AM +0200, Peter Hinse wrote:
> Hi all,
> 
> I am trying to set up a KVM cluster with CentOS 6.0, corosync/pacemaker,
> dual-primary drbd and KVM. Whenever I restart the corosync process or
> reboot one of the machines, I get a kernel panic and one (or even both)
> machines die.
> 
> I tried all the tips I found in mailing lists or bug trackers, like
> loading the drbd module with disable_sendpage=1 or disabling
> checksumming and generic segmentation offload via ethtool.

That is to be expected; none of those has anything to do with this issue.

> Same happens with drbd83 and drbd84 packages from elrepo and with a
> self-compiled drbd84 from linbit sources.

I don't think this is anything DRBD specific.

The oops is triggered from drbdadm, a normal user-space tool,
while it performs an ioctl on what appears to be a socket.

An ioctl on a socket file descriptor issued from user space
should never be able to trigger an oops.

Try searching for similar symptoms that do not involve DRBD.


Some more comments:

> /etc/drbd.conf:
> 
> global {
>   dialog-refresh	1;
>   minor-count		5;
>   usage-count		no;
> }
> 
> common {
> }
> 
> resource r0 {
>   protocol		C;
>   disk {
>     on-io-error		pass_on;

You actually want "detach" there.
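For reference, the disk section would then read as below; with "detach", DRBD drops a failing lower-level device and continues diskless over the network instead of passing the error up:

```
  disk {
    on-io-error		detach;
  }
```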

>   }
> 
>   syncer {
>     rate		100M;
>   }
> 
>   net {
>     allow-two-primaries yes;
>     after-sb-0pri	discard-zero-changes;
>     after-sb-1pri	discard-secondary;

You are configuring automatic data loss here.
I hope this was a conscious decision.

>     after-sb-2pri	disconnect;
>   }


You need 
	fencing resource-and-stonith;
and appropriate fencing handlers (the "obliterate peer" one
would probably be the right one).  Of course you need stonith configured
and working in your cluster first.
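Sketched in drbd.conf terms (the handler paths below are the stock scripts shipped with DRBD for pacemaker clusters and may differ per install; an "obliterate peer" style handler from third-party cluster tutorials would go in the same place):

```
resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    # example paths -- adjust to where your DRBD package installs them
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```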

>   startup {
>     wfc-timeout		10;
>     become-primary-on	both;

This is a "fair weather setup".  It will fail (that is, behave in strange
and unexpected ways) when things go wrong.

Getting a DRBD dual-primary cluster file system setup to work reliably
in the face of errors is a bit more complex.

And you really need fencing (stonith).
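On the pacemaker side that means a working fence device per node and stonith actually enabled. A hypothetical crm-shell sketch, where the agent name and all parameters are placeholders for whatever fencing hardware you have:

```
# placeholder agent and credentials -- substitute your hardware's
primitive st-proxy03 stonith:external/ipmi \
    params hostname=proxy03 ipaddr=192.0.2.3 userid=admin passwd=secret
primitive st-proxy04 stonith:external/ipmi \
    params hostname=proxy04 ipaddr=192.0.2.4 userid=admin passwd=secret
# a node must never be responsible for fencing itself
location l-st-proxy03 st-proxy03 -inf: proxy03
location l-st-proxy04 st-proxy04 -inf: proxy04
property stonith-enabled=true
```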

>   }
> 
>   on proxy03 {
>     device		/dev/drbd0;
>     address		10.10.10.27:7788;
>     meta-disk		internal;
>     disk		/dev/sysvg/kvm;
>   }
>   on proxy04 {
>     device		/dev/drbd0;
>     address		10.10.10.28:7788;
>     meta-disk		internal;
>     disk		/dev/sysvg/kvm;
>   }
> }
> 
> last messages from /var/log/messages:
> 
> Aug 24 10:43:55 proxy03 kernel: d-con r0: Handshake successful: Agreed
> network protocol version 100
> Aug 24 10:43:55 proxy03 kernel: d-con r0: conn( WFConnection ->
> WFReportParams )
> Aug 24 10:43:55 proxy03 kernel: d-con r0: Starting asender thread (from
> drbd_r_r0 [19247])
> Aug 24 10:43:55 proxy03 kernel: block drbd0: drbd_sync_handshake:
> Aug 24 10:43:55 proxy03 kernel: block drbd0: self
> 52406041848E78A3:F32F8530A9B9C955:66C1B63DDC072892:66C0B63DDC072893
> bits:0 flags:0
> Aug 24 10:43:55 proxy03 kernel: block drbd0: peer
> F32F8530A9B9C954:0000000000000000:66C1B63DDC072893:66C0B63DDC072893
> bits:0 flags:0
> Aug 24 10:43:55 proxy03 kernel: block drbd0: uuid_compare()=1 by rule 70
> Aug 24 10:43:55 proxy03 kernel: block drbd0: peer( Unknown -> Secondary
> ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
> Aug 24 10:43:55 proxy03 kernel: block drbd0: send bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 99.9%
> Aug 24 10:43:55 proxy03 kernel: block drbd0: receive bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 99.9%
> Aug 24 10:43:55 proxy03 kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-source minor-0
> Aug 24 10:43:55 proxy03 kernel: BUG: unable to handle kernel NULL
> pointer dereference at 0000000000000038
> Aug 24 10:43:55 proxy03 kernel: IP: [<ffffffff813fda60>]
> sock_ioctl+0x30/0x280
> Aug 24 10:43:55 proxy03 kernel: PGD 242b39067 PUD 2422a0067 PMD 0
> Aug 24 10:43:55 proxy03 kernel: Oops: 0000 [#1] SMP
> 
> Message from syslogd at proxy03 at Aug 24 10:43:55 ...
>  kernel:Oops: 0000 [#1] SMP
> Aug 24 10:43:55 proxy03 kernel: last sysfs file:
> /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
> 
> Message from syslogd at proxy03 at Aug 24 10:43:55 ...
>  kernel:last sysfs file:
> /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
> Aug 24 10:43:55 proxy03 kernel: CPU 3
> Aug 24 10:43:55 proxy03 kernel: Modules linked in: sctp gfs2 dlm
> configfs drbd(U) libcrc32c sunrpc cpufreq_ondemand acpi_cpufreq
> freq_table bonding ipv6 dm_mirror dm_region_hash dm_log cdc_ether usbnet
> mii serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg shpchp
> ioatdma dca i7core_edac edac_core bnx2 ext3 jbd mbcache sd_mod
> crc_t10dif megaraid_sas ata_generic pata_acpi ata_piix dm_mod [last
> unloaded: microcode]
> Aug 24 10:43:55 proxy03 kernel:
> Aug 24 10:43:55 proxy03 kernel: Modules linked in: sctp gfs2 dlm
> configfs drbd(U) libcrc32c sunrpc cpufreq_ondemand acpi_cpufreq
> freq_table bonding ipv6 dm_mirror dm_region_hash dm_log cdc_ether usbnet
> mii serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg shpchp
> ioatdma dca i7core_edac edac_core bnx2 ext3 jbd mbcache sd_mod
> crc_t10dif megaraid_sas ata_generic pata_acpi ata_piix dm_mod [last
> unloaded: microcode]
> Aug 24 10:43:55 proxy03 kernel: Pid: 20331, comm: drbdadm Not tainted
> 2.6.32-71.29.1.el6.x86_64 #1 System x3550 M3 -[7944KBG]-
> Aug 24 10:43:55 proxy03 kernel: RIP: 0010:[<ffffffff813fda60>]
> [<ffffffff813fda60>] sock_ioctl+0x30/0x280
> Aug 24 10:43:55 proxy03 kernel: RSP: 0018:ffff880242949e38  EFLAGS: 00010282
> Aug 24 10:43:55 proxy03 kernel: RAX: 0000000000000000 RBX:
> 0000000000005401 RCX: 00007fff34be3c40
> Aug 24 10:43:55 proxy03 kernel: RDX: 00007fff34be3c40 RSI:
> 0000000000005401 RDI: ffff880242b0b840
> Aug 24 10:43:55 proxy03 kernel: RBP: ffff880242949e58 R08:
> ffffffff81536380 R09: 000000316920e930
> Aug 24 10:43:55 proxy03 kernel: R10: 00007fff34be3a50 R11:
> 0000000000000202 R12: 00007fff34be3c40
> Aug 24 10:43:55 proxy03 kernel: R13: 00007fff34be3c40 R14:
> ffff880252493140 R15: 0000000000000000
> Aug 24 10:43:55 proxy03 kernel: FS:  00007fe14fe14700(0000)
> GS:ffff88002f660000(0000) knlGS:0000000000000000
> Aug 24 10:43:55 proxy03 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Aug 24 10:43:55 proxy03 kernel: CR2: 0000000000000038 CR3:
> 0000000242196000 CR4: 00000000000006e0
> Aug 24 10:43:55 proxy03 kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000
> Aug 24 10:43:55 proxy03 kernel: DR3: 0000000000000000 DR6:
> 00000000ffff0ff0 DR7: 0000000000000400
> Aug 24 10:43:55 proxy03 kernel: Process drbdadm (pid: 20331, threadinfo
> ffff880242948000, task ffff8802714c34e0)
> Aug 24 10:43:55 proxy03 kernel: Stack:
> 
> Message from syslogd at proxy03 at Aug 24 10:43:55 ...
>  kernel:Stack:
> Aug 24 10:43:55 proxy03 kernel: ffff880242b0b840 ffff880252493188
> 00007fff34be3c40 0000000000000000
> Aug 24 10:43:55 proxy03 kernel: <0> ffff880242949e98 ffffffff8117fdf2
> ffff880242949eb8 0000000000000001
> Aug 24 10:43:55 proxy03 kernel: <0> 0000000000402340 0000003169ad9050
> ffff8802429db080 ffff880242b0b840
> Aug 24 10:43:55 proxy03 kernel: Call Trace:
> 
> Message from syslogd at proxy03 at Aug 24 10:43:55 ...
>  kernel:Call Trace:
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff8117fdf2>] vfs_ioctl+0x22/0xa0
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff8117ff94>] do_vfs_ioctl+0x84/0x580
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff8113676d>] ?
> handle_mm_fault+0x1ed/0x2b0
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff81180511>] sys_ioctl+0x81/0xa0
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff81013172>]
> system_call_fastpath+0x16/0x1b
> Aug 24 10:43:55 proxy03 kernel: Code: 83 ec 20 48 89 1c 24 4c 89 64 24
> 08 4c 89 6c 24 10 4c 89 74 24 18 0f 1f 44 00 00 4c 8b b7 a0 00 00 00 89
> f3 49 89 d4 49 8b 46 38 <4c> 8b 68 38 8d 83 10 76 ff ff 83 f8 0f 76 51
> 8d 83 00 75 ff ff
> 
> Message from syslogd at proxy03 at Aug 24 10:43:55 ...
>  kernel:Code: 83 ec 20 48 89 1c 24 4c 89 64 24 08 4c 89 6c 24 10 4c 89
> 74 24 18 0f 1f 44 00 00 4c 8b b7 a0 00 00 00 89 f3 49 89 d4 49 8b 46 38
> <4c> 8b 68 38 8d 83 10 76 ff ff 83 f8 0f 76 51 8d 83 00 75 ff ff
> Aug 24 10:43:55 proxy03 kernel: RIP  [<ffffffff813fda60>]
> sock_ioctl+0x30/0x280
> Aug 24 10:43:55 proxy03 kernel: RSP <ffff880242949e38>
> Aug 24 10:43:55 proxy03 kernel: CR2: 0000000000000038
> 
> Message from syslogd at proxy03 at Aug 24 10:43:55 ...
>  kernel:CR2: 0000000000000038
> Aug 24 10:43:55 proxy03 kernel: ---[ end trace 2a8c21ee3fd5b98d ]---
> Aug 24 10:43:55 proxy03 kernel: Kernel panic - not syncing: Fatal exception
> 
> Message from syslogd at proxy03 at Aug 24 10:43:55 ...
>  kernel:Kernel panic - not syncing: Fatal exception
> Aug 24 10:43:55 proxy03 kernel: Pid: 20331, comm: drbdadm Tainted: G
>   D    ----------------  2.6.32-71.29.1.el6.x86_64 #1
> Aug 24 10:43:55 proxy03 kernel: Call Trace:
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff814c8b54>] panic+0x78/0x137
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff814ccc24>] oops_end+0xe4/0x100
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff8104656b>] no_context+0xfb/0x260
> Aug 24 10:43:55 proxy03 kernel: [<ffffffff810467f5>]
> __bad_area_nosemaphore+0x125/0x1e0
> 
> Any ideas? More information needed?
> 
> Regards,
> 
> 	Peter

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


