Note: "permalinks" may not be as permanent as we would like,
direct links to old sources may well be a few messages off.
On Wed, Jun 02, 2010 at 01:17:40PM +1000, Shane Goulden wrote:
> I've submitted a bug with CentOS about this
> (http://bugs.centos.org/view.php?id=4354) but figured I might as well try
> posting to this list as well.
>
> I'm getting random crashes with a Xen+DRBD setup.

Sorry for the late reply. See below.

> As per the bug report at bugs.centos.org:
>
> This has been happening for quite some time now (long before
> 2.6.18-194.3.1.el5xen).
>
> It happens with this combo:
> xen-3.0.3-94.el5_4.3
> drbd83-8.3.2-6.el5_3
> kmod-drbd83-xen-8.3.2-6.el5_3
>
> And it happened just now with this combo:
> xen-3.0.3-105.el5_5.2
> drbd83-8.3.7-1.el5.centos (from CentOS testing)
> kmod-drbd83-xen-8.3.7-2.el5.centos (from CentOS testing)
>
> I've run hardware diagnostics and it always comes back fine. I don't
> understand the kernel panic so maybe someone can help me out. Any idea
> at all as to what it could be? I have a vmcore crash dump from the last
> crash if more information is required.
>
> This is the kernel bug message:
>
> BUG: unable to handle kernel paging request at virtual address e00ce5f0
> printing eip:
> c04ecc1e
> 204df000 -> *pde = 00000001:0da5e001
> 2105e000 -> *pme = 00000000:3e0f9067
> 000f9000 -> *pte = 00000000:00000000
> Oops: 0000 [#1]
> SMP
> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
> Modules linked in: xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge drbd(U) autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi ac parport_pc lp parport joydev sr_mod 8250_pnp sg serio_raw i5000_edac edac_mc 8250 pcspkr ide_cd serial_core cdrom bnx2 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage ata_piix libata mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> CPU: 5
> EIP: 0061:[<c04ecc1e>] Tainted: G VLI
> EFLAGS: 00010286 (2.6.18-194.3.1.el5xen #1)
> EIP is at csum_partial+0xca/0x120
> eax: 00000000  ebx: c04ecc1e  ecx: 0000000b  edx: 000005a8
> esi: e00ce618  edi: 000005a8  ebp: 00000034  esp: dbcc2b4c
> ds: 007b  es: 007b  ss: 0069
> Process drbd2_receiver (pid: 6635, ti=dbcc2000 task=ed7abaa0 task.ti=dbcc2000)
> Stack: e00ce000 00000034 c05bbf44 e00ce5f0 000005a8 00000000 00000010 ddbb4170
>        00000000 00000020 000005dc eccdd114 c05bcc29 eccdd000 000005a8 ddbb4170
>        ecd44acc ecd44ae0 dbcc2c5c c05c1139 d2d260b4 df257bf4 dbcc2c5c 00000003
> Call Trace:
>  [<c05bbf44>] skb_checksum+0x111/0x27b
>  [<c05bcc29>] pskb_expand_head+0xd6/0x11a
>  [<c05c1139>] skb_checksum_help+0x64/0xb3
>  [<ee5402ae>] ip_nat_fn+0x42/0x17a [iptable_nat]
>  [<ee5405dd>] ip_nat_local_fn+0x34/0xa3 [iptable_nat]
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05d7480>] nf_iterate+0x30/0x61
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05d75a6>] nf_hook_slow+0x3a/0x90
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05e114f>] ip_queue_xmit+0x3bb/0x40c
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05c1da3>] dev_hard_start_xmit+0x1b4/0x25a
>  [<c042530c>] local_bh_enable+0x5/0x81
>  [<c05c3946>] dev_queue_xmit+0x329/0x357
>  [<c05e19ad>] ip_output+0x22e/0x265
>  [<c046e65c>] __kmalloc+0x7c/0x87
>  [<c046e65c>] __kmalloc+0x7c/0x87
>  [<c05eefba>] tcp_transmit_skb+0x5c7/0x5f5
>  [<c05efd31>] tcp_retransmit_skb+0x4d5/0x5b7
>  [<c05ee802>] tcp_may_send_now+0x3c/0x49
>  [<c05effec>] tcp_xmit_retransmit_queue+0x1d9/0x257
>  [<c05eb2dc>] tcp_ack+0x1573/0x16b5
>  [<c05ee248>] tcp_rcv_established+0x6cd/0x7c5
>  [<c05f3258>] tcp_v4_do_rcv+0x25/0x2b6
>  [<c061e758>] _spin_lock_bh+0x8/0x18
>  [<c05b9bb8>] release_sock+0x44/0x91
>  [<c05ba3a7>] sk_wait_data+0x58/0x98
>  [<c043128b>] autoremove_wake_function+0x0/0x2d
>  [<c05e79d8>] tcp_recvmsg+0x3b6/0x9fa
>  [<c05b968f>] sock_common_recvmsg+0x2f/0x45
>  [<c05b72a1>] sock_recvmsg+0xee/0x141
>  [<c043128b>] autoremove_wake_function+0x0/0x2d
>  [<c04de451>] __make_request+0x319/0x348
>  [<c0410000>] _speedstep_get+0x21/0x4b
>  [<ee58142e>] drbd_recv+0x5a/0xdb [drbd]
>  [<ee581714>] drbd_recv_header+0x10/0x81 [drbd]
>  [<ee581ceb>] drbdd+0x1f/0xe6 [drbd]
>  [<ee584970>] drbdd_init+0xaf/0xf2 [drbd]
>  [<ee593df6>] drbd_thread_setup+0xfe/0x1a2 [drbd]
>  [<ee593cf8>] drbd_thread_setup+0x0/0x1a2 [drbd]
>  [<c0403005>] kernel_thread_helper+0x5/0xb
> =======================
> Code: 9c 13 46 a0 13 46 a4 13 46 a8 13 46 ac 13 46 b0 13 46 b4 13 46 b8 13 46 bc 13 46 c0 13 46 c4 13 46 c8 13 46 cc 13 46 d0 13 46 d4 <13> 46 d8 13 46 dc 13 46 e0 13 46 e4 13 46 e8 13 46 ec 13 46 f0
> EIP: [<c04ecc1e>] csum_partial+0xca/0x120 SS:ESP 0069:dbcc2b4c
>
> Any ideas?

http://www.gossamer-threads.com/lists/drbd/users/17207
http://www.gossamer-threads.com/lists/drbd/users/16962

If the workarounds suggested there don't help, setting the drbd module
parameter disable_sendpage=1 should work in any case. That module parameter
was introduced in response to those threads. It can be set on module load,
or by echo 1 > /sys/module/drbd/parameters/disable_sendpage after the module
is loaded.
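To spell that out (a minimal sketch; the /etc/modprobe.d/drbd.conf file name
is just an example here, the exact modprobe config location varies by distro):

  # set it persistently, picked up the next time the module is loaded
  # (e.g. in /etc/modprobe.d/drbd.conf, or /etc/modprobe.conf on older systems)
  options drbd disable_sendpage=1

  # or flip it at runtime, once the module is already loaded
  echo 1 > /sys/module/drbd/parameters/disable_sendpage

  # check the current value
  cat /sys/module/drbd/parameters/disable_sendpage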
One of those threads ends with the question
"Any chance this could be fixed in DRBD, for example by explicitly cleaning up
the connection before returning I/O complete to Xen?",
which unfortunately must be answered with
"We already do clean up the connection. If that does not help, we are sorry."

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed