Note: "permalinks" may not be as permanent as we would like,
direct links to old sources may well be a few messages off.
On Wed, Jun 02, 2010 at 01:17:40PM +1000, Shane Goulden wrote:
> I've submitted a bug with CentOS about this
> (http://bugs.centos.org/view.php?id=4354) but figured I might as well try
> posting to this list as well.
>
> I'm getting random crashes with a Xen+DRBD setup.

Sorry for the late reply. See below.

> As per the bug report at bugs.centos.org:
>
> This has been happening for quite some time now (long before
> 2.6.18-194.3.1.el5xen).
>
> It happens with this combo:
> xen-3.0.3-94.el5_4.3
> drbd83-8.3.2-6.el5_3
> kmod-drbd83-xen-8.3.2-6.el5_3
>
> And it happened just now with this combo:
> xen-3.0.3-105.el5_5.2
> drbd83-8.3.7-1.el5.centos (from CentOS testing)
> kmod-drbd83-xen-8.3.7-2.el5.centos (from CentOS testing)
>
> I've run hardware diagnostics and it always comes back fine. I don't
> understand the kernel panic so maybe someone can help me out. Any idea
> at all as to what it could be? I have a vmcore crash dump from the last
> crash if more information is required.
>
> This is the kernel bug message:
>
> BUG: unable to handle kernel paging request at virtual address e00ce5f0
> printing eip:
> c04ecc1e
> 204df000 -> *pde = 00000001:0da5e001
> 2105e000 -> *pme = 00000000:3e0f9067
> 000f9000 -> *pte = 00000000:00000000
> Oops: 0000 [#1]
> SMP
> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
> Modules linked in: xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge drbd(U) autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_mirror dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi ac parport_pc lp parport joydev sr_mod 8250_pnp sg serio_raw i5000_edac edac_mc 8250 pcspkr ide_cd serial_core cdrom bnx2 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage ata_piix libata mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> CPU: 5
> EIP: 0061:[<c04ecc1e>] Tainted: G VLI
> EFLAGS: 00010286 (2.6.18-194.3.1.el5xen #1)
> EIP is at csum_partial+0xca/0x120
> eax: 00000000  ebx: c04ecc1e  ecx: 0000000b  edx: 000005a8
> esi: e00ce618  edi: 000005a8  ebp: 00000034  esp: dbcc2b4c
> ds: 007b  es: 007b  ss: 0069
> Process drbd2_receiver (pid: 6635, ti=dbcc2000 task=ed7abaa0 task.ti=dbcc2000)
> Stack: e00ce000 00000034 c05bbf44 e00ce5f0 000005a8 00000000 00000010 ddbb4170
>        00000000 00000020 000005dc eccdd114 c05bcc29 eccdd000 000005a8 ddbb4170
>        ecd44acc ecd44ae0 dbcc2c5c c05c1139 d2d260b4 df257bf4 dbcc2c5c 00000003
> Call Trace:
>  [<c05bbf44>] skb_checksum+0x111/0x27b
>  [<c05bcc29>] pskb_expand_head+0xd6/0x11a
>  [<c05c1139>] skb_checksum_help+0x64/0xb3
>  [<ee5402ae>] ip_nat_fn+0x42/0x17a [iptable_nat]
>  [<ee5405dd>] ip_nat_local_fn+0x34/0xa3 [iptable_nat]
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05d7480>] nf_iterate+0x30/0x61
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05d75a6>] nf_hook_slow+0x3a/0x90
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05e114f>] ip_queue_xmit+0x3bb/0x40c
>  [<c05dee38>] dst_output+0x0/0x7
>  [<c05c1da3>] dev_hard_start_xmit+0x1b4/0x25a
>  [<c042530c>] local_bh_enable+0x5/0x81
>  [<c05c3946>] dev_queue_xmit+0x329/0x357
>  [<c05e19ad>] ip_output+0x22e/0x265
>  [<c046e65c>] __kmalloc+0x7c/0x87
>  [<c046e65c>] __kmalloc+0x7c/0x87
>  [<c05eefba>] tcp_transmit_skb+0x5c7/0x5f5
>  [<c05efd31>] tcp_retransmit_skb+0x4d5/0x5b7
>  [<c05ee802>] tcp_may_send_now+0x3c/0x49
>  [<c05effec>] tcp_xmit_retransmit_queue+0x1d9/0x257
>  [<c05eb2dc>] tcp_ack+0x1573/0x16b5
>  [<c05ee248>] tcp_rcv_established+0x6cd/0x7c5
>  [<c05f3258>] tcp_v4_do_rcv+0x25/0x2b6
>  [<c061e758>] _spin_lock_bh+0x8/0x18
>  [<c05b9bb8>] release_sock+0x44/0x91
>  [<c05ba3a7>] sk_wait_data+0x58/0x98
>  [<c043128b>] autoremove_wake_function+0x0/0x2d
>  [<c05e79d8>] tcp_recvmsg+0x3b6/0x9fa
>  [<c05b968f>] sock_common_recvmsg+0x2f/0x45
>  [<c05b72a1>] sock_recvmsg+0xee/0x141
>  [<c043128b>] autoremove_wake_function+0x0/0x2d
>  [<c04de451>] __make_request+0x319/0x348
>  [<c0410000>] _speedstep_get+0x21/0x4b
>  [<ee58142e>] drbd_recv+0x5a/0xdb [drbd]
>  [<ee581714>] drbd_recv_header+0x10/0x81 [drbd]
>  [<ee581ceb>] drbdd+0x1f/0xe6 [drbd]
>  [<ee584970>] drbdd_init+0xaf/0xf2 [drbd]
>  [<ee593df6>] drbd_thread_setup+0xfe/0x1a2 [drbd]
>  [<ee593cf8>] drbd_thread_setup+0x0/0x1a2 [drbd]
>  [<c0403005>] kernel_thread_helper+0x5/0xb
> =======================
> Code: 9c 13 46 a0 13 46 a4 13 46 a8 13 46 ac 13 46 b0 13 46 b4 13 46 b8 13 46 bc 13 46 c0 13 46 c4 13 46 c8 13 46 cc 13 46 d0 13 46 d4 <13> 46 d8 13 46 dc 13 46 e0 13 46 e4 13 46 e8 13 46 ec 13 46 f0
> EIP: [<c04ecc1e>] csum_partial+0xca/0x120 SS:ESP 0069:dbcc2b4c
>
> Any ideas?

http://www.gossamer-threads.com/lists/drbd/users/17207
http://www.gossamer-threads.com/lists/drbd/users/16962

If the workarounds suggested there don't help, setting the drbd module
parameter disable_sendpage=1 should work in any case. That module parameter
was introduced in response to those threads. It can be set on module load,
or by echo 1 > /sys/module/drbd/parameters/disable_sendpage after the module
is loaded.
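To spell that out (a minimal sketch; the /etc/modprobe.d/drbd.conf file name
is just an example here, the exact modprobe config location varies by distro):

  # set it persistently, picked up the next time the module is loaded
  # (e.g. in /etc/modprobe.d/drbd.conf, or /etc/modprobe.conf on older systems)
  options drbd disable_sendpage=1

  # or flip it at runtime, once the module is already loaded
  echo 1 > /sys/module/drbd/parameters/disable_sendpage

  # check the current value
  cat /sys/module/drbd/parameters/disable_sendpage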
One of those threads ends with the question
"Any chance this could be fixed in DRBD, for example by explicitly cleaning up
the connection before returning I/O complete to Xen?",
which unfortunately must be answered with
"We already do clean up the connection. If that does not help, we are sorry."

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed