[DRBD-user] LVM->DRBD->Xen Kernel panic after DRBD connection broken

Tue Dec 23 12:47:54 CET 2008

On Mon, Dec 22, 2008 at 01:33:13PM +0100, Maros TIMKO wrote:
> Hi all!
> 
> We are testing a Xen virtualisation setup on the CentOS
> distribution with DRBD 8.2.6. We are getting kernel panics and reboots
> of the primary node just seconds after we unplug the dedicated DRBD
> (crossover) connection. The failure occurs every time we pull out the
> cable while the DRBD devices are primary and the Xen VMs are
> running. I thought an upgrade or downgrade could solve it, but 8.0.13,
> 8.0.14, 8.2.7 and 8.3 behave exactly the same way. So it seems the
> failure is not DRBD-related but rather lies in Xen or the xenified kernel.
>
> However, I would like to ask the list whether anyone has had the same
> experience, or has any hints on how to solve this issue.
> 
> Our setup uses a PV -> LVM -> DRBD -> Xen hierarchy.
> Do you think we could solve it by changing it to PV -> DRBD -> LVM -> Xen?

you will have to try.
maybe enabling 8kB kernel stacks will help,
though the stack trace below does not look too deep.

> Dell PowerEdge 1950 with 2 Broadcom bnx2 NICs
> CentOS 5.2: Linux 2.6.18-92.1.18.el5xen #1 SMP Wed Nov 12 09:48:10 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
> 
> The console output using DRBD 8.3:
> (XEN) Freed 100kB init memory.
> kernel direct mapping tables up to f32be000 @ 1646000-2584000
> PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
> PCI: Not using MMCONFIG.
> Bridge firewalling registered
> virbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
> xenbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.

these messages mean that udp fragmentation offloading (UFO) was enabled
while hardware checksum offloading is not available. that is an invalid
configuration, so UFO gets dropped; it should otherwise not be something
to worry about.
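
to see which offload flags an interface actually advertises, `ethtool -k <iface>`
lists them. below is a minimal sketch, assuming the output contains a
`tx-checksumming:` line and a `udp fragmentation offload:` line (newer ethtool
versions print `udp-fragmentation-offload:`). the function name check_ufo_csum
is made up for this example; it is not part of ethtool.

```sh
#!/bin/sh
# check_ufo_csum: hypothetical helper, not part of ethtool.
# Reads the output of `ethtool -k <iface>` on stdin and reports whether
# UFO is enabled while TX checksum offloading is disabled.
check_ufo_csum() {
    awk '
        # e.g. "tx-checksumming: off"
        /^[ \t]*tx-checksumming:/ { tx = $2 }
        # e.g. "udp fragmentation offload: on" or "udp-fragmentation-offload: on"
        /^[ \t]*udp[- ]fragmentation[- ]offload:/ {
            v = $0
            sub(/^[^:]*:[ \t]*/, "", v)   # strip the label, keep the value
            split(v, a, " ")
            ufo = a[1]
        }
        END {
            if (ufo == "on" && tx == "off")
                print "inconsistent: UFO on, TX checksumming off"
            else
                print "consistent"
        }
    '
}
```

usage, with the bridge name taken from the log above:

    ethtool -k xenbr0 | check_ufo_csum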

> Unable to handle kernel paging request at ffff8800eabba000 RIP: 
>  [<ffffffff802124c2>] csum_partial+0x219/0x4bc
> PGD 1646067 PUD 1c4a067 PMD 1da0067 PTE 0
> Oops: 0000 [1] SMP 
> last sysfs file: /module/drbd/parameters/cn_idx
> CPU 5 
> Modules linked in: xt_physdev netloop netbk blktap blkbk bridge
> drbd(U) ipv6 xfrm_nalgo crypto_api ipt_REJECT xt_state xt_tcpudp
> iptable_filter ipt_MASQUERADE iptable_nat ip_nat ip_conntrack
> nfnetlink ip_tables x_tables dm_multipath video sbs backlight i2c_ec
> i2c_core button battery asus_acpi ac parport_pc lp parport ide_cd
> e1000e bnx2 shpchp cdrom i5000_edac edac_mc serio_raw sg pcspkr
> dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata megaraid_sas
> sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> Pid: 0, comm: swapper Tainted: G      2.6.18-92.1.18.el5xen #1
> RIP: e030:[<ffffffff802124c2>]  [<ffffffff802124c2>] csum_partial+0x219/0x4bc
> RSP: e02b:ffff880009df3b78  EFLAGS: 00010202
> RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff8800eabba040
> RDX: 0000000000000000 RSI: 0000000000000588 RDI: ffff8800eabba000
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000015
> R10: 0000000000000016 R11: 00000000000000b1 R12: 0000000000000054
> R13: 0000000000000054 R14: ffff8800e9280670 R15: 00000000ce876505
> FS:  00002b4db957e340(0000) GS:ffffffff805af280(0000) knlGS:0000000000000000
> CS:  e033 DS: 002b ES: 002b
> Process swapper (pid: 0, threadinfo ffff880001454000, task ffff8800016260c0)
> Stack:  0000000000000588  0000000000000588  ffffffff8023d16d  2ea7df79eaa7c080 
>  0000000000000020  00000040ef6760cc  ffff8800000005dc  0000000000000001 
>  ffff8800e9280670  ffff8800ef6760cc 
> Call Trace:
>  <IRQ>  [<ffffffff8023d16d>] skb_checksum+0x123/0x271
>  [<ffffffff8040a3d9>] skb_checksum_help+0x71/0xd0

something blows up while calculating the ip/tcp checksums.
DRBD won't have anything to do with this.

try toggling all checksum offloading on or off (ethtool -K ...) before
adding the interfaces to the xen bridge; maybe that helps?
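
a rough sketch of such a toggle, assuming the NICs are called eth0 and eth1
(the names are a guess, adjust them to your bnx2 ports). the helpers `run` and
`set_csum_offload` are made up for this example; only `ethtool -K <iface> rx
on|off tx on|off` is the real command. by default it just prints what it would do.

```sh
#!/bin/sh
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_csum_offload <on|off> <iface>...   (hypothetical helper)
# Switches RX and TX checksum offloading for each interface given.
set_csum_offload() {
    state="$1"
    shift
    for ifc in "$@"; do
        run ethtool -K "$ifc" rx "$state" tx "$state"
    done
}
```

usage, to be run before the xen network scripts put the interfaces into the bridge:

    set_csum_offload off eth0 eth1              # print only
    DRY_RUN=0; set_csum_offload off eth0 eth1   # really apply, needs root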

>  [<ffffffff8831233e>] :iptable_nat:ip_nat_fn+0x56/0x1c3

does it work without iptables?
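
one way to test that, sketched under the assumption that this is a stock
CentOS 5 box where `service iptables stop` flushes the rules and unloads the
netfilter modules. the nat hooks seen in the trace come from the iptable_nat
module, so for a clean test that module should be gone, not just its rules.
the function name and the backup path are made up; libvirt or the xen scripts
may add rules again later. by default it only prints what it would do.

```sh
#!/bin/sh
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# disable_netfilter_for_test <backup-file>   (hypothetical helper)
# Saves the current rules, then stops the iptables service.
disable_netfilter_for_test() {
    backup="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "iptables-save > $backup"
        echo "service iptables stop"
    else
        iptables-save > "$backup"
        service iptables stop
    fi
}
```

usage:

    DRY_RUN=0; disable_netfilter_for_test /root/iptables.before-test
    lsmod | grep -E 'iptable_nat|ip_conntrack'    # should print nothing
    iptables-restore < /root/iptables.before-test # afterwards, to restore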

and a completely blind guess: you did not "overoptimize" the kernel, did
you? i.e. build it with compiler optimization options for features your
cpu does not have, or does not have in the Xen world?

does your cpu run too hot?

>  [<ffffffff882ee50d>] :ip_conntrack:ip_conntrack_in+0x374/0x46a
>  [<ffffffff883126cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
>  [<ffffffff802351ae>] nf_iterate+0x41/0x7d
>  [<ffffffff80428040>] dst_output+0x0/0xe
>  [<ffffffff802588e4>] nf_hook_slow+0x58/0xbc
>  [<ffffffff80428040>] dst_output+0x0/0xe
>  [<ffffffff80235662>] ip_queue_xmit+0x431/0x4a1
>  [<ffffffff80222990>] tcp_transmit_skb+0x64a/0x682
>  [<ffffffff804320f4>] tcp_retransmit_skb+0x53d/0x638
>  [<ffffffff8043362a>] tcp_write_timer+0x0/0x699
>  [<ffffffff80433aa2>] tcp_write_timer+0x478/0x699
>  [<ffffffff80292b1e>] run_timer_softirq+0x13f/0x1c6
>  [<ffffffff802127c7>] __do_softirq+0x62/0xde
>  [<ffffffff80260da0>] call_softirq+0x1c/0x27c
>  [<ffffffff8026dcd2>] do_softirq+0x31/0x98
>  [<ffffffff8026db4d>] do_IRQ+0xec/0xf5
>  [<ffffffff803a0a98>] evtchn_do_upcall+0x86/0xe0
>  [<ffffffff802608d2>] do_hypervisor_callback+0x1e/0x2c
>  <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
>  [<ffffffff8026f139>] raw_safe_halt+0x84/0xa8
>  [<ffffffff8026c683>] xen_idle+0x38/0x4a
>  [<ffffffff8024aa45>] cpu_idle+0x97/0xba
> 
> 
> Code: 4c 03 07 4c 13 47 08 4c 13 47 10 4c 13 47 18 4c 13 47 20 4c 
> RIP  [<ffffffff802124c2>] csum_partial+0x219/0x4bc
>  RSP <ffff880009df3b78>
> CR2: ffff8800eabba000
>  <0>Kernel panic - not syncing: Fatal exception
>  (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed