On Mon, Dec 22, 2008 at 01:33:13PM +0100, Maros TIMKO wrote:
> Hi all!
>
> We are testing a setup of Xen virtualisation platform using CentOS
> distribution DRBD 8.2.6. We are having kernel panics and reboots of
> the primary node just seconds after we plug out the dedicated DRBD
> (crossover) connection. The failure is occurring all the time when we
> pull out the cable if DRBD devices are primary and Xen VMs are
> running. I thought upgrade/downgrade could solve it, but 8.0.13,
> 8.0.14, 8.2.7, 8.3 are acting exactly the same way. So it seems like
> the failure is not DRBD-related but more into Xen/xenified kernel.
>
> However, I would like to ask the audience if anyone has the same
> experience or if there are some hints, how to solve such issue.
>
> Our setup uses PV -> LVM -> DRBD -> Xen hierarchy.
> Do you think we could solve it if we would change it into PV -> DRBD -> LVM -> Xen?

you have to try.
maybe enabling 8kB kernel stacks will help.
though the stack trace below does not look too long.

> Dell PowerEdge 1950 with 2 Broadcom bnx2 NICs
> CentOS 5.2: Linux 2.6.18-92.1.18.el5xen #1 SMP Wed Nov 12 09:48:10 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> The console output using DRBD 8.3:
> (XEN) Freed 100kB init memory.
> kernel direct mapping tables up to f32be000 @ 1646000-2584000
> PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
> PCI: Not using MMCONFIG.
> Bridge firewalling registered
> virbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
> xenbr0: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.

these are messages about udp fragmentation offloading being enabled
while hardware checksum offloading is not available.
that is an invalid configuration, but should otherwise not be
something to worry about.

> Unable to handle kernel paging request at ffff8800eabba000 RIP:
> [<ffffffff802124c2>] csum_partial+0x219/0x4bc
> PGD 1646067 PUD 1c4a067 PMD 1da0067 PTE 0
> Oops: 0000 [1] SMP
> last sysfs file: /module/drbd/parameters/cn_idx
> CPU 5
> Modules linked in: xt_physdev netloop netbk blktap blkbk bridge
> drbd(U) ipv6 xfrm_nalgo crypto_api ipt_REJECT xt_state xt_tcpudp
> iptable_filter ipt_MASQUERADE iptable_nat ip_nat ip_conntrack
> nfnetlink ip_tables x_tables dm_multipath video sbs backlight i2c_ec
> i2c_core button battery asus_acpi ac parport_pc lp parport ide_cd
> e1000e bnx2 shpchp cdrom i5000_edac edac_mc serio_raw sg pcspkr
> dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata megaraid_sas
> sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> Pid: 0, comm: swapper Tainted: G 2.6.18-92.1.18.el5xen #1
> RIP: e030:[<ffffffff802124c2>] [<ffffffff802124c2>] csum_partial+0x219/0x4bc
> RSP: e02b:ffff880009df3b78 EFLAGS: 00010202
> RAX: 0000000000000006 RBX: 0000000000000000 RCX: ffff8800eabba040
> RDX: 0000000000000000 RSI: 0000000000000588 RDI: ffff8800eabba000
> RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000015
> R10: 0000000000000016 R11: 00000000000000b1 R12: 0000000000000054
> R13: 0000000000000054 R14: ffff8800e9280670 R15: 00000000ce876505
> FS: 00002b4db957e340(0000) GS:ffffffff805af280(0000) knlGS:0000000000000000
> CS: e033 DS: 002b ES: 002b
> Process swapper (pid: 0, threadinfo ffff880001454000, task ffff8800016260c0)
> Stack: 0000000000000588 0000000000000588 ffffffff8023d16d 2ea7df79eaa7c080
> 0000000000000020 00000040ef6760cc ffff8800000005dc 0000000000000001
> ffff8800e9280670 ffff8800ef6760cc
> Call Trace:
> <IRQ> [<ffffffff8023d16d>] skb_checksum+0x123/0x271
> [<ffffffff8040a3d9>] skb_checksum_help+0x71/0xd0

something blows up in the calculating of ip/tcp checksums.
DRBD won't have anything to do with this.

try to toggle on or off all checksum offloading before adding the
interfaces into the xen bridge, maybe that helps?
(ethtool -K ...)

> [<ffffffff8831233e>] :iptable_nat:ip_nat_fn+0x56/0x1c3

does it work without iptables?

and a completely blind guess:
you did not "overoptimize" the kernel, i.e. use compiler optimization
options for features your cpu does not have, or does not have in the
XEN world?

does your cpu run too hot?

> [<ffffffff882ee50d>] :ip_conntrack:ip_conntrack_in+0x374/0x46a
> [<ffffffff883126cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
> [<ffffffff802351ae>] nf_iterate+0x41/0x7d
> [<ffffffff80428040>] dst_output+0x0/0xe
> [<ffffffff802588e4>] nf_hook_slow+0x58/0xbc
> [<ffffffff80428040>] dst_output+0x0/0xe
> [<ffffffff80235662>] ip_queue_xmit+0x431/0x4a1
> [<ffffffff80222990>] tcp_transmit_skb+0x64a/0x682
> [<ffffffff804320f4>] tcp_retransmit_skb+0x53d/0x638
> [<ffffffff8043362a>] tcp_write_timer+0x0/0x699
> [<ffffffff80433aa2>] tcp_write_timer+0x478/0x699
> [<ffffffff80292b1e>] run_timer_softirq+0x13f/0x1c6
> [<ffffffff802127c7>] __do_softirq+0x62/0xde
> [<ffffffff80260da0>] call_softirq+0x1c/0x27c
> [<ffffffff8026dcd2>] do_softirq+0x31/0x98
> [<ffffffff8026db4d>] do_IRQ+0xec/0xf5
> [<ffffffff803a0a98>] evtchn_do_upcall+0x86/0xe0
> [<ffffffff802608d2>] do_hypervisor_callback+0x1e/0x2c
> <EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
> [<ffffffff8026f139>] raw_safe_halt+0x84/0xa8
> [<ffffffff8026c683>] xen_idle+0x38/0x4a
> [<ffffffff8024aa45>] cpu_idle+0x97/0xba
>
>
> Code: 4c 03 07 4c 13 47 08 4c 13 47 10 4c 13 47 18 4c 13 47 20 4c
> RIP [<ffffffff802124c2>] csum_partial+0x219/0x4bc
> RSP <ffff880009df3b78>
> CR2: ffff8800eabba000
> <0>Kernel panic - not syncing: Fatal exception
> (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
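
For readers unfamiliar with the csum_partial / skb_checksum frames in the trace above: these walk the packet buffer to accumulate the ones' complement sum used by IP and TCP checksums. The sketch below is only an illustration of that arithmetic as described in RFC 1071; the function name `internet_checksum` is made up for this example, and it is not the kernel's implementation (which, as far as I know, works on wider words and leaves the folding to its callers).

```python
def internet_checksum(data: bytes) -> int:
    """Ones' complement checksum over `data`, as described in RFC 1071.

    Illustrative only: a hypothetical helper, not kernel code.
    """
    # Pad to an even number of bytes so the data splits into 16-bit words.
    if len(data) % 2:
        data += b"\x00"

    total = 0
    for i in range(0, len(data), 2):
        # Interpret each byte pair as a big-endian 16-bit word and add it.
        total += (data[i] << 8) | data[i + 1]

    # Fold any carries above bit 15 back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)

    # The checksum is the ones' complement of the folded sum.
    return ~total & 0xFFFF


if __name__ == "__main__":
    # Sample bytes from the worked example in RFC 1071.
    sample = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(sample)))  # 0x220d
```

The point for this thread is that every byte of the buffer has to be read, so when the checksum is computed in software, a buffer whose pages are no longer mapped would fault exactly in this loop, which matches the "PTE 0" line and the fault address equal to RDI in the oops.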