[DRBD-user] Kernel panic in skb_copy_bits

Valentin Vidic Valentin.Vidic at CARNet.hr
Tue Mar 10 13:55:55 CET 2009


Hi,

Recently I've started getting the following kernel panic:

[  366.834266] drbd2: peer( Secondary -> Primary ) 
[  383.920322] drbd1: PingAck did not arrive in time.
[  383.925828] drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 ) 
[  383.941857] drbd1: asender terminated
[  383.943390] drbd1: short read expecting header on sock: r=-512
[  383.958256] drbd1: Terminating asender thread
[  383.967285] drbd1: Creating new current UUID
[  383.975075] drbd1: Connection closed
[  383.979220] drbd1: helper command: /sbin/drbdadm outdate-peer minor-1
[  383.993212] drbd1: helper command: /sbin/drbdadm outdate-peer minor-1 exit code 5 (0x500)
[  384.003508] drbd1: outdate-peer helper returned 5 (peer is unreachable, assumed to be dead)
[  384.013728] drbd1: pdsk( DUnknown -> Outdated ) 
[  384.020778] drbd1: susp( 1 -> 0 ) 
[  384.031938] drbd1: conn( NetworkFailure -> Unconnected ) 
[  384.038132] drbd1: receiver terminated
[  384.042484] drbd1: Restarting receiver thread
[  384.047507] drbd1: receiver (re)started
[  384.051945] drbd1: conn( Unconnected -> WFConnection ) 
[  388.193364] drbd2: PingAck did not arrive in time.
[  388.198874] drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
[  388.218319] drbd2: asender terminated
[  388.218336] drbd2: short read expecting header on sock: r=-512
[  388.229284] drbd2: Terminating asender thread
[  388.237285] drbd2: Connection closed
[  388.241294] drbd2: conn( NetworkFailure -> Unconnected ) 
[  388.246663] drbd2: receiver terminated
[  388.250993] drbd2: Restarting receiver thread
[  388.256014] drbd2: receiver (re)started
[  388.260454] drbd2: conn( Unconnected -> WFConnection ) 
[  390.258485] drbd0: PingAck did not arrive in time.
[  390.264035] drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 ) 
[  390.279992] drbd0: asender terminated
[  390.279999] drbd0: short read expecting header on sock: r=-512
[  390.287996] drbd0: Creating new current UUID
[  390.298491] drbd0: Terminating asender thread
[  390.303527] drbd0: Connection closed
[  390.307532] drbd0: helper command: /sbin/drbdadm outdate-peer minor-0
[  390.322060] drbd0: helper command: /sbin/drbdadm outdate-peer minor-0 exit code 5 (0x500)
[  390.332083] drbd0: outdate-peer helper returned 5 (peer is unreachable, assumed to be dead)
[  390.347642] drbd0: pdsk( DUnknown -> Outdated ) 
[  390.352958] drbd0: susp( 1 -> 0 ) 
[  390.357022] drbd0: conn( NetworkFailure -> Unconnected ) 
[  390.363215] drbd0: receiver terminated
[  390.370564] drbd0: Restarting receiver thread
[  390.375561] drbd0: receiver (re)started
[  390.379983] drbd0: conn( Unconnected -> WFConnection ) 
[  393.705742] device vif4.0 entered promiscuous mode
[  393.719435] xenbr0: port 2(vif4.0) entering learning state
[  393.725380] drbd2: susp( 0 -> 1 ) 
[  393.725411] drbd2: helper command: /sbin/drbdadm outdate-peer minor-2
[  393.732256] drbd2: helper command: /sbin/drbdadm outdate-peer minor-2 exit code 5 (0x500)
[  393.732266] drbd2: outdate-peer helper returned 5 (peer is unreachable, assumed to be dead)
[  393.732280] drbd2: role( Secondary -> Primary ) pdsk( DUnknown -> Outdated ) 
[  393.732304] drbd2: susp( 1 -> 0 ) 
[  393.732497] drbd2: Creating new current UUID
[  393.803216] xenbr0: topology change detected, propagating
[  393.810787] xenbr0: port 2(vif4.0) entering forwarding state
[  394.209915] blkback: ring-ref 8, event-channel 20, protocol 1 (x86_64-abi)
[  395.180474] BUG: unable to handle kernel paging request at ffff88001ddbf000
[  395.188634] IP: [<ffffffff803c0011>] skb_copy_bits+0x139/0x216
[  395.188634] PGD 1a6a067 PUD 1a6b067 PMD 1b5a067 PTE 0
[  395.200918] Oops: 0000 [1] SMP 
[  395.200918] CPU 0 
[  395.200918] Modules linked in: xt_physdev sha1_generic drbd cn ipt_REJECT xt_tcpudp xt_multiport nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ipv6 bridge bonding dm_mod ipmi_si ipmi_devintf ipmi_msghandler 8021q loop parport_pc psmouse usbhid rng_core i2c_i801 hid parport container i2c_core serio_raw ff_memless button pcspkr iTCO_wdt i5000_edac edac_core shpchp pci_hotplug evdev ext3 jbd mbcache sg sr_mod cdrom ide_pci_generic ide_core ata_piix sd_mod ses enclosure floppy ata_generic libata bnx2 dock ehci_hcd firmware_class uhci_hcd megaraid_sas scsi_mod thermal processor fan thermal_sys
[  395.280972] Pid: 0, comm: swapper Not tainted 2.6.26-1-xen-amd64 #1
[  395.280972] RIP: e030:[<ffffffff803c0011>]  [<ffffffff803c0011>] skb_copy_bits+0x139/0x216
[  395.280972] RSP: e02b:ffffffff80595c40  EFLAGS: 00010286
[  395.280972] RAX: 0000000000000062 RBX: ffff88001e4623a8 RCX: 0000000000000588
[  395.280972] RDX: ffff880001554800 RSI: ffff88001ddbf000 RDI: ffff8800015541a0
[  395.280972] RBP: 0000000000000588 R08: 0000000000000588 R09: 00000000000005ea
[  395.280972] R10: ffff880080000000 R11: ffff88001e4623a8 R12: 0000000000000062
[  395.280972] R13: 0000000000000001 R14: 0000000000000062 R15: ffff8800015541a0
[  395.280972] FS:  00007f23afba4730(0000) GS:ffffffff80539000(0000) knlGS:0000000000000000
[  395.280972] CS:  e033 DS: 0000 ES: 0000
[  395.280972] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  395.280972] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  395.280972] Process swapper (pid: 0, threadinfo ffffffff80552000, task ffffffff804fe460)
[  395.280972] Stack:  ffff880001554000 ffffffff803c0c2c 0000000000000010 ffff88001e136000
[  395.280972]  ffff88001e4623a8 ffff88001e4623a8 0000000000000000 0000000000000000
[  395.280972]  00000000000005a8 ffffffff803c0d17 0000000200000000 ffff88001e136000
[  395.280972] Call Trace:
[  395.280972]  <IRQ>  [<ffffffff803c0c2c>] ? pskb_expand_head+0xde/0x143
[  395.280972]  [<ffffffff803c0d17>] ? __pskb_pull_tail+0x86/0x290
[  395.280972]  [<ffffffff803c8a60>] ? dev_queue_xmit+0x153/0x3eb
[  395.280972]  [<ffffffff803e9dbd>] ? ip_queue_xmit+0x29a/0x2ed
[  395.280972]  [<ffffffff8040a69b>] ? inet_sk_rebuild_header+0xf2/0x32f
[  395.280972]  [<ffffffff8020e7b4>] ? get_nsec_offset+0x9/0x2c
[  395.280972]  [<ffffffff8020e7b4>] ? get_nsec_offset+0x9/0x2c
[  395.280972]  [<ffffffff8020e810>] ? local_clock+0x39/0x83
[  395.280972]  [<ffffffff803f9d18>] ? tcp_transmit_skb+0x731/0x76e
[  395.280972]  [<ffffffff8020e8e3>] ? sched_clock+0x15/0x36
[  395.280972]  [<ffffffff803fa9f7>] ? tcp_retransmit_skb+0x4a7/0x5b1
[  395.280972]  [<ffffffff803fd039>] ? tcp_write_timer+0x557/0x77e
[  395.280972]  [<ffffffff8020ee50>] ? timer_interrupt+0x401/0x415
[  395.280972]  [<ffffffff80235dd6>] ? __mod_timer+0xd4/0xe3
[  395.280972]  [<ffffffff803fcae2>] ? tcp_write_timer+0x0/0x77e
[  395.280972]  [<ffffffff8023569f>] ? run_timer_softirq+0x190/0x237
[  395.280972]  [<ffffffff80231c94>] ? __do_softirq+0x77/0x103
[  395.280972]  [<ffffffff8020c13c>] ? call_softirq+0x1c/0x28
[  395.280972]  [<ffffffff8020e08a>] ? do_softirq+0x55/0xbb
[  395.280972]  [<ffffffff8020e16d>] ? do_IRQ+0x7d/0x9a
[  395.280972]  [<ffffffff8037d6c4>] ? evtchn_do_upcall+0x13c/0x1fc
[  395.280972]  [<ffffffff8020bbde>] ? do_hypervisor_callback+0x1e/0x30
[  395.280972]  <EOI>  [<ffffffff8020e795>] ? xen_safe_halt+0x90/0xa6
[  395.280972]  [<ffffffff8020a0c8>] ? xen_idle+0x2e/0x66
[  395.280972]  [<ffffffff80209cd6>] ? cpu_idle+0x97/0xb9
[  395.280972] 
[  395.280972] 
[  395.280972] Code: 00 00 00 00 00 88 ff ff 48 c1 e6 0c 48 01 c6 8b 83 c8 00 00 00 8b 44 07 20 4c 89 ff 48 01 c6 49 63 c4 48 01 c6 49 63 c6 48 29 c6 <f3> a4 65 48 8b 04 25 10 00 00 00 ff 88 44 e0 ff ff 44 29 c5 0f 
[  395.280972] RIP  [<ffffffff803c0011>] skb_copy_bits+0x139/0x216
[  395.280972]  RSP <ffffffff80595c40>
[  395.280972] CR2: ffff88001ddbf000
[  395.280972] ---[ end trace 3b4a27cfc0a95b4b ]---
[  395.280972] Kernel panic - not syncing: Aiee, killing interrupt handler!

This happens on a two-node cluster running Xen on top of DRBD, managed by
Heartbeat:
  * node1 with drbd0 and drbd1 as Primary
  * node2 with drbd2 as Primary

The output above appears when I kill node2. Device drbd2 is then
activated on node1 and the domU is started on it, but right after that
the kernel panics and node1 reboots as well. The problem is reproducible,
and the kernel panic output is always the same.

I get this with different versions of Debian kernels (2.6.18, 2.6.26),
Xen (3.0.3, 3.2) and DRBD (8.0.13, 8.0.14).

Any ideas what could be causing this? Could it be that node1 is
trying to retransmit a DRBD/TCP packet on a broken connection to
node2?

-- 
Valentin


