[DRBD-user] strange drbd bug

Matteo Tescione matteo at RMnet.it
Tue Oct 14 20:48:05 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Lars,

 
> exact DRBD version?
> in tree, out of tree?
> kernel? patches applied? which ones?
> 
> DRBD Configuration and IO stack?
> What do you do to trigger it?
> How easy can you reproduce?
> Logs leading up to this?

8.4.4 downloaded from official repository, compiled out of tree. Kernel is 3.12.14 with only scst req_fifo patch & iscsi_putpage_callback patch applied.

I do nothing, the 2 box are interconnected with 40-gbit back-to-back infiniband qle7342, but I seen several disconnection event repeating random even with no IO, and noticed that once in a month the primary box crashes. So I turned on the console logging and captured the trace you see below.

As I said, occurs only once in 30-40 days.

Logs before this panic:







drbd TOP: Restarting receiver thread

drbd TOP: receiver (re)started

drbd TOP: conn( Unconnected -> WFConnection )

drbd TOP: helper command: /sbin/drbdadm fence-peer TOP exit code 4 (0x400)

drbd TOP: fence-peer helper returned 4 (peer was fenced)

drbd TOP: pdsk( DUnknown -> Outdated )

drbd TOP: Handshake successful: Agreed network protocol version 101

drbd TOP: Agreed to support TRIM on protocol level

drbd TOP: Peer authenticated using 20 bytes HMAC

drbd TOP: conn( WFConnection -> WFReportParams )

drbd TOP: Starting asender thread (from drbd_r_TOP [4385])

block drbd0: drbd_sync_handshake:

block drbd0: self 34B181D069424F63:3921A73508F919D7:673BF468C9830995:673AF468C9830995 bits:59293 flags:0

block drbd0: peer 3921A73508F919D6:0000000000000000:673BF468C9830994:673AF468C9830995 bits:0 flags:0

block drbd0: uuid_compare()=1 by rule 70

block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )

block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%

block drbd1: drbd_sync_handshake:

block drbd1: self D532468B0E14258D:344C01173D6B201D:659C03CE91D88EBB:659B03CE91D88EBB bits:44 flags:0

block drbd1: peer 344C01173D6B201C:0000000000000000:659C03CE91D88EBA:659B03CE91D88EBB bits:0 flags:0

block drbd1: uuid_compare()=1 by rule 70

block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )

block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%

block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%

block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0

block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)

block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )

block drbd0: Began resync as SyncSource (will sync 237172 KB [59293 bits set]).

block drbd0: updated sync UUID 34B181D069424F63:3922A73508F919D7:3921A73508F919D7:673BF468C9830995

block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%

block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1

block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)

block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )

block drbd1: Began resync as SyncSource (will sync 196 KB [49 bits set]).

block drbd1: updated sync UUID D532468B0E14258D:344D01173D6B201D:344C01173D6B201D:659C03CE91D88EBB

block drbd1: Resync done (total 1 sec; paused 0 sec; 196 K/sec)

block drbd1: updated UUIDs D532468B0E14258D:0000000000000000:344D01173D6B201D:344C01173D6B201D

block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

block drbd0: Resync done (total 6 sec; paused 0 sec; 39528 K/sec)

block drbd0: updated UUIDs 34B181D069424F63:0000000000000000:3922A73508F919D7:3921A73508F919D7

block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

block drbd1: Remote failed to finish a request within ko-count * timeout

block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

drbd TOP: asender terminated

drbd TOP: Terminating drbd_a_TOP

block drbd1: md_sync_timer expired! Worker calls drbd_md_sync().

block drbd1: new current UUID 3C68CD64811C1C41:D532468B0E14258D:344D01173D6B201D:344C01173D6B201D

drbd TOP: Connection closed

block drbd0: State change failed: Refusing to be Primary while peer is not outdated

block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }

block drbd0:  wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }

drbd TOP: State change failed: Refusing to be Primary while peer is not outdated

drbd TOP:  mask = 0x1f0 val = 0x20

drbd TOP:  old_conn:WFReportParams wanted_conn:Unconnected

drbd TOP: Expected cstate < C_WF_REPORT_PARAMS

drbd TOP: receiver terminated

drbd TOP: Restarting receiver thread

drbd TOP: receiver (re)started

block drbd0: State change failed: Refusing to be Primary while peer is not outdated

block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }

block drbd0:  wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r----- }

drbd TOP: State change failed: Refusing to be Primary while peer is not outdated

drbd TOP:  mask = 0x1f0 val = 0x80

drbd TOP:  old_conn:WFReportParams wanted_conn:WFConnection

drbd TOP: Connection closed

block drbd0: State change failed: Refusing to be Primary while peer is not outdated

block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }

block drbd0:  wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }

drbd TOP: State change failed: Refusing to be Primary while peer is not outdated

drbd TOP:  mask = 0x1f0 val = 0x20

drbd TOP:  old_conn:WFReportParams wanted_conn:Unconnected

drbd TOP: Expected cstate < C_WF_REPORT_PARAMS

drbd TOP: receiver terminated

drbd TOP: Terminating drbd_r_TOP

block drbd0: Remote failed to finish a request within ko-count * timeout

block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )

drbd TOP: Starting receiver thread (from swapper/0 [0])

BUG: scheduling while atomic: swapper/0/0/0x00000100



Best Regards,


--
matteo

> The stack traces look unusual.
> You happen to try to trick DRBD into doing something it cannot do?
> 
> > Discussed this topic with Bart from SCST, he suggested:
> > 
> > 
> > "To me this looks like a DRBD bug. If a timeout of a DRBD request
> > occurs
> > the DRBD timeout handler (request_timer_fn()) is called, that
> > function
> > grabs a spinlock and next calls kthread_create_on_node() with that
> > spinlock held.
> 
> Well, no, our request_timer_fn does nothing like that.
> At least it is not supposed to, and I don't see a code path
> leading to anything like that.
> 
> > kthread_create_on_node() can sleep which means that it is
> > not allowed to call that function with a spinlock held. I'm not
> > sure why
> > this particular behavior has not yet been noticed before. Can you
> > post
> > the first call stack from the e-mail at the start of this thread
> > together with the kernel version number on a DRBD-related mailing
> > list
> > and ask the DRBD authors for their opinion ?"
> > 
> > 
> > Thanks in advance
> 
> Traces:
> 
> > drbd TOP: Starting receiver thread (from swapper/0 [0])
> > BUG: scheduling while atomic: swapper/0/0/0x00000100
> > Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O)
> > crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding
> > ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack
> > ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs
> > ib_umad
> > rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt
> > iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801 i2c_core
> > lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad ib_core
> > acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod crc_t10dif
> > crct10dif_common ahci libahci video 3w_9xxx dm_mirror
> > dm_region_hash
> > dm_log dm_mod [last unloaded: drbd]
> > CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.12.14 #1
> > Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS
> > S1200BT.86B.02.00.0035.030220120927 03/02/2012
> >  0000000000000000 ffff8802370039c0 ffffffff8153b1c2
> >  0000000000000000
> >  ffff880237012d00 ffff8802370039d0 ffffffff8107ac0b
> >  ffff880237003a70
> >  ffffffff8153d7dd ffff880237003a00 ffff880230e38480
> >  0000000000000024
> > Call Trace:
> >  <IRQ>  [<ffffffff8153b1c2>] dump_stack+0x49/0x5f
> >  [<ffffffff8107ac0b>] __schedule_bug+0x4b/0x60
> >  [<ffffffff8153d7dd>] __schedule+0x6ad/0x6f0
> >  [<ffffffff8153d8e9>] schedule+0x29/0x70
> >  [<ffffffff8153b375>] schedule_timeout+0x195/0x220
> >  [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a
> >  [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100
> >  [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0
> >  [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120
> >  [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]
> >  [<ffffffff8153b144>] ? printk+0x4d/0x4f
> >  [<ffffffff8105b176>] ? __flush_signals+0x36/0x40
> >  [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd]
> >  [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd]
> >  [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd]
> >  [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd]
> >  [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd]
> >  [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40
> >  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0
> >  [drbd]
> >  [<ffffffff810581d9>] call_timer_fn+0x49/0x120
> >  [<ffffffff81058db5>] run_timer_softirq+0x225/0x290
> >  [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0
> >  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0
> >  [drbd]
> >  [<ffffffff81050dbf>] __do_softirq+0xef/0x260
> >  [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250
> >  [<ffffffff8154905c>] call_softirq+0x1c/0x30
> >  [<ffffffff81005805>] do_softirq+0x55/0x90
> >  [<ffffffff81050af5>] irq_exit+0xb5/0xc0
> >  [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a
> >  [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70
> >  <EOI>  [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0
> >  [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0
> >  [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160
> >  [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30
> >  [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230
> >  [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80
> >  [<ffffffff81533f07>] rest_init+0x77/0x80
> >  [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0
> >  [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b
> >  [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c
> >  [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140
> > bad: scheduling from the idle thread!
> > CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O 3.12.14 #1
> > Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS
> > S1200BT.86B.02.00.0035.030220120927 03/02/2012
> >  ffffffff81a10480 ffff880237003970 ffffffff8153b1c2
> >  0000000000000000
> >  ffff880237012d00 ffff880237003990 ffffffff810829f9
> >  ffff8802370039a0
> >  ffff880237012d00 ffff8802370039c0 ffffffff8107add3
> >  ffff8802370039c0
> > Call Trace:
> >  <IRQ>  [<ffffffff8153b1c2>] dump_stack+0x49/0x5f
> >  [<ffffffff810829f9>] dequeue_task_idle+0x29/0x40
> >  [<ffffffff8107add3>] dequeue_task+0x93/0xb0
> >  [<ffffffff8107ae12>] deactivate_task+0x22/0x30
> >  [<ffffffff8153d5bc>] __schedule+0x48c/0x6f0
> >  [<ffffffff8153d8e9>] schedule+0x29/0x70
> >  [<ffffffff8153b375>] schedule_timeout+0x195/0x220
> >  [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a
> >  [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100
> >  [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0
> >  [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120
> >  [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]
> >  [<ffffffff8153b144>] ? printk+0x4d/0x4f
> >  [<ffffffff8105b176>] ? __flush_signals+0x36/0x40
> >  [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd]
> >  [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd]
> >  [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd]
> >  [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd]
> >  [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd]
> >  [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40
> >  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0
> >  [drbd]
> >  [<ffffffff810581d9>] call_timer_fn+0x49/0x120
> >  [<ffffffff81058db5>] run_timer_softirq+0x225/0x290
> >  [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0
> >  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0
> >  [drbd]
> >  [<ffffffff81050dbf>] __do_softirq+0xef/0x260
> >  [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250
> >  [<ffffffff8154905c>] call_softirq+0x1c/0x30
> >  [<ffffffff81005805>] do_softirq+0x55/0x90
> >  [<ffffffff81050af5>] irq_exit+0xb5/0xc0
> >  [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a
> >  [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70
> >  <EOI>  [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0
> >  [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0
> >  [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160
> >  [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30
> >  [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230
> >  [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80
> >  [<ffffffff81533f07>] rest_init+0x77/0x80
> >  [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0
> >  [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b
> >  [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c
> >  [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140
> > BUG: unable to handle kernel NULL pointer dereference at
> >           (null)
> > IP: [<          (null)>]           (null)
> > PGD 0
> > Oops: 0010 [#1] SMP
> > Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O)
> > crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding
> > ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack
> > ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs
> > ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt
> > iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801
> > i2c_core lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad
> > ib_core acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod
> > crc_t10dif crct10dif_common ahci libahci video 3w_9xxx dm_mirror
> > dm_region_hash dm_log dm_mod [last unloaded: drbd]
> > CPU: 7 PID: 52888 Comm: kthreadd Tainted: G        W  O 3.12.14 #1
> > Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS
> > S1200BT.86B.02.00.0035.030220120927 03/02/2012
> > task: ffff8800a3893540 ti: ffff8800771ac000 task.ti:
> > ffff8800771ac000
> > RIP: 0010:[<0000000000000000>]  [<          (null)>]
> >           (null)
> > RSP: 0018:ffff8800771add60  EFLAGS: 00010046
> > RAX: ffffffff81613440 RBX: ffff880237012d00 RCX: 000000000000b7a8
> > RDX: 0000000000000005 RSI: ffffffff81a10480 RDI: ffff880237012d00
> > RBP: ffff8800771add88 R08: 0000000000000000 R09: 00087a835b6f23c0
> > R10: ffffffff816133e5 R11: 0000000000000001 R12: ffffffff81a10480
> > R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000086
> > FS:  0000000000000000(0000) GS:ffff8802370e0000(0000)
> > knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000000 CR3: 0000000001a0b000 CR4: 00000000001407e0
> > Stack:
> >  ffffffff8107ae8a 00007f1bb02036a0 ffffffff81a10480
> >  ffff880237012d00
> >  ffffffff81a10c84 ffff8800771add98 ffffffff8107b992
> >  ffff8800771addb8
> >  ffffffff8107b9c1 ffff880237012d00 ffffffff81a10480
> >  ffff8800771addd8
> > Call Trace:
> >  [<ffffffff8107ae8a>] ? enqueue_task+0x6a/0x80
> >  [<ffffffff8107b992>] activate_task+0x22/0x30
> >  [<ffffffff8107b9c1>] ttwu_activate+0x21/0x50
> >  [<ffffffff8107c2bc>] ttwu_do_activate.clone.0+0x3c/0x60
> >  [<ffffffff8107fec6>] try_to_wake_up+0x206/0x2c0
> >  [<ffffffff8107ff92>] default_wake_function+0x12/0x20
> >  [<ffffffff81077659>] __wake_up_common+0x59/0x90
> >  [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]
> >  [<ffffffff81079666>] complete+0x46/0x60
> >  [<ffffffff8106e97e>] kthread+0x9e/0xe0
> >  [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70
> >  [<ffffffff8154792c>] ret_from_fork+0x7c/0xb0
> >  [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70
> > Code:  Bad RIP value.
> > RIP  [<          (null)>]           (null)
> >  RSP <ffff8800771add60>
> > CR2: 0000000000000000
> > ---[ end trace f6ab84a85de85d6c ]---
> > 
> > Kernel panic - not syncing: Fatal exception
> 
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list   --   I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
> 



More information about the drbd-user mailing list