[DRBD-user] strange drbd bug

Lars Ellenberg lars.ellenberg at linbit.com
Tue Oct 14 17:08:52 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Tue, Oct 14, 2014 at 03:27:56PM +0200, Matteo Tescione wrote:
> Hi all,
> 
> after almost 1 month I noticed the following strange BUG:

exact DRBD version?
in tree, out of tree?
kernel? patches applied? which ones?

DRBD Configuration and IO stack?
What do you do to trigger it?
How easy can you reproduce?
Logs leading up to this?

The stack traces look unusual.
You happen to try to trick DRBD into doing something it cannot do?

> Discussed this topic with Bart from SCST, he suggested:
> 
> 
> "To me this looks like a DRBD bug. If a timeout of a DRBD request occurs
> the DRBD timeout handler (request_timer_fn()) is called, that function
> grabs a spinlock and next calls kthread_create_on_node() with that
> spinlock held.

Well, no, our request_timer_fn does nothing like that.
At least it is not supposed to, and I don't see a code path
leading to anything like that.

> kthread_create_on_node() can sleep which means that it is
> not allowed to call that function with a spinlock held. I'm not sure why
> this particular behavior has not yet been noticed before. Can you post
> the first call stack from the e-mail at the start of this thread
> together with the kernel version number on a DRBD-related mailing list
> and ask the DRBD authors for their opinion ?"
> 
> 
> Thanks in advance

Traces:

> drbd TOP: Starting receiver thread (from swapper/0 [0])
> BUG: scheduling while atomic: swapper/0/0/0x00000100
> Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O)
> crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding
> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack
> ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
> rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt
> iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801 i2c_core
> lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad ib_core
> acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod crc_t10dif
> crct10dif_common ahci libahci video 3w_9xxx dm_mirror dm_region_hash
> dm_log dm_mod [last unloaded: drbd]
> CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.12.14 #1
> Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0035.030220120927 03/02/2012
>  0000000000000000 ffff8802370039c0 ffffffff8153b1c2 0000000000000000
>  ffff880237012d00 ffff8802370039d0 ffffffff8107ac0b ffff880237003a70
>  ffffffff8153d7dd ffff880237003a00 ffff880230e38480 0000000000000024
> Call Trace:
>  <IRQ>  [<ffffffff8153b1c2>] dump_stack+0x49/0x5f
>  [<ffffffff8107ac0b>] __schedule_bug+0x4b/0x60
>  [<ffffffff8153d7dd>] __schedule+0x6ad/0x6f0
>  [<ffffffff8153d8e9>] schedule+0x29/0x70
>  [<ffffffff8153b375>] schedule_timeout+0x195/0x220
>  [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a
>  [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100
>  [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0
>  [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120
>  [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]
>  [<ffffffff8153b144>] ? printk+0x4d/0x4f
>  [<ffffffff8105b176>] ? __flush_signals+0x36/0x40
>  [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd]
>  [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd]
>  [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd]
>  [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd]
>  [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd]
>  [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40
>  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]
>  [<ffffffff810581d9>] call_timer_fn+0x49/0x120
>  [<ffffffff81058db5>] run_timer_softirq+0x225/0x290
>  [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0
>  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]
>  [<ffffffff81050dbf>] __do_softirq+0xef/0x260
>  [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250
>  [<ffffffff8154905c>] call_softirq+0x1c/0x30
>  [<ffffffff81005805>] do_softirq+0x55/0x90
>  [<ffffffff81050af5>] irq_exit+0xb5/0xc0
>  [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a
>  [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70
>  <EOI>  [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0
>  [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0
>  [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160
>  [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30
>  [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230
>  [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80
>  [<ffffffff81533f07>] rest_init+0x77/0x80
>  [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0
>  [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b
>  [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c
>  [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140
> bad: scheduling from the idle thread!
> CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O 3.12.14 #1
> Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0035.030220120927 03/02/2012
>  ffffffff81a10480 ffff880237003970 ffffffff8153b1c2 0000000000000000
>  ffff880237012d00 ffff880237003990 ffffffff810829f9 ffff8802370039a0
>  ffff880237012d00 ffff8802370039c0 ffffffff8107add3 ffff8802370039c0
> Call Trace:
>  <IRQ>  [<ffffffff8153b1c2>] dump_stack+0x49/0x5f
>  [<ffffffff810829f9>] dequeue_task_idle+0x29/0x40
>  [<ffffffff8107add3>] dequeue_task+0x93/0xb0
>  [<ffffffff8107ae12>] deactivate_task+0x22/0x30
>  [<ffffffff8153d5bc>] __schedule+0x48c/0x6f0
>  [<ffffffff8153d8e9>] schedule+0x29/0x70
>  [<ffffffff8153b375>] schedule_timeout+0x195/0x220
>  [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a
>  [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100
>  [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0
>  [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120
>  [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]
>  [<ffffffff8153b144>] ? printk+0x4d/0x4f
>  [<ffffffff8105b176>] ? __flush_signals+0x36/0x40
>  [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd]
>  [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd]
>  [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd]
>  [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd]
>  [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd]
>  [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40
>  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]
>  [<ffffffff810581d9>] call_timer_fn+0x49/0x120
>  [<ffffffff81058db5>] run_timer_softirq+0x225/0x290
>  [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0
>  [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]
>  [<ffffffff81050dbf>] __do_softirq+0xef/0x260
>  [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250
>  [<ffffffff8154905c>] call_softirq+0x1c/0x30
>  [<ffffffff81005805>] do_softirq+0x55/0x90
>  [<ffffffff81050af5>] irq_exit+0xb5/0xc0
>  [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a
>  [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70
>  <EOI>  [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0
>  [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0
>  [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160
>  [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30
>  [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230
>  [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80
>  [<ffffffff81533f07>] rest_init+0x77/0x80
>  [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0
>  [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b
>  [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c
>  [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: [<          (null)>]           (null)
> PGD 0
> Oops: 0010 [#1] SMP
> Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O) crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad ib_core acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod crc_t10dif crct10dif_common ahci libahci video 3w_9xxx dm_mirror dm_region_hash dm_log dm_mod [last unloaded: drbd]
> CPU: 7 PID: 52888 Comm: kthreadd Tainted: G        W  O 3.12.14 #1
> Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0035.030220120927 03/02/2012
> task: ffff8800a3893540 ti: ffff8800771ac000 task.ti: ffff8800771ac000
> RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
> RSP: 0018:ffff8800771add60  EFLAGS: 00010046
> RAX: ffffffff81613440 RBX: ffff880237012d00 RCX: 000000000000b7a8
> RDX: 0000000000000005 RSI: ffffffff81a10480 RDI: ffff880237012d00
> RBP: ffff8800771add88 R08: 0000000000000000 R09: 00087a835b6f23c0
> R10: ffffffff816133e5 R11: 0000000000000001 R12: ffffffff81a10480
> R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000086
> FS:  0000000000000000(0000) GS:ffff8802370e0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000001a0b000 CR4: 00000000001407e0
> Stack:
>  ffffffff8107ae8a 00007f1bb02036a0 ffffffff81a10480 ffff880237012d00
>  ffffffff81a10c84 ffff8800771add98 ffffffff8107b992 ffff8800771addb8
>  ffffffff8107b9c1 ffff880237012d00 ffffffff81a10480 ffff8800771addd8
> Call Trace:
>  [<ffffffff8107ae8a>] ? enqueue_task+0x6a/0x80
>  [<ffffffff8107b992>] activate_task+0x22/0x30
>  [<ffffffff8107b9c1>] ttwu_activate+0x21/0x50
>  [<ffffffff8107c2bc>] ttwu_do_activate.clone.0+0x3c/0x60
>  [<ffffffff8107fec6>] try_to_wake_up+0x206/0x2c0
>  [<ffffffff8107ff92>] default_wake_function+0x12/0x20
>  [<ffffffff81077659>] __wake_up_common+0x59/0x90
>  [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]
>  [<ffffffff81079666>] complete+0x46/0x60
>  [<ffffffff8106e97e>] kthread+0x9e/0xe0
>  [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70
>  [<ffffffff8154792c>] ret_from_fork+0x7c/0xb0
>  [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70
> Code:  Bad RIP value.
> RIP  [<          (null)>]           (null)
>  RSP <ffff8800771add60>
> CR2: 0000000000000000
> ---[ end trace f6ab84a85de85d6c ]---
> 
> Kernel panic - not syncing: Fatal exception

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed



More information about the drbd-user mailing list