[DRBD-user] strange drbd bug

Matteo Tescione matteo at RMnet.it
Tue Oct 14 15:27:56 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi all,

after almost 1 month I noticed the following strange BUG:


drbd TOP: Starting receiver thread (from swapper/0 [0])

BUG: scheduling while atomic: swapper/0/0/0x00000100

Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O) crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad ib_core acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod crc_t10dif crct10dif_common ahci libahci video 3w_9xxx dm_mirror dm_region_hash dm_log dm_mod [last unloaded: drbd]

CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.12.14 #1

Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0035.030220120927 03/02/2012

 0000000000000000 ffff8802370039c0 ffffffff8153b1c2 0000000000000000

 ffff880237012d00 ffff8802370039d0 ffffffff8107ac0b ffff880237003a70

 ffffffff8153d7dd ffff880237003a00 ffff880230e38480 0000000000000024

Call Trace:

 <IRQ>  [<ffffffff8153b1c2>] dump_stack+0x49/0x5f

 [<ffffffff8107ac0b>] __schedule_bug+0x4b/0x60

 [<ffffffff8153d7dd>] __schedule+0x6ad/0x6f0

 [<ffffffff8153d8e9>] schedule+0x29/0x70

 [<ffffffff8153b375>] schedule_timeout+0x195/0x220

 [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a

 [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100

 [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0

 [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120

 [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]

 [<ffffffff8153b144>] ? printk+0x4d/0x4f

 [<ffffffff8105b176>] ? __flush_signals+0x36/0x40

 [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd]

 [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd]

 [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd]

 [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd]

 [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd]

 [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40

 [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]

 [<ffffffff810581d9>] call_timer_fn+0x49/0x120

 [<ffffffff81058db5>] run_timer_softirq+0x225/0x290

 [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0

 [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]

 [<ffffffff81050dbf>] __do_softirq+0xef/0x260

 [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250

 [<ffffffff8154905c>] call_softirq+0x1c/0x30

 [<ffffffff81005805>] do_softirq+0x55/0x90

 [<ffffffff81050af5>] irq_exit+0xb5/0xc0

 [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a

 [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70

 <EOI>  [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0

 [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0

 [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160

 [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30

 [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230

 [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80

 [<ffffffff81533f07>] rest_init+0x77/0x80

 [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0

 [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b

 [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c

 [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140

bad: scheduling from the idle thread!

CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O 3.12.14 #1

Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0035.030220120927 03/02/2012

 ffffffff81a10480 ffff880237003970 ffffffff8153b1c2 0000000000000000

 ffff880237012d00 ffff880237003990 ffffffff810829f9 ffff8802370039a0

 ffff880237012d00 ffff8802370039c0 ffffffff8107add3 ffff8802370039c0

Call Trace:

 <IRQ>  [<ffffffff8153b1c2>] dump_stack+0x49/0x5f

 [<ffffffff810829f9>] dequeue_task_idle+0x29/0x40

 [<ffffffff8107add3>] dequeue_task+0x93/0xb0

 [<ffffffff8107ae12>] deactivate_task+0x22/0x30

 [<ffffffff8153d5bc>] __schedule+0x48c/0x6f0

 [<ffffffff8153d8e9>] schedule+0x29/0x70

 [<ffffffff8153b375>] schedule_timeout+0x195/0x220

 [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a

 [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100

 [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0

 [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120

 [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]

 [<ffffffff8153b144>] ? printk+0x4d/0x4f

 [<ffffffff8105b176>] ? __flush_signals+0x36/0x40

 [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd]

 [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd]

 [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd]

 [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd]

 [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd]

 [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40

 [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]

 [<ffffffff810581d9>] call_timer_fn+0x49/0x120

 [<ffffffff81058db5>] run_timer_softirq+0x225/0x290

 [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0

 [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 [drbd]

 [<ffffffff81050dbf>] __do_softirq+0xef/0x260

 [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250

 [<ffffffff8154905c>] call_softirq+0x1c/0x30

 [<ffffffff81005805>] do_softirq+0x55/0x90

 [<ffffffff81050af5>] irq_exit+0xb5/0xc0

 [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a

 [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70

 <EOI>  [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0

 [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0

 [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160

 [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30

 [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230

 [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80

 [<ffffffff81533f07>] rest_init+0x77/0x80

 [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0

 [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b

 [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c

 [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140

BUG: unable to handle kernel NULL pointer dereference at           (null)

IP: [<          (null)>]           (null)

PGD 0

Oops: 0010 [#1] SMP

Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O) crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad ib_core acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod crc_t10dif crct10dif_common ahci libahci video 3w_9xxx dm_mirror dm_region_hash dm_log dm_mod [last unloaded: drbd]

CPU: 7 PID: 52888 Comm: kthreadd Tainted: G        W  O 3.12.14 #1

Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0035.030220120927 03/02/2012

task: ffff8800a3893540 ti: ffff8800771ac000 task.ti: ffff8800771ac000

RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)

RSP: 0018:ffff8800771add60  EFLAGS: 00010046

RAX: ffffffff81613440 RBX: ffff880237012d00 RCX: 000000000000b7a8

RDX: 0000000000000005 RSI: ffffffff81a10480 RDI: ffff880237012d00

RBP: ffff8800771add88 R08: 0000000000000000 R09: 00087a835b6f23c0

R10: ffffffff816133e5 R11: 0000000000000001 R12: ffffffff81a10480

R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000086

FS:  0000000000000000(0000) GS:ffff8802370e0000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 0000000000000000 CR3: 0000000001a0b000 CR4: 00000000001407e0

Stack:

 ffffffff8107ae8a 00007f1bb02036a0 ffffffff81a10480 ffff880237012d00

 ffffffff81a10c84 ffff8800771add98 ffffffff8107b992 ffff8800771addb8

 ffffffff8107b9c1 ffff880237012d00 ffffffff81a10480 ffff8800771addd8

Call Trace:

 [<ffffffff8107ae8a>] ? enqueue_task+0x6a/0x80

 [<ffffffff8107b992>] activate_task+0x22/0x30

 [<ffffffff8107b9c1>] ttwu_activate+0x21/0x50

 [<ffffffff8107c2bc>] ttwu_do_activate.clone.0+0x3c/0x60

 [<ffffffff8107fec6>] try_to_wake_up+0x206/0x2c0

 [<ffffffff8107ff92>] default_wake_function+0x12/0x20

 [<ffffffff81077659>] __wake_up_common+0x59/0x90

 [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd]

 [<ffffffff81079666>] complete+0x46/0x60

 [<ffffffff8106e97e>] kthread+0x9e/0xe0

 [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70

 [<ffffffff8154792c>] ret_from_fork+0x7c/0xb0

 [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70

Code:  Bad RIP value.

RIP  [<          (null)>]           (null)

 RSP <ffff8800771add60>

CR2: 0000000000000000

---[ end trace f6ab84a85de85d6c ]---

Kernel panic - not syncing: Fatal exception


Discussed this topic with Bart from SCST, he suggested:


"To me this looks like a DRBD bug. If a timeout of a DRBD request occurs
the DRBD timeout handler (request_timer_fn()) is called, that function
grabs a spinlock and next calls kthread_create_on_node() with that
spinlock held. kthread_create_on_node() can sleep which means that it is
not allowed to call that function with a spinlock held. I'm not sure why
this particular behavior has not yet been noticed before. Can you post
the first call stack from the e-mail at the start of this thread
together with the kernel version number on a DRBD-related mailing list
and ask the DRBD authors for their opinion ?"


Thanks in advance

--
#Matteo Tescione
#RMnet srl




More information about the drbd-user mailing list