Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Lars, > exact DRBD version? > in tree, out of tree? > kernel? patches applied? which ones? > > DRBD Configuration and IO stack? > What do you do to trigger it? > How easy can you reproduce? > Logs leading up to this? 8.4.4 downloaded from official repository, compiled out of tree. Kernel is 3.12.14 with only scst req_fifo patch & iscsi_putpage_callback patch applied. I do nothing, the 2 box are interconnected with 40-gbit back-to-back infiniband qle7342, but I seen several disconnection event repeating random even with no IO, and noticed that once in a month the primary box crashes. So I turned on the console logging and captured the trace you see below. As I said, occurs only once in 30-40 days. Logs before this panic: drbd TOP: Restarting receiver thread drbd TOP: receiver (re)started drbd TOP: conn( Unconnected -> WFConnection ) drbd TOP: helper command: /sbin/drbdadm fence-peer TOP exit code 4 (0x400) drbd TOP: fence-peer helper returned 4 (peer was fenced) drbd TOP: pdsk( DUnknown -> Outdated ) drbd TOP: Handshake successful: Agreed network protocol version 101 drbd TOP: Agreed to support TRIM on protocol level drbd TOP: Peer authenticated using 20 bytes HMAC drbd TOP: conn( WFConnection -> WFReportParams ) drbd TOP: Starting asender thread (from drbd_r_TOP [4385]) block drbd0: drbd_sync_handshake: block drbd0: self 34B181D069424F63:3921A73508F919D7:673BF468C9830995:673AF468C9830995 bits:59293 flags:0 block drbd0: peer 3921A73508F919D6:0000000000000000:673BF468C9830994:673AF468C9830995 bits:0 flags:0 block drbd0: uuid_compare()=1 by rule 70 block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0% block drbd1: drbd_sync_handshake: block drbd1: self D532468B0E14258D:344C01173D6B201D:659C03CE91D88EBB:659B03CE91D88EBB bits:44 flags:0 block drbd1: peer 344C01173D6B201C:0000000000000000:659C03CE91D88EBA:659B03CE91D88EBB bits:0 flags:0 block drbd1: uuid_compare()=1 by rule 70 block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0% block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0% block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) block drbd0: Began resync as SyncSource (will sync 237172 KB [59293 bits set]). block drbd0: updated sync UUID 34B181D069424F63:3922A73508F919D7:3921A73508F919D7:673BF468C9830995 block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0% block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) block drbd1: Began resync as SyncSource (will sync 196 KB [49 bits set]). block drbd1: updated sync UUID D532468B0E14258D:344D01173D6B201D:344C01173D6B201D:659C03CE91D88EBB block drbd1: Resync done (total 1 sec; paused 0 sec; 196 K/sec) block drbd1: updated UUIDs D532468B0E14258D:0000000000000000:344D01173D6B201D:344C01173D6B201D block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) block drbd0: Resync done (total 6 sec; paused 0 sec; 39528 K/sec) block drbd0: updated UUIDs 34B181D069424F63:0000000000000000:3922A73508F919D7:3921A73508F919D7 block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) block drbd1: Remote failed to finish a request within ko-count * timeout block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) drbd TOP: asender terminated drbd TOP: Terminating drbd_a_TOP block drbd1: md_sync_timer expired! Worker calls drbd_md_sync(). block drbd1: new current UUID 3C68CD64811C1C41:D532468B0E14258D:344D01173D6B201D:344C01173D6B201D drbd TOP: Connection closed block drbd0: State change failed: Refusing to be Primary while peer is not outdated block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- } drbd TOP: State change failed: Refusing to be Primary while peer is not outdated drbd TOP: mask = 0x1f0 val = 0x20 drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected drbd TOP: Expected cstate < C_WF_REPORT_PARAMS drbd TOP: receiver terminated drbd TOP: Restarting receiver thread drbd TOP: receiver (re)started block drbd0: State change failed: Refusing to be Primary while peer is not outdated block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } block drbd0: wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r----- } drbd TOP: State change failed: Refusing to be Primary while peer is not outdated drbd TOP: mask = 0x1f0 val = 0x80 drbd TOP: old_conn:WFReportParams wanted_conn:WFConnection drbd TOP: Connection closed block drbd0: State change failed: Refusing to be Primary while peer is not outdated block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- } drbd TOP: State change failed: Refusing to be Primary while peer is not outdated drbd TOP: mask = 0x1f0 val = 0x20 drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected drbd TOP: Expected cstate < C_WF_REPORT_PARAMS drbd TOP: receiver terminated drbd TOP: Terminating drbd_r_TOP block drbd0: Remote failed to finish a request within ko-count * timeout block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) drbd TOP: Starting receiver thread (from swapper/0 [0]) BUG: scheduling while atomic: swapper/0/0/0x00000100 Best Regards, -- matteo > The stack traces look unusual. > You happen to try to trick DRBD into doing something it cannot do? > > > Discussed this topic with Bart from SCST, he suggested: > > > > > > "To me this looks like a DRBD bug. If a timeout of a DRBD request > > occurs > > the DRBD timeout handler (request_timer_fn()) is called, that > > function > > grabs a spinlock and next calls kthread_create_on_node() with that > > spinlock held. > > Well, no, our request_timer_fn does nothing like that. > At least it is not supposed to, and I don't see a code path > leading to anything like that. > > > kthread_create_on_node() can sleep which means that it is > > not allowed to call that function with a spinlock held. I'm not > > sure why > > this particular behavior has not yet been noticed before. Can you > > post > > the first call stack from the e-mail at the start of this thread > > together with the kernel version number on a DRBD-related mailing > > list > > and ask the DRBD authors for their opinion ?" > > > > > > Thanks in advance > > Traces: > > > drbd TOP: Starting receiver thread (from swapper/0 [0]) > > BUG: scheduling while atomic: swapper/0/0/0x00000100 > > Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O) > > crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding > > ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack > > ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs > > ib_umad > > rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt > > iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801 i2c_core > > lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad ib_core > > acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod crc_t10dif > > crct10dif_common ahci libahci video 3w_9xxx dm_mirror > > dm_region_hash > > dm_log dm_mod [last unloaded: drbd] > > CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.12.14 #1 > > Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS > > S1200BT.86B.02.00.0035.030220120927 03/02/2012 > > 0000000000000000 ffff8802370039c0 ffffffff8153b1c2 > > 0000000000000000 > > ffff880237012d00 ffff8802370039d0 ffffffff8107ac0b > > ffff880237003a70 > > ffffffff8153d7dd ffff880237003a00 ffff880230e38480 > > 0000000000000024 > > Call Trace: > > <IRQ> [<ffffffff8153b1c2>] dump_stack+0x49/0x5f > > [<ffffffff8107ac0b>] __schedule_bug+0x4b/0x60 > > [<ffffffff8153d7dd>] __schedule+0x6ad/0x6f0 > > [<ffffffff8153d8e9>] schedule+0x29/0x70 > > [<ffffffff8153b375>] schedule_timeout+0x195/0x220 > > [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a > > [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100 > > [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0 > > [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120 > > [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd] > > [<ffffffff8153b144>] ? printk+0x4d/0x4f > > [<ffffffff8105b176>] ? __flush_signals+0x36/0x40 > > [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd] > > [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd] > > [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd] > > [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd] > > [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd] > > [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40 > > [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 > > [drbd] > > [<ffffffff810581d9>] call_timer_fn+0x49/0x120 > > [<ffffffff81058db5>] run_timer_softirq+0x225/0x290 > > [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0 > > [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 > > [drbd] > > [<ffffffff81050dbf>] __do_softirq+0xef/0x260 > > [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250 > > [<ffffffff8154905c>] call_softirq+0x1c/0x30 > > [<ffffffff81005805>] do_softirq+0x55/0x90 > > [<ffffffff81050af5>] irq_exit+0xb5/0xc0 > > [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a > > [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70 > > <EOI> [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0 > > [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0 > > [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160 > > [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30 > > [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230 > > [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80 > > [<ffffffff81533f07>] rest_init+0x77/0x80 > > [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0 > > [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b > > [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c > > [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140 > > bad: scheduling from the idle thread! > > CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W O 3.12.14 #1 > > Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS > > S1200BT.86B.02.00.0035.030220120927 03/02/2012 > > ffffffff81a10480 ffff880237003970 ffffffff8153b1c2 > > 0000000000000000 > > ffff880237012d00 ffff880237003990 ffffffff810829f9 > > ffff8802370039a0 > > ffff880237012d00 ffff8802370039c0 ffffffff8107add3 > > ffff8802370039c0 > > Call Trace: > > <IRQ> [<ffffffff8153b1c2>] dump_stack+0x49/0x5f > > [<ffffffff810829f9>] dequeue_task_idle+0x29/0x40 > > [<ffffffff8107add3>] dequeue_task+0x93/0xb0 > > [<ffffffff8107ae12>] deactivate_task+0x22/0x30 > > [<ffffffff8153d5bc>] __schedule+0x48c/0x6f0 > > [<ffffffff8153d8e9>] schedule+0x29/0x70 > > [<ffffffff8153b375>] schedule_timeout+0x195/0x220 > > [<ffffffff8153f6aa>] ? common_interrupt+0x6a/0x6a > > [<ffffffff8153dcb6>] wait_for_completion+0xc6/0x100 > > [<ffffffff8107ff80>] ? try_to_wake_up+0x2c0/0x2c0 > > [<ffffffff8106ee1d>] kthread_create_on_node+0x9d/0x120 > > [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd] > > [<ffffffff8153b144>] ? printk+0x4d/0x4f > > [<ffffffff8105b176>] ? __flush_signals+0x36/0x40 > > [<ffffffffa0533a7a>] drbd_thread_start+0x11a/0x290 [drbd] > > [<ffffffffa0533cd8>] _drbd_thread_stop+0xe8/0x100 [drbd] > > [<ffffffffa0544996>] __drbd_set_state+0xb86/0xcd0 [drbd] > > [<ffffffffa0525c89>] _drbd_set_state.clone.0+0x39/0x50 [drbd] > > [<ffffffffa0527f2c>] request_timer_fn+0x3fc/0x560 [drbd] > > [<ffffffff810668a0>] ? flush_delayed_work+0x40/0x40 > > [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 > > [drbd] > > [<ffffffff810581d9>] call_timer_fn+0x49/0x120 > > [<ffffffff81058db5>] run_timer_softirq+0x225/0x290 > > [<ffffffff8126bd10>] ? timerqueue_add+0x60/0xb0 > > [<ffffffffa0527b30>] ? drbd_submit_req_private_bio+0x2b0/0x2b0 > > [drbd] > > [<ffffffff81050dbf>] __do_softirq+0xef/0x260 > > [<ffffffff8107263b>] ? hrtimer_interrupt+0x15b/0x250 > > [<ffffffff8154905c>] call_softirq+0x1c/0x30 > > [<ffffffff81005805>] do_softirq+0x55/0x90 > > [<ffffffff81050af5>] irq_exit+0xb5/0xc0 > > [<ffffffff8154987a>] smp_apic_timer_interrupt+0x4a/0x5a > > [<ffffffff8154858a>] apic_timer_interrupt+0x6a/0x70 > > <EOI> [<ffffffff81434a6f>] ? cpuidle_enter_state+0x4f/0xd0 > > [<ffffffff81434a6b>] ? cpuidle_enter_state+0x4b/0xd0 > > [<ffffffff81434e6f>] cpuidle_idle_call+0xcf/0x160 > > [<ffffffff8100c5ee>] arch_cpu_idle+0xe/0x30 > > [<ffffffff8109ed5a>] cpu_idle_loop+0x9a/0x230 > > [<ffffffff8109ef60>] cpu_startup_entry+0x70/0x80 > > [<ffffffff81533f07>] rest_init+0x77/0x80 > > [<ffffffff81b1017c>] start_kernel+0x3e3/0x3f0 > > [<ffffffff81b0fbc7>] ? repair_env_string+0x5b/0x5b > > [<ffffffff81b0f59d>] x86_64_start_reservations+0x2a/0x2c > > [<ffffffff81b0f6d8>] x86_64_start_kernel+0x139/0x140 > > BUG: unable to handle kernel NULL pointer dereference at > > (null) > > IP: [< (null)>] (null) > > PGD 0 > > Oops: 0010 [#1] SMP > > Modules linked in: iscsi_scst(O) ib_srpt(O) scst_vdisk(O) scst(O) > > crc32c_intel drbd(O) libcrc32c autofs4 cpufreq_ondemand bonding > > ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack > > ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs > > ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa uinput iTCO_wdt > > iTCO_vendor_support raid0 raid10 microcode pcspkr i2c_i801 > > i2c_core lpc_ich mfd_core sg e1000e ptp pps_core ib_qib dca ib_mad > > ib_core acpi_cpufreq freq_table ext4 jbd2 mbcache raid1 sd_mod > > crc_t10dif crct10dif_common ahci libahci video 3w_9xxx dm_mirror > > dm_region_hash dm_log dm_mod [last unloaded: drbd] > > CPU: 7 PID: 52888 Comm: kthreadd Tainted: G W O 3.12.14 #1 > > Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS > > S1200BT.86B.02.00.0035.030220120927 03/02/2012 > > task: ffff8800a3893540 ti: ffff8800771ac000 task.ti: > > ffff8800771ac000 > > RIP: 0010:[<0000000000000000>] [< (null)>] > > (null) > > RSP: 0018:ffff8800771add60 EFLAGS: 00010046 > > RAX: ffffffff81613440 RBX: ffff880237012d00 RCX: 000000000000b7a8 > > RDX: 0000000000000005 RSI: ffffffff81a10480 RDI: ffff880237012d00 > > RBP: ffff8800771add88 R08: 0000000000000000 R09: 00087a835b6f23c0 > > R10: ffffffff816133e5 R11: 0000000000000001 R12: ffffffff81a10480 > > R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000086 > > FS: 0000000000000000(0000) GS:ffff8802370e0000(0000) > > knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 0000000000000000 CR3: 0000000001a0b000 CR4: 00000000001407e0 > > Stack: > > ffffffff8107ae8a 00007f1bb02036a0 ffffffff81a10480 > > ffff880237012d00 > > ffffffff81a10c84 ffff8800771add98 ffffffff8107b992 > > ffff8800771addb8 > > ffffffff8107b9c1 ffff880237012d00 ffffffff81a10480 > > ffff8800771addd8 > > Call Trace: > > [<ffffffff8107ae8a>] ? enqueue_task+0x6a/0x80 > > [<ffffffff8107b992>] activate_task+0x22/0x30 > > [<ffffffff8107b9c1>] ttwu_activate+0x21/0x50 > > [<ffffffff8107c2bc>] ttwu_do_activate.clone.0+0x3c/0x60 > > [<ffffffff8107fec6>] try_to_wake_up+0x206/0x2c0 > > [<ffffffff8107ff92>] default_wake_function+0x12/0x20 > > [<ffffffff81077659>] __wake_up_common+0x59/0x90 > > [<ffffffffa0533840>] ? drbd_destroy_device+0x180/0x180 [drbd] > > [<ffffffff81079666>] complete+0x46/0x60 > > [<ffffffff8106e97e>] kthread+0x9e/0xe0 > > [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70 > > [<ffffffff8154792c>] ret_from_fork+0x7c/0xb0 > > [<ffffffff8106e8e0>] ? kthread_freezable_should_stop+0x70/0x70 > > Code: Bad RIP value. > > RIP [< (null)>] (null) > > RSP <ffff8800771add60> > > CR2: 0000000000000000 > > ---[ end trace f6ab84a85de85d6c ]--- > > > > Kernel panic - not syncing: Fatal exception > > -- > : Lars Ellenberg > : LINBIT | Your Way to High Availability > : DRBD/HA support and consulting http://www.linbit.com > > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. > __ > please don't Cc me, but send to list -- I'm subscribed > _______________________________________________ > drbd-user mailing list > drbd-user at lists.linbit.com > http://lists.linbit.com/mailman/listinfo/drbd-user >