[DRBD-user] Softlockup when using 9.0.15-1 version
kvaps
kvapss at gmail.com
Sun Sep 30 13:45:17 CEST 2018
Hi, ok. Here is my kernel traces for the 4.15.18-5-pve kernel, I use
it with both: bionic and xenial ubuntu systems.
There is also out-of-tree mlnx_en driver installed (version 4.3-1.0.1)
I use VFIO devices connected via bonding for DRBD communication.
As I said before the problem occurs only with 9.0.15-1 kernel module,
and there is no problems on 9.0.14-1 version.
I'm not sure but maybe it is connected with version mismatch between
9.0.14-1 and 9.0.15-1 during replication process is running?
-------------------------------------------------------------------------------
[ 673.805377] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
[ 683.134748] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 688.597202] watchdog: BUG: soft lockup - CPU#4 stuck for 23s!
[kworker/4:2:3483]
[ 689.539292] Kernel panic - not syncing: softlockup: hung tasks
[ 689.822358] CPU: 4 PID: 3483 Comm: kworker/4:2 Tainted: P
O L 4.15.18-5-pve #1
[ 690.223400] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 690.726310] Workqueue: events wait_rcu_exp_gp
[ 690.937639] Call Trace:
[ 691.055597] <IRQ>
[ 691.152632] dump_stack+0x63/0x8b
[ 691.314120] panic+0xe4/0x244
[ 691.457890] watchdog_timer_fn+0x21c/0x230
[ 691.656303] ? watchdog+0x30/0x30
[ 691.816636] __hrtimer_run_queues+0xe7/0x220
[ 692.023631] hrtimer_interrupt+0xa3/0x1e0
[ 692.218525] smp_apic_timer_interrupt+0x6f/0x130
[ 692.442133] apic_timer_interrupt+0x84/0x90
[ 692.644699] </IRQ>
[ 692.746094] RIP: 0010:smp_call_function_single+0x8e/0x100
[ 693.007742] RSP: 0018:ffffbb8a07443d80 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[ 693.375128] RAX: 0000000000000001 RBX: ffffffffbb4a9200 RCX: 0000000000000000
[ 693.721604] RDX: ffffffffbb4a9200 RSI: ffffffffba0feac0 RDI: 0000000000000005
[ 694.068179] RBP: ffffbb8a07443dd0 R08: ffff96bf81523880 R09: 0000000000000005
[ 694.414734] R10: ffffbb8a07443df8 R11: 0000000000000280 R12: 000000000000005e
[ 694.760804] R13: ffffffffbb4a9200 R14: 00000000000000a0 R15: 0000000000000001
[ 695.107153] ? rcu_barrier_func+0x50/0x50
[ 695.301204] sync_rcu_exp_select_cpus+0x2ad/0x420
[ 695.529610] ? cpumask_next+0x1b/0x20
[ 695.707302] ? sync_rcu_exp_select_cpus+0x2ad/0x420
[ 695.944146] ? rcu_barrier_func+0x50/0x50
[ 696.138698] wait_rcu_exp_gp+0x20/0x30
[ 696.320144] process_one_work+0x1e0/0x400
[ 696.514658] worker_thread+0x4b/0x420
[ 696.692225] kthread+0x105/0x140
[ 696.848570] ? process_one_work+0x400/0x400
[ 697.051278] ? kthread_create_worker_on_cpu+0x70/0x70
[ 697.296928] ? do_syscall_64+0x73/0x130
[ 697.483045] ? SyS_exit_group+0x14/0x20
[ 697.669355] ret_from_fork+0x35/0x40
[ 698.924648] Shutting down cpus with NMI
[ 699.110150] Kernel Offset: 0x39000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 699.638057] Rebooting in 10 seconds..
[ 709.736369] ACPI MEMORY or I/O RESET_REG.
-------------------------------------------------------------------------------
[ 9596.201090] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [systemd:12398]
[ 9596.548566] Kernel panic - not syncing: softlockup: hung tasks
[ 9596.831632] CPU: 5 PID: 12398 Comm: systemd Tainted: P O
L 4.15.18-5-pve #1
[ 9597.220358] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 9597.724072] Call Trace:
[ 9597.842902] <IRQ>
[ 9597.940248] dump_stack+0x63/0x8b
[ 9598.100587] panic+0xe4/0x244
[ 9598.244267] watchdog_timer_fn+0x21c/0x230
[ 9598.442988] ? watchdog+0x30/0x30
[ 9598.603634] __hrtimer_run_queues+0xe7/0x220
[ 9598.811388] hrtimer_interrupt+0xa3/0x1e0
[ 9599.006050] smp_apic_timer_interrupt+0x6f/0x130
[ 9599.230283] apic_timer_interrupt+0x84/0x90
[ 9599.433052] </IRQ>
[ 9599.534382] RIP: 0010:smp_call_function_many+0x1f9/0x260
[ 9599.791986] RSP: 0018:ffffab5ca6027b40 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[ 9600.155228] RAX: 0000000000000000 RBX: ffffa00801563900 RCX: ffffa008014288e0
[ 9600.501585] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffa007c0844070
[ 9600.847949] RBP: ffffab5ca6027b78 R08: ffffffffffffffff R09: 00000000000000df
[ 9601.194242] R10: ffffd877bfaf9c00 R11: 0000000000000095 R12: ffffffff8767b7d0
[ 9601.541932] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000008
[ 9601.888206] ? load_new_mm_cr3+0xf0/0xf0
[ 9602.078392] ? load_new_mm_cr3+0xf0/0xf0
[ 9602.268282] on_each_cpu+0x2d/0x60
[ 9602.433017] flush_tlb_kernel_range+0x4b/0x80
[ 9602.644043] ? purge_fragmented_blocks_allcpus+0x53/0x1f0
[ 9602.906196] __purge_vmap_area_lazy+0x52/0xc0
[ 9603.117778] vm_unmap_aliases+0xfa/0x130
[ 9603.307708] change_page_attr_set_clr+0xea/0x370
[ 9603.531861] set_memory_ro+0x29/0x30
[ 9603.705108] bpf_int_jit_compile+0x1e0/0x2d0
[ 9603.912190] bpf_prog_select_runtime+0xdb/0x100
[ 9604.132496] bpf_prepare_filter+0x387/0x400
[ 9604.335457] __get_filter+0xc3/0x100
[ 9604.508597] sk_attach_filter+0x18/0x60
[ 9604.694663] sock_setsockopt+0x43e/0x990
[ 9604.884914] ? apparmor_socket_setsockopt+0x22/0x30
[ 9605.121579] SyS_setsockopt+0xd3/0xf0
[ 9605.298778] do_syscall_64+0x73/0x130
[ 9605.476152] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 9605.721497] RIP: 0033:0x7efd49ed5e6a
[ 9605.894823] RSP: 002b:00007ffdefb136e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000036
[ 9606.263666] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007efd49ed5e6a
[ 9606.610963] RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000009
[ 9606.959240] RBP: 00007ffdefb13710 R08: 0000000000000010 R09: 0000000000000006
[ 9607.306903] R10: 00007ffdefb13700 R11: 0000000000000246 R12: 000055b51a1a03c8
[ 9607.655192] R13: 000000000000000a R14: 00007ffdefb13760 R15: 000055b51a1a0230
[ 9609.124819] Shutting down cpus with NMI
[ 9609.311447] Kernel Offset: 0x6600000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 9609.840620] Rebooting in 10 seconds..
-------------------------------------------------------------------------------
[ 384.679007] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 384.781477] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[ 396.205982] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [sensors:12097]
[ 397.136095] Kernel panic - not syncing: softlockup: hung tasks
[ 397.418904] CPU: 2 PID: 12097 Comm: sensors Tainted: P O
L 4.15.18-5-pve #1
[ 397.807613] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 398.311096] Call Trace:
[ 398.429564] <IRQ>
[ 398.526903] dump_stack+0x63/0x8b
[ 398.687306] panic+0xe4/0x244
[ 398.830885] watchdog_timer_fn+0x21c/0x230
[ 399.029325] ? watchdog+0x30/0x30
[ 399.189827] __hrtimer_run_queues+0xe7/0x220
[ 399.397121] hrtimer_interrupt+0xa3/0x1e0
[ 399.591510] smp_apic_timer_interrupt+0x6f/0x130
[ 399.815278] apic_timer_interrupt+0x84/0x90
[ 400.018834] </IRQ>
[ 400.120069] RIP: 0010:smp_call_function_single+0xdd/0x100
[ 400.381752] RSP: 0018:ffffb5862411fca0 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[ 400.744894] RAX: 0000000000000000 RBX: ffffb5862411fd6c RCX: 0000000000000830
[ 401.091501] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000830
[ 401.433770] RBP: ffffb5862411fcf0 R08: ffff9d389560cc18 R09: ffff9d388a75e180
[ 401.775868] R10: ffffb5862411fd20 R11: 0000000000000000 R12: ffffb5862411fd68
[ 402.124117] R13: ffff9d388a47b000 R14: 0000000000000001 R15: ffff9d3873594080
[ 402.470882] ? sbitmap_queue_init_node+0x1a0/0x1a0
[ 402.703603] ? vsnprintf+0xf1/0x4e0
[ 402.872647] rdmsr_on_cpu+0x5d/0x90
[ 403.042378] ? rdmsr_on_cpu+0x5d/0x90
[ 403.219880] show_temp+0xa9/0xe0 [coretemp]
[ 403.422975] dev_attr_show+0x23/0x60
[ 403.596290] ? mutex_lock+0x12/0x40
[ 403.765190] sysfs_kf_seq_show+0xa2/0x130
[ 403.959449] kernfs_seq_show+0x27/0x30
[ 404.141147] seq_read+0x12a/0x430
[ 404.301845] kernfs_fop_read+0x116/0x180
[ 404.492090] ? security_file_permission+0xa1/0xc0
[ 404.720388] __vfs_read+0x1b/0x40
[ 404.881143] vfs_read+0x96/0x130
[ 405.037746] SyS_read+0x55/0xc0
[ 405.189772] do_syscall_64+0x73/0x130
[ 405.367127] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 405.612262] RIP: 0033:0x7efc19c3e260
[ 405.785308] RSP: 002b:00007fff2685b538 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 406.153341] RAX: ffffffffffffffda RBX: 0000000002129880 RCX: 00007efc19c3e260
[ 406.495101] RDX: 0000000000001000 RSI: 0000000002129ab0 RDI: 0000000000000003
[ 406.842407] RBP: 00007fff2685bc50 R08: 00007efc19f0bb98 R09: 0000000000000000
[ 407.189117] R10: 00007efc19f0bb78 R11: 0000000000000246 R12: 0000000000000000
[ 407.536464] R13: ffffffffffffff98 R14: 00007efc19f0c420 R15: 0000000002129880
[ 407.853825] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 407.853829] 0-...0: (1 GPs behind) idle=ca2/140000000000001/0
softirq=23498/23499 fqs=7290
[ 407.853831] 1-...0: (3 GPs behind) idle=346/140000000000000/0
softirq=22199/22200 fqs=7290
[ 407.853831] (detected by 4, t=15002 jiffies, g=15073, c=15072, q=46784)
[ 407.853834] Sending NMI from CPU 4 to CPUs 0:
[ 407.853837] NMI backtrace for cpu 0
[ 407.854048] CPU: 0 PID: 8730 Comm: drbd_r_tgt112 Tainted: P
O L 4.15.18-5-pve #1
[ 407.854049] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 407.854053] RIP: 0010:native_queued_spin_lock_slowpath+0x179/0x1a0
[ 407.854054] RSP: 0018:ffff9d3901403bd8 EFLAGS: 00000002
[ 407.854055] RAX: 0000000000080101 RBX: 0000000000000086 RCX: 0000000000000001
[ 407.854056] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff9d386e6202b8
[ 407.854057] RBP: ffff9d3901403bd8 R08: 0000000000000101 R09: 0000000000000000
[ 407.854057] R10: 0000000000000356 R11: 0000000000000046 R12: ffff9d386e620000
[ 407.854058] R13: 0000000000767600 R14: ffff9d386e699000 R15: 000000000000001d
[ 407.854059] FS: 0000000000000000(0000) GS:ffff9d3901400000(0000)
knlGS:0000000000000000
[ 407.854060] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 407.854061] CR2: 00007ffcf041e090 CR3: 00000008a160a003 CR4: 00000000003606f0
[ 407.854273] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 407.854275] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 407.854275] Call Trace:
[ 407.854276] <IRQ>
[ 407.854280] _raw_spin_lock_irqsave+0x37/0x40
[ 407.854291] drbd_rs_complete_io+0x3e/0x150 [drbd]
[ 407.854298] drbd_endio_write_sec_final+0x3c5/0x450 [drbd]
[ 407.854304] drbd_peer_request_endio+0x160/0x3c0 [drbd]
[ 407.854305] bio_endio+0xa2/0x120
[ 407.854308] dec_pending+0x124/0x250
[ 407.854310] clone_endio+0x88/0x130
[ 407.854311] bio_endio+0xa2/0x120
[ 407.854312] blk_update_request+0x98/0x2d0
[ 407.854314] blk_mq_end_request+0x1e/0x90
[ 407.854316] nvme_complete_rq+0x23/0x120
[ 407.854318] nvme_pci_complete_rq+0x7a/0x130
[ 407.854320] __blk_mq_complete_request+0x99/0x150
[ 407.854321] blk_mq_complete_request+0x17/0x20
[ 407.854323] nvme_process_cq+0xe1/0x1b0
[ 407.854325] nvme_irq+0x23/0x50
[ 407.854326] __handle_irq_event_percpu+0x84/0x1a0
[ 407.854328] handle_irq_event_percpu+0x32/0x80
[ 407.854330] handle_irq_event+0x3b/0x60
[ 407.854331] handle_edge_irq+0x78/0x1a0
[ 407.854333] handle_irq+0x20/0x30
[ 407.854335] do_IRQ+0x4e/0xd0
[ 407.854336] common_interrupt+0x84/0x84
[ 407.854337] </IRQ>
[ 407.854339] RIP: 0010:_raw_spin_lock+0x10/0x30
[ 407.854339] RSP: 0018:ffffb586081c7d70 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffdc
[ 407.854341] RAX: 0000000000000000 RBX: ffff9d3856c82000 RCX: 0000000000000000
[ 407.854341] RDX: 0000000000000001 RSI: 00000000000122e7 RDI: ffff9d386e6202b8
[ 407.854342] RBP: ffffb586081c7e50 R08: 0000000000001000 R09: ffff9d386e620000
[ 407.854343] R10: ffff9d388a6a97e8 R11: ffff9d386e699000 R12: ffff9d386e699000
[ 407.854343] R13: ffff9d386e620000 R14: ffff9d386e6202b8 R15: ffff9d388a6a97e8
[ 407.854351] ? receive_Data+0x6d3/0x15b0 [drbd]
[ 407.854353] ? dtt_recv+0xd8/0x180 [drbd_transport_tcp]
[ 407.854360] ? receive_RSDataReply+0x370/0x370 [drbd]
[ 407.854365] ? receive_RSDataReply+0x370/0x370 [drbd]
[ 407.854371] drbd_receiver+0x45c/0x6a0 [drbd]
[ 407.854379] drbd_thread_setup+0x79/0x190 [drbd]
[ 407.854381] kthread+0x105/0x140
[ 407.854387] ? __drbd_next_peer_device_ref+0x170/0x170 [drbd]
[ 407.854389] ? kthread_create_worker_on_cpu+0x70/0x70
[ 407.854391] ? do_syscall_64+0x73/0x130
[ 407.854392] ? SyS_exit_group+0x14/0x20
[ 407.854394] ret_from_fork+0x35/0x40
-------------------------------------------------------------------------------
[ 524.920614] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 531.795413] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
[ 588.202242] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sensors:13479]
[ 589.130264] Kernel panic - not syncing: softlockup: hung tasks
[ 589.412765] CPU: 1 PID: 13479 Comm: sensors Tainted: P O
L 4.15.18-5-pve #1
[ 589.800666] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 590.302717] Call Trace:
[ 590.420675] <IRQ>
[ 590.517554] dump_stack+0x63/0x8b
[ 590.677690] panic+0xe4/0x244
[ 590.821054] watchdog_timer_fn+0x21c/0x230
[ 591.019107] ? watchdog+0x30/0x30
[ 591.179362] __hrtimer_run_queues+0xe7/0x220
[ 591.385849] hrtimer_interrupt+0xa3/0x1e0
[ 591.579709] smp_apic_timer_interrupt+0x6f/0x130
[ 591.803262] apic_timer_interrupt+0x84/0x90
[ 592.005528] </IRQ>
[ 592.107069] RIP: 0010:smp_call_function_single+0xdd/0x100
[ 592.368500] RSP: 0018:ffffb7b2a482bca0 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[ 592.735159] RAX: 0000000000000000 RBX: ffffb7b2a482bd6c RCX: 0000000000000830
[ 593.081059] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000830
[ 593.427696] RBP: ffffb7b2a482bcf0 R08: ffff8935adb2fc18 R09: ffff8935a388f9c0
[ 593.773919] R10: ffffb7b2a482bd20 R11: 0000000000000000 R12: ffffb7b2a482bd68
[ 594.119842] R13: ffff8935c1a63000 R14: 0000000000000001 R15: ffff8935c0254100
[ 594.461694] ? sbitmap_queue_init_node+0x1a0/0x1a0
[ 594.693704] rdmsr_on_cpu+0x5d/0x90
[ 594.862401] ? rdmsr_on_cpu+0x5d/0x90
[ 595.039732] show_crit_alarm+0x59/0xa0 [coretemp]
[ 595.267472] dev_attr_show+0x23/0x60
[ 595.440312] ? mutex_lock+0x12/0x40
[ 595.608969] sysfs_kf_seq_show+0xa2/0x130
[ 595.802897] kernfs_seq_show+0x27/0x30
[ 595.984857] seq_read+0x12a/0x430
[ 596.145125] kernfs_fop_read+0x116/0x180
[ 596.334749] ? security_file_permission+0xa1/0xc0
[ 596.562326] __vfs_read+0x1b/0x40
[ 596.722505] vfs_read+0x96/0x130
[ 596.874230] SyS_read+0x55/0xc0
[ 597.025935] do_syscall_64+0x73/0x130
[ 597.203036] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 597.447495] RIP: 0033:0x7f71ed58d260
[ 597.620432] RSP: 002b:00007ffe54e32e48 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[ 597.986899] RAX: ffffffffffffffda RBX: 0000000000be9880 RCX: 00007f71ed58d260
[ 598.332523] RDX: 0000000000001000 RSI: 0000000000be9ab0 RDI: 0000000000000003
[ 598.678088] RBP: 00007ffe54e33560 R08: 00007f71ed85b278 R09: 00007f71edc8d700
[ 599.024604] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 599.370916] R13: ffffffffffffff98 R14: 00007f71ed85b420 R15: 0000000000be9880
[ 600.818687] Shutting down cpus with NMI
[ 601.004760] Kernel Offset: 0x1dc00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 601.533984] Rebooting in 10 seconds..
- kvaps
On Fri, Sep 28, 2018 at 4:06 PM Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
>
> On Thu, Sep 27, 2018 at 05:49:43PM +0200, kvaps wrote:
> > Hi, I found out that when I using 9.0.15-1 version of drbd kernel
> > module, I have softlockup problems quite often.
> >
> > For now I have many nodes with 9.0.14-1 and they are working fine.
> >
> > When I add new nodes with 9.0.15-1, then distribute old resources on
> > them, after a while I start having random reboots of the new nodes,
> > during synchronization.
> >
> > New nodes isn't using resources at all, kernel log shows, that it was
> > tainted on some random processes like systemd, sensors, etc.
> >
> > After downgrade to the version to 9.0.14-1, synchronization is finishing fine.
> >
> > I use 4.15.18-5-pve kernel, and I can provide my kernel stack traces
> > if you want to.
>
> Sure.
> otherwise this is a non-actionable "$something does not work" report.
>
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
>
> DRBD® and LINBIT® are registered trademarks of LINBIT
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
More information about the drbd-user
mailing list