[DRBD-user] Softlockup when using 9.0.15-1 version

kvaps kvapss at gmail.com
Sun Sep 30 13:45:17 CEST 2018


Hi, ok. Here is my kernel traces for the 4.15.18-5-pve kernel, I use
it with both: bionic and xenial ubuntu systems.
There is also out-of-tree mlnx_en driver installed (version 4.3-1.0.1)
I use VFIO devices connected via bonding for DRBD communication.
As I said before the problem occurs only with 9.0.15-1 kernel module,
and there is no problems on 9.0.14-1 version.

I'm not sure but maybe it is connected with version mismatch between
9.0.14-1 and 9.0.15-1 during replication process is running?

-------------------------------------------------------------------------------

[  673.805377] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
[  683.134748] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[  688.597202] watchdog: BUG: soft lockup - CPU#4 stuck for 23s!
[kworker/4:2:3483]
[  689.539292] Kernel panic - not syncing: softlockup: hung tasks
[  689.822358] CPU: 4 PID: 3483 Comm: kworker/4:2 Tainted: P
O L   4.15.18-5-pve #1
[  690.223400] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[  690.726310] Workqueue: events wait_rcu_exp_gp
[  690.937639] Call Trace:
[  691.055597]  <IRQ>
[  691.152632]  dump_stack+0x63/0x8b
[  691.314120]  panic+0xe4/0x244
[  691.457890]  watchdog_timer_fn+0x21c/0x230
[  691.656303]  ? watchdog+0x30/0x30
[  691.816636]  __hrtimer_run_queues+0xe7/0x220
[  692.023631]  hrtimer_interrupt+0xa3/0x1e0
[  692.218525]  smp_apic_timer_interrupt+0x6f/0x130
[  692.442133]  apic_timer_interrupt+0x84/0x90
[  692.644699]  </IRQ>
[  692.746094] RIP: 0010:smp_call_function_single+0x8e/0x100
[  693.007742] RSP: 0018:ffffbb8a07443d80 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[  693.375128] RAX: 0000000000000001 RBX: ffffffffbb4a9200 RCX: 0000000000000000
[  693.721604] RDX: ffffffffbb4a9200 RSI: ffffffffba0feac0 RDI: 0000000000000005
[  694.068179] RBP: ffffbb8a07443dd0 R08: ffff96bf81523880 R09: 0000000000000005
[  694.414734] R10: ffffbb8a07443df8 R11: 0000000000000280 R12: 000000000000005e
[  694.760804] R13: ffffffffbb4a9200 R14: 00000000000000a0 R15: 0000000000000001
[  695.107153]  ? rcu_barrier_func+0x50/0x50
[  695.301204]  sync_rcu_exp_select_cpus+0x2ad/0x420
[  695.529610]  ? cpumask_next+0x1b/0x20
[  695.707302]  ? sync_rcu_exp_select_cpus+0x2ad/0x420
[  695.944146]  ? rcu_barrier_func+0x50/0x50
[  696.138698]  wait_rcu_exp_gp+0x20/0x30
[  696.320144]  process_one_work+0x1e0/0x400
[  696.514658]  worker_thread+0x4b/0x420
[  696.692225]  kthread+0x105/0x140
[  696.848570]  ? process_one_work+0x400/0x400
[  697.051278]  ? kthread_create_worker_on_cpu+0x70/0x70
[  697.296928]  ? do_syscall_64+0x73/0x130
[  697.483045]  ? SyS_exit_group+0x14/0x20
[  697.669355]  ret_from_fork+0x35/0x40
[  698.924648] Shutting down cpus with NMI
[  699.110150] Kernel Offset: 0x39000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  699.638057] Rebooting in 10 seconds..
[  709.736369] ACPI MEMORY or I/O RESET_REG.

-------------------------------------------------------------------------------

[ 9596.201090] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [systemd:12398]
[ 9596.548566] Kernel panic - not syncing: softlockup: hung tasks
[ 9596.831632] CPU: 5 PID: 12398 Comm: systemd Tainted: P           O
L   4.15.18-5-pve #1
[ 9597.220358] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[ 9597.724072] Call Trace:
[ 9597.842902]  <IRQ>
[ 9597.940248]  dump_stack+0x63/0x8b
[ 9598.100587]  panic+0xe4/0x244
[ 9598.244267]  watchdog_timer_fn+0x21c/0x230
[ 9598.442988]  ? watchdog+0x30/0x30
[ 9598.603634]  __hrtimer_run_queues+0xe7/0x220
[ 9598.811388]  hrtimer_interrupt+0xa3/0x1e0
[ 9599.006050]  smp_apic_timer_interrupt+0x6f/0x130
[ 9599.230283]  apic_timer_interrupt+0x84/0x90
[ 9599.433052]  </IRQ>
[ 9599.534382] RIP: 0010:smp_call_function_many+0x1f9/0x260
[ 9599.791986] RSP: 0018:ffffab5ca6027b40 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[ 9600.155228] RAX: 0000000000000000 RBX: ffffa00801563900 RCX: ffffa008014288e0
[ 9600.501585] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffa007c0844070
[ 9600.847949] RBP: ffffab5ca6027b78 R08: ffffffffffffffff R09: 00000000000000df
[ 9601.194242] R10: ffffd877bfaf9c00 R11: 0000000000000095 R12: ffffffff8767b7d0
[ 9601.541932] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000008
[ 9601.888206]  ? load_new_mm_cr3+0xf0/0xf0
[ 9602.078392]  ? load_new_mm_cr3+0xf0/0xf0
[ 9602.268282]  on_each_cpu+0x2d/0x60
[ 9602.433017]  flush_tlb_kernel_range+0x4b/0x80
[ 9602.644043]  ? purge_fragmented_blocks_allcpus+0x53/0x1f0
[ 9602.906196]  __purge_vmap_area_lazy+0x52/0xc0
[ 9603.117778]  vm_unmap_aliases+0xfa/0x130
[ 9603.307708]  change_page_attr_set_clr+0xea/0x370
[ 9603.531861]  set_memory_ro+0x29/0x30
[ 9603.705108]  bpf_int_jit_compile+0x1e0/0x2d0
[ 9603.912190]  bpf_prog_select_runtime+0xdb/0x100
[ 9604.132496]  bpf_prepare_filter+0x387/0x400
[ 9604.335457]  __get_filter+0xc3/0x100
[ 9604.508597]  sk_attach_filter+0x18/0x60
[ 9604.694663]  sock_setsockopt+0x43e/0x990
[ 9604.884914]  ? apparmor_socket_setsockopt+0x22/0x30
[ 9605.121579]  SyS_setsockopt+0xd3/0xf0
[ 9605.298778]  do_syscall_64+0x73/0x130
[ 9605.476152]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 9605.721497] RIP: 0033:0x7efd49ed5e6a
[ 9605.894823] RSP: 002b:00007ffdefb136e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000036
[ 9606.263666] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007efd49ed5e6a
[ 9606.610963] RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000009
[ 9606.959240] RBP: 00007ffdefb13710 R08: 0000000000000010 R09: 0000000000000006
[ 9607.306903] R10: 00007ffdefb13700 R11: 0000000000000246 R12: 000055b51a1a03c8
[ 9607.655192] R13: 000000000000000a R14: 00007ffdefb13760 R15: 000055b51a1a0230
[ 9609.124819] Shutting down cpus with NMI
[ 9609.311447] Kernel Offset: 0x6600000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 9609.840620] Rebooting in 10 seconds..

-------------------------------------------------------------------------------

[  384.679007] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[  384.781477] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[  396.205982] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [sensors:12097]
[  397.136095] Kernel panic - not syncing: softlockup: hung tasks
[  397.418904] CPU: 2 PID: 12097 Comm: sensors Tainted: P           O
L   4.15.18-5-pve #1
[  397.807613] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[  398.311096] Call Trace:
[  398.429564]  <IRQ>
[  398.526903]  dump_stack+0x63/0x8b
[  398.687306]  panic+0xe4/0x244
[  398.830885]  watchdog_timer_fn+0x21c/0x230
[  399.029325]  ? watchdog+0x30/0x30
[  399.189827]  __hrtimer_run_queues+0xe7/0x220
[  399.397121]  hrtimer_interrupt+0xa3/0x1e0
[  399.591510]  smp_apic_timer_interrupt+0x6f/0x130
[  399.815278]  apic_timer_interrupt+0x84/0x90
[  400.018834]  </IRQ>
[  400.120069] RIP: 0010:smp_call_function_single+0xdd/0x100
[  400.381752] RSP: 0018:ffffb5862411fca0 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[  400.744894] RAX: 0000000000000000 RBX: ffffb5862411fd6c RCX: 0000000000000830
[  401.091501] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000830
[  401.433770] RBP: ffffb5862411fcf0 R08: ffff9d389560cc18 R09: ffff9d388a75e180
[  401.775868] R10: ffffb5862411fd20 R11: 0000000000000000 R12: ffffb5862411fd68
[  402.124117] R13: ffff9d388a47b000 R14: 0000000000000001 R15: ffff9d3873594080
[  402.470882]  ? sbitmap_queue_init_node+0x1a0/0x1a0
[  402.703603]  ? vsnprintf+0xf1/0x4e0
[  402.872647]  rdmsr_on_cpu+0x5d/0x90
[  403.042378]  ? rdmsr_on_cpu+0x5d/0x90
[  403.219880]  show_temp+0xa9/0xe0 [coretemp]
[  403.422975]  dev_attr_show+0x23/0x60
[  403.596290]  ? mutex_lock+0x12/0x40
[  403.765190]  sysfs_kf_seq_show+0xa2/0x130
[  403.959449]  kernfs_seq_show+0x27/0x30
[  404.141147]  seq_read+0x12a/0x430
[  404.301845]  kernfs_fop_read+0x116/0x180
[  404.492090]  ? security_file_permission+0xa1/0xc0
[  404.720388]  __vfs_read+0x1b/0x40
[  404.881143]  vfs_read+0x96/0x130
[  405.037746]  SyS_read+0x55/0xc0
[  405.189772]  do_syscall_64+0x73/0x130
[  405.367127]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  405.612262] RIP: 0033:0x7efc19c3e260
[  405.785308] RSP: 002b:00007fff2685b538 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[  406.153341] RAX: ffffffffffffffda RBX: 0000000002129880 RCX: 00007efc19c3e260
[  406.495101] RDX: 0000000000001000 RSI: 0000000002129ab0 RDI: 0000000000000003
[  406.842407] RBP: 00007fff2685bc50 R08: 00007efc19f0bb98 R09: 0000000000000000
[  407.189117] R10: 00007efc19f0bb78 R11: 0000000000000246 R12: 0000000000000000
[  407.536464] R13: ffffffffffffff98 R14: 00007efc19f0c420 R15: 0000000002129880
[  407.853825] INFO: rcu_sched detected stalls on CPUs/tasks:
[  407.853829]  0-...0: (1 GPs behind) idle=ca2/140000000000001/0
softirq=23498/23499 fqs=7290
[  407.853831]  1-...0: (3 GPs behind) idle=346/140000000000000/0
softirq=22199/22200 fqs=7290
[  407.853831]  (detected by 4, t=15002 jiffies, g=15073, c=15072, q=46784)
[  407.853834] Sending NMI from CPU 4 to CPUs 0:
[  407.853837] NMI backtrace for cpu 0
[  407.854048] CPU: 0 PID: 8730 Comm: drbd_r_tgt112 Tainted: P
  O L   4.15.18-5-pve #1
[  407.854049] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[  407.854053] RIP: 0010:native_queued_spin_lock_slowpath+0x179/0x1a0
[  407.854054] RSP: 0018:ffff9d3901403bd8 EFLAGS: 00000002
[  407.854055] RAX: 0000000000080101 RBX: 0000000000000086 RCX: 0000000000000001
[  407.854056] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff9d386e6202b8
[  407.854057] RBP: ffff9d3901403bd8 R08: 0000000000000101 R09: 0000000000000000
[  407.854057] R10: 0000000000000356 R11: 0000000000000046 R12: ffff9d386e620000
[  407.854058] R13: 0000000000767600 R14: ffff9d386e699000 R15: 000000000000001d
[  407.854059] FS:  0000000000000000(0000) GS:ffff9d3901400000(0000)
knlGS:0000000000000000
[  407.854060] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  407.854061] CR2: 00007ffcf041e090 CR3: 00000008a160a003 CR4: 00000000003606f0
[  407.854273] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  407.854275] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  407.854275] Call Trace:
[  407.854276]  <IRQ>
[  407.854280]  _raw_spin_lock_irqsave+0x37/0x40
[  407.854291]  drbd_rs_complete_io+0x3e/0x150 [drbd]
[  407.854298]  drbd_endio_write_sec_final+0x3c5/0x450 [drbd]
[  407.854304]  drbd_peer_request_endio+0x160/0x3c0 [drbd]
[  407.854305]  bio_endio+0xa2/0x120
[  407.854308]  dec_pending+0x124/0x250
[  407.854310]  clone_endio+0x88/0x130
[  407.854311]  bio_endio+0xa2/0x120
[  407.854312]  blk_update_request+0x98/0x2d0
[  407.854314]  blk_mq_end_request+0x1e/0x90
[  407.854316]  nvme_complete_rq+0x23/0x120
[  407.854318]  nvme_pci_complete_rq+0x7a/0x130
[  407.854320]  __blk_mq_complete_request+0x99/0x150
[  407.854321]  blk_mq_complete_request+0x17/0x20
[  407.854323]  nvme_process_cq+0xe1/0x1b0
[  407.854325]  nvme_irq+0x23/0x50
[  407.854326]  __handle_irq_event_percpu+0x84/0x1a0
[  407.854328]  handle_irq_event_percpu+0x32/0x80
[  407.854330]  handle_irq_event+0x3b/0x60
[  407.854331]  handle_edge_irq+0x78/0x1a0
[  407.854333]  handle_irq+0x20/0x30
[  407.854335]  do_IRQ+0x4e/0xd0
[  407.854336]  common_interrupt+0x84/0x84
[  407.854337]  </IRQ>
[  407.854339] RIP: 0010:_raw_spin_lock+0x10/0x30
[  407.854339] RSP: 0018:ffffb586081c7d70 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffdc
[  407.854341] RAX: 0000000000000000 RBX: ffff9d3856c82000 RCX: 0000000000000000
[  407.854341] RDX: 0000000000000001 RSI: 00000000000122e7 RDI: ffff9d386e6202b8
[  407.854342] RBP: ffffb586081c7e50 R08: 0000000000001000 R09: ffff9d386e620000
[  407.854343] R10: ffff9d388a6a97e8 R11: ffff9d386e699000 R12: ffff9d386e699000
[  407.854343] R13: ffff9d386e620000 R14: ffff9d386e6202b8 R15: ffff9d388a6a97e8
[  407.854351]  ? receive_Data+0x6d3/0x15b0 [drbd]
[  407.854353]  ? dtt_recv+0xd8/0x180 [drbd_transport_tcp]
[  407.854360]  ? receive_RSDataReply+0x370/0x370 [drbd]
[  407.854365]  ? receive_RSDataReply+0x370/0x370 [drbd]
[  407.854371]  drbd_receiver+0x45c/0x6a0 [drbd]
[  407.854379]  drbd_thread_setup+0x79/0x190 [drbd]
[  407.854381]  kthread+0x105/0x140
[  407.854387]  ? __drbd_next_peer_device_ref+0x170/0x170 [drbd]
[  407.854389]  ? kthread_create_worker_on_cpu+0x70/0x70
[  407.854391]  ? do_syscall_64+0x73/0x130
[  407.854392]  ? SyS_exit_group+0x14/0x20
[  407.854394]  ret_from_fork+0x35/0x40


-------------------------------------------------------------------------------


[  524.920614] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[  531.795413] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
[  588.202242] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [sensors:13479]
[  589.130264] Kernel panic - not syncing: softlockup: hung tasks
[  589.412765] CPU: 1 PID: 13479 Comm: sensors Tainted: P           O
L   4.15.18-5-pve #1
[  589.800666] Hardware name: HP ProLiant m710x Server
Cartridge/ProLiant m710x Server Cartridge, BIOS H07 05/10/2018
[  590.302717] Call Trace:
[  590.420675]  <IRQ>
[  590.517554]  dump_stack+0x63/0x8b
[  590.677690]  panic+0xe4/0x244
[  590.821054]  watchdog_timer_fn+0x21c/0x230
[  591.019107]  ? watchdog+0x30/0x30
[  591.179362]  __hrtimer_run_queues+0xe7/0x220
[  591.385849]  hrtimer_interrupt+0xa3/0x1e0
[  591.579709]  smp_apic_timer_interrupt+0x6f/0x130
[  591.803262]  apic_timer_interrupt+0x84/0x90
[  592.005528]  </IRQ>
[  592.107069] RIP: 0010:smp_call_function_single+0xdd/0x100
[  592.368500] RSP: 0018:ffffb7b2a482bca0 EFLAGS: 00000202 ORIG_RAX:
ffffffffffffff11
[  592.735159] RAX: 0000000000000000 RBX: ffffb7b2a482bd6c RCX: 0000000000000830
[  593.081059] RDX: 0000000000000001 RSI: 00000000000008fb RDI: 0000000000000830
[  593.427696] RBP: ffffb7b2a482bcf0 R08: ffff8935adb2fc18 R09: ffff8935a388f9c0
[  593.773919] R10: ffffb7b2a482bd20 R11: 0000000000000000 R12: ffffb7b2a482bd68
[  594.119842] R13: ffff8935c1a63000 R14: 0000000000000001 R15: ffff8935c0254100
[  594.461694]  ? sbitmap_queue_init_node+0x1a0/0x1a0
[  594.693704]  rdmsr_on_cpu+0x5d/0x90
[  594.862401]  ? rdmsr_on_cpu+0x5d/0x90
[  595.039732]  show_crit_alarm+0x59/0xa0 [coretemp]
[  595.267472]  dev_attr_show+0x23/0x60
[  595.440312]  ? mutex_lock+0x12/0x40
[  595.608969]  sysfs_kf_seq_show+0xa2/0x130
[  595.802897]  kernfs_seq_show+0x27/0x30
[  595.984857]  seq_read+0x12a/0x430
[  596.145125]  kernfs_fop_read+0x116/0x180
[  596.334749]  ? security_file_permission+0xa1/0xc0
[  596.562326]  __vfs_read+0x1b/0x40
[  596.722505]  vfs_read+0x96/0x130
[  596.874230]  SyS_read+0x55/0xc0
[  597.025935]  do_syscall_64+0x73/0x130
[  597.203036]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  597.447495] RIP: 0033:0x7f71ed58d260
[  597.620432] RSP: 002b:00007ffe54e32e48 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[  597.986899] RAX: ffffffffffffffda RBX: 0000000000be9880 RCX: 00007f71ed58d260
[  598.332523] RDX: 0000000000001000 RSI: 0000000000be9ab0 RDI: 0000000000000003
[  598.678088] RBP: 00007ffe54e33560 R08: 00007f71ed85b278 R09: 00007f71edc8d700
[  599.024604] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  599.370916] R13: ffffffffffffff98 R14: 00007f71ed85b420 R15: 0000000000be9880
[  600.818687] Shutting down cpus with NMI
[  601.004760] Kernel Offset: 0x1dc00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  601.533984] Rebooting in 10 seconds..



- kvaps

On Fri, Sep 28, 2018 at 4:06 PM Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
>
> On Thu, Sep 27, 2018 at 05:49:43PM +0200, kvaps wrote:
> > Hi, I found out that when I using 9.0.15-1 version of drbd kernel
> > module, I have softlockup problems quite often.
> >
> > For now I have many nodes with 9.0.14-1 and they are working fine.
> >
> > When I add new nodes with 9.0.15-1, then distribute old resources on
> > them, after a while I start having random reboots of the new nodes,
> > during synchronization.
> >
> > New nodes isn't using resources at all, kernel log shows, that it was
> > tainted on some random processes like systemd, sensors, etc.
> >
> > After downgrade to the version to 9.0.14-1, synchronization is finishing fine.
> >
> > I use 4.15.18-5-pve kernel, and I can provide my kernel stack traces
> > if you want to.
>
> Sure.
> otherwise this is a non-actionable "$something does not work" report.
>
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
>
> DRBD® and LINBIT® are registered trademarks of LINBIT
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user


More information about the drbd-user mailing list