[DRBD-user] kernel BUG on new server with drbd 9.0.5

Laszlo Fiat laszlo.fiat at gmail.com
Thu Nov 24 15:14:13 CET 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

I have drbd 8.x in production for many years, without problems.
I have a single primary setup, SAS disks are in hardware RAID5,
partitions are assigned to drbd.
On top of drbd, there are virtual machines, which use drbd resources
as raw disks.

I migrate from an old pair of servers to a new pair of servers and to
drbd 9.0.5. It went well until last midnight.
Resource R5 was doing initial synchronization between the two new
servers, both running drbd 9.0.5.
Resource R4 probably received some read/write load on the primary
node, as it ran nightly backup.
The bug came on the server that was the target of the synchronisation,
and drbd secondary.

OS: Debian Jessie (stable) with Debian's stock kernel, and I compiled
drbd 9.0.5 myself from the tarball.

I'd like to know if this is a drbd bug, or a hardware issue?

Best regards,

Laszlo Fiat

Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.104776] drbd r4/0
drbd4: ASSERTION test_bit(BME_NO_WRITES, &bm_ext->flags) FAILED in
drbd_try_rs_begin_io
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.107260] ------------[
cut here ]------------
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.109802] kernel BUG at
/home/lanlaf/data/comp/drbd-9.0.5-1/drbd/lru_cache.c:571!
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.112623] invalid
opcode: 0000 [#1] SMP
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.115433] Modules linked
in: drbd_transport_tcp(O) drbd(O) libcrc32c crc32c_generic bridge stp
llc nls_utf8 nls_cp437 x86_pkg_temp_thermal vfat fat coretemp
kvm_intel kvm ttm drm_kms_helper crc32_pclmul drm hid_generic
aesni_intel aes_x86_64 cdc_ether usbnet mii lrw usbhid i2c_algo_bit
gf128mul hid efi_pstore glue_helper ablk_helper efivars evdev
ipmi_devintf iTCO_wdt iTCO_vendor_support pcspkr acpi_power_meter
shpchp lpc_ich mxm_wmi i2c_i801 i2c_core mfd_core cryptd processor
tpm_tis ipmi_si thermal_sys ipmi_msghandler tpm acpi_pad button mei_me
mei wmi autofs4 ext4 crc16 mbcache jbd2 sd_mod sg crc_t10dif ses
crct10dif_generic enclosure crct10dif_pclmul crct10dif_common
crc32c_intel ehci_pci tg3 ehci_hcd xhci_hcd ptp pps_core megaraid_sas
libphy usbcore usb_common scsi_mod
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.144663] CPU: 5 PID:
21081 Comm: drbd_s_r4 Tainted: G           O  3.16.0-4-amd64 #1 Debian
3.16.36-1+deb8u2
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.149998] Hardware name:
LENOVO System x3650 M5: -[8871AC1]-/01GR174, BIOS -[TCE126O-2.20]-
09/07/2016
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.155584] task:
ffff88103c374ba0 ti: ffff88100da60000 task.ti: ffff88100da60000
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.161358] RIP:
0010:[<ffffffffa052a224>]  [<ffffffffa052a224>] lc_put+0x94/0xa0
[drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.167336] RSP:
0018:ffff88100da63d50  EFLAGS: 00010046
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.173415] RAX:
0000000000000000 RBX: ffff88103fb79c00 RCX: 0000000000000000
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.179677] RDX:
0000000000000036 RSI: ffff881034600410 RDI: ffff881078679880
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.186011] RBP:
ffff881033c40800 R08: 0000000000000000 R09: 0000000000148b54
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.192478] R10:
0000000000000080 R11: 0000000000008000 R12: ffff881034600410
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.199015] R13:
0000000000000709 R14: ffff88103fb79c00 R15: ffff88103445f8d8
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.212529] FS:
0000000000000000(0000) GS:ffff88107f140000(0000)
knlGS:0000000000000000
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.242192] CS:  0010 DS:
0000 ES: 0000 CR0: 0000000080050033
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.249873] CR2:
00000000009f8018 CR3: 0000000001813000 CR4: 00000000003407e0
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.257766] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.265696] DR3:
0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.273669] Stack:
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.281628]
ffffffffa05288ee 000000001e387285 0038728500000000 ffff881033c40800
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.289980]
00000000038491d6 000000001c248eb0 ffff88103445f450 ffff88103fb79c00
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.298442]
ffff88103445f8d8 ffffffffa050852a ffff88103445f000 ffffffff0000009c
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.402492] Call Trace:
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.411026]
[<ffffffffa05288ee>] ? drbd_try_rs_begin_io+0x9e/0x4f0 [drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.419798]
[<ffffffffa050852a>] ? w_resync_timer+0x36a/0xed0 [drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.428472]
[<ffffffffa050b6ca>] ? drbd_sender+0x16a/0x3d0 [drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.436952]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.445337]
[<ffffffffa052a5ff>] ? drbd_thread_setup+0x5f/0x110 [drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.453625]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.461742]
[<ffffffff810894bd>] ? kthread+0xbd/0xe0
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.469599]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.477281]
[<ffffffff815184d8>] ? ret_from_fork+0x58/0x90
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.484697]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.491970] Code: 89 56 10
48 89 7e 18 48 89 07 83 6f 64 01 f0 80 a7 90 00 00 00 f7 f0 80 a7 90
00 00 00 fe 8b 46 20 c3 0f 1f
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.506695] 44 00 00 0f 0b
0f 0b <0f> 0b 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 39 77 58 76
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.520902] RIP
[<ffffffffa052a224>] lc_put+0x94/0xa0 [drbd]
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.527910]  RSP <ffff88100da63d50>
Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.552188] ---[ end trace
b15348e3853c93f5 ]---

A bit later:

Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.793884] INFO:
rcu_sched detected stalls on CPUs/tasks: { 5} (detected by 14, t=5252
jiffies, g=5339142, c=5339141, q=820)
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.803094] sending NMI to all CPUs:
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.807792] NMI backtrace for cpu 0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.812351] CPU: 0 PID: 0
Comm: swapper/0 Tainted: G      D    O  3.16.0-4-amd64 #1 Debian
3.16.36-1+deb8u2
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.817202] Hardware name:
LENOVO System x3650 M5: -[8871AC1]-/01GR174, BIOS -[TCE126O-2.20]-
09/07/2016
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.822109] task:
ffffffff8181a460 ti: ffffffff81800000 task.ti: ffffffff81800000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.827186] RIP:
0010:[<ffffffff8101c468>]  [<ffffffff8101c468>]
native_read_tsc+0x8/0x10
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.832482] RSP:
0018:ffff88107f003ef8  EFLAGS: 00000006
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.837871] RAX:
000000002d253f14 RBX: ffffffff8181f7c0 RCX: 0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.843473] RDX:
000626c900000000 RSI: 0000000000000000 RDI: ffffffff8181f7c0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.849194] RBP:
000000000009cf09 R08: 0000000000000611 R09: 0000000000000100
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.855044] R10:
0000000000006f1e R11: 00000000f2660034 R12: 000000000579c512
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.860964] R13:
0000000000000044 R14: 0000000000000000 R15: ffff88107f00db00
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.866925] FS:
0000000000000000(0000) GS:ffff88107f000000(0000)
knlGS:0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.873065] CS:  0010 DS:
0000 ES: 0000 CR0: 0000000080050033
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.879282] CR2:
00007fe669b64028 CR3: 0000000001813000 CR4: 00000000003407f0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.885663] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.892136] DR3:
0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.898607] Stack:
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.905095]
ffffffff8101c4a5 ffffffff810c96be 0000000000000000 000000000000005d
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.911966]
0000000000000044 ffffffff810d1f13 ffffffff818e2ee0 0000000109930a80
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.918941]
ffffffff81803fd8 0020200000000009 ffffffff81803e28 000000000000005d
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.926020] Call Trace:
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.933107]  <IRQ>
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.933179]
[<ffffffff8101c4a5>] ? read_tsc+0x5/0x20
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.947512]
[<ffffffff810c96be>] ? ktime_get+0x3e/0xc0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.954874]
[<ffffffff810d1f13>] ? __tick_nohz_idle_enter+0x23/0x4a0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.962424]
[<ffffffff8106dcd8>] ? irq_exit+0x88/0xa0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.969975]
[<ffffffff8151b3a2>] ? do_IRQ+0x52/0xe0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.977580]
[<ffffffff815191ed>] ? common_interrupt+0x6d/0x6d
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.985315]  <EOI>
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643335.985387]
[<ffffffff813e33d2>] ? cpuidle_enter_state+0x52/0xc0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.000600]
[<ffffffff813e33c8>] ? cpuidle_enter_state+0x48/0xc0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.008295]
[<ffffffff810a9af8>] ? cpu_startup_entry+0x328/0x470
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.015866]
[<ffffffff81903076>] ? start_kernel+0x497/0x4a2
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.023429]
[<ffffffff81902a04>] ? set_init_arg+0x4e/0x4e
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.030788]
[<ffffffff81902120>] ? early_idt_handler_array+0x120/0x120
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.038127]
[<ffffffff8190271f>] ? x86_64_start_kernel+0x14d/0x15c
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.045424] Code: 90 00 00
00 00 00 01 00 e9 4d ff ff ff e9 54 ff ff ff 90 0f 1f 44 00 00 f3 c3
66 0f 1f 84 00 00 00 00 00 0f 31 48 c1 e2 20 89 c0 <48> 09 c2 48 89 d0
c3 90 8b 05 2e 3c 8c 00 c3 66 0f 1f 84 00 00
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.061237] NMI backtrace for cpu 4
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.061243] INFO: NMI
handler (arch_trigger_all_cpu_backtrace_handler) took too long to run:
253.445 msecs
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.076625] CPU: 4 PID:
18067 Comm: drbd_r_r3 Tainted: G      D    O  3.16.0-4-amd64 #1 Debian
3.16.36-1+deb8u2
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.084590] Hardware name:
LENOVO System x3650 M5: -[8871AC1]-/01GR174, BIOS -[TCE126O-2.20]-
09/07/2016
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.092801] task:
ffff88103b8e1630 ti: ffff88100decc000 task.ti: ffff88100decc000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.101013] RIP:
0010:[<ffffffff8118fa42>]  [<ffffffff8118fa42>]
kmem_cache_alloc_node+0x32/0x4c0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.109409] RSP:
0018:ffff88100decfa40  EFLAGS: 00000246
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.117568] RAX:
0000000000000020 RBX: 00000000ffffffff RCX: 00000000ffffffff
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.125655] RDX:
00000000ffffffff RSI: 0000000000000020 RDI: ffff881078b772c0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.133750] RBP:
0000000000000020 R08: 00000000badf0c67 R09: 00000000000010f8
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.141900] R10:
ffff88103e9aa0d0 R11: 0000000000001000 R12: 0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.149908] R13:
0000000000000000 R14: ffff881078b772c0 R15: 00000000ffffffff
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.157777] FS:
0000000000000000(0000) GS:ffff88107f100000(0000)
knlGS:0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.165559] CS:  0010 DS:
0000 ES: 0000 CR0: 0000000080050033
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.173261] CR2:
00007f24eb56a000 CR3: 0000000001813000 CR4: 00000000003407e0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.180977] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.188588] DR3:
0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.196142] Stack:
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.203575]
ffffffff81472779 000054f800000000 00ff88103e9aa040 ffff88103e44ec00
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.211354]
ffff88103e9aa040 ffff88103ea76c40 ffff880ff03e8822 ffff880ff03e880e
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.219194]
ffffffff812bed8b ffff88103e9aa040 0000000000000020 0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.227004] Call Trace:
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.234594]
[<ffffffff81472779>] ? tcp_rcv_established+0x3e9/0x6c0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.242302]
[<ffffffff812bed8b>] ? memcpy_toiovec+0x4b/0x80
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.250006]
[<ffffffff81412428>] ? __alloc_skb+0x48/0x2a0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.257765]
[<ffffffff814785b5>] ? tcp_send_ack+0x35/0x120
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.265426]
[<ffffffff81469e4d>] ? tcp_recvmsg+0x71d/0xc30
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.272962]
[<ffffffff8140e5f6>] ? release_sock+0xe6/0x170
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.280454]
[<ffffffff8149215a>] ? inet_recvmsg+0x6a/0x80
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.287866]
[<ffffffff81409e4a>] ? sock_recvmsg+0x9a/0xd0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.295142]
[<ffffffffa050ea8c>] ? __drbd_alloc_pages+0xdc/0x170 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.302501]
[<ffffffff81409eb5>] ? kernel_recvmsg+0x35/0x40
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.309885]
[<ffffffffa03a05a2>] ? dtt_recv_short+0x52/0x60 [drbd_transport_tcp]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.317378]
[<ffffffffa03a07e5>] ? dtt_recv_pages+0xb5/0x140 [drbd_transport_tcp]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.324873]
[<ffffffffa0516ffe>] ? read_in_block+0x17e/0x380 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.332336]
[<ffffffffa0518691>] ? receive_RSDataReply+0x101/0x550 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.339914]
[<ffffffffa051f75f>] ? drbd_receiver+0x39f/0x650 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.347509]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.355010]
[<ffffffffa052a5ff>] ? drbd_thread_setup+0x5f/0x110 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.362488]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.370011]
[<ffffffff810894bd>] ? kthread+0xbd/0xe0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.377531]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.385069]
[<ffffffff815184d8>] ? ret_from_fork+0x58/0x90
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.392627]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.400206] Code: 41 56 49
89 fe 41 55 41 54 55 89 f5 53 89 d3 48 83 ec 48 8b 05 00 58 75 00 65
44 8b 2c 25 28 ef 00 00 44 89 6c 24 08 21 f0 a8 10 <89> 44 24 14 74 05
e8 c3 56 38 00 9c 58 0f 1f 44 00 00 48 89 44
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.416391] NMI backtrace for cpu 5
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.416397] INFO: NMI
handler (arch_trigger_all_cpu_backtrace_handler) took too long to run:
608.598 msecs
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.431820] CPU: 5 PID:
21091 Comm: drbd_a_r4 Tainted: G      D    O  3.16.0-4-amd64 #1 Debian
3.16.36-1+deb8u2
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.439675] Hardware name:
LENOVO System x3650 M5: -[8871AC1]-/01GR174, BIOS -[TCE126O-2.20]-
09/07/2016
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.455294] task:
ffff881034630350 ti: ffff88103d6b0000 task.ti: ffff88103d6b0000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.462996] RIP:
0010:[<ffffffff8151808c>]  [<ffffffff8151808c>]
_raw_spin_lock_irqsave+0x3c/0x50
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.470732] RSP:
0018:ffff88103d6b3d70  EFLAGS: 00000002
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.478367] RAX:
000000000000e572 RBX: 000000000003dc6d RCX: 000000000000e572
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.486038] RDX:
000000000000e571 RSI: 0000000000000246 RDI: ffff88103fb79f20
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.493724] RBP:
0000000000000000 R08: ffff88107808f800 R09: 0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.501400] R10:
0000000000000000 R11: 0000000000001ee3 R12: ffff88103fb79c00
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.509111] R13:
000000000003dc6d R14: 0000000000000001 R15: ffff88103fb79f20
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.516747] FS:
0000000000000000(0000) GS:ffff88107f140000(0000)
knlGS:0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.524429] CS:  0010 DS:
0000 ES: 0000 CR0: 0000000080050033
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.532021] CR2:
00000000009f8018 CR3: 0000000001813000 CR4: 00000000003407e0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.539632] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.547249] DR3:
0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.554847] Stack:
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.562301]
ffffffffa052575a ffff88103fb79c08 ffff88103a6089f0 ffff88103fb79c00
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.569936]
ffff88103d6b3d98 0000000000000003 ffff88103d6b3db8 00000000ffffffff
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.577575]
ffffffffa0516ba4 ffff88103a608a10 ffff88103a608a10 0000000000000020
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.585205] Call Trace:
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.592672]
[<ffffffffa052575a>] ? put_actlog+0x3a/0x120 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.600226]
[<ffffffffa0516ba4>] ? got_peer_ack+0x194/0x370 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.607796]
[<ffffffffa051fcc4>] ? drbd_ack_receiver+0x194/0x580 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.615409]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.623016]
[<ffffffffa052a5ff>] ? drbd_thread_setup+0x5f/0x110 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.630714]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.638393]
[<ffffffff810894bd>] ? kthread+0xbd/0xe0
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.646091]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.653875]
[<ffffffff815184d8>] ? ret_from_fork+0x58/0x90
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.661637]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:13:25 hufot-s-apl0004 kernel: [643336.669407] Code: 0f 1f 44
00 00 ba 00 00 01 00 f0 0f c1 17 89 d1 c1 e9 10 66 39 d1 89 c8 75 04
48 89 f0 c3 0f b7 17 66 39 d1 74 f4 f3 90 0f b7 17 <66> 39 d0 75 f6 eb
e8 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f

[...]

Again a bit later:

Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013441] ------------[
cut here ]------------
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013485] WARNING: CPU:
5 PID: 21091 at
/build/linux-35gxh9/linux-3.16.36/kernel/watchdog.c:265
watchdog_overflow_callback+0x98/0xc0()
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013486] Watchdog
detected hard LOCKUP on cpu 5
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013857] Modules linked
in: drbd_transport_tcp(O) drbd(O) libcrc32c crc32c_generic bridge stp
llc nls_utf8 nls_cp437 x86_pkg_temp_thermal vfat fat coretemp
kvm_intel kvm ttm drm_kms_helper crc32_pclmul drm hid_generic
aesni_intel aes_x86_64 cdc_ether usbnet mii lrw usbhid i2c_algo_bit
gf128mul hid efi_pstore glue_helper ablk_helper efivars evdev
ipmi_devintf iTCO_wdt iTCO_vendor_support pcspkr acpi_power_meter
shpchp lpc_ich mxm_wmi i2c_i801 i2c_core mfd_core cryptd processor
tpm_tis ipmi_si thermal_sys ipmi_msghandler tpm acpi_pad button mei_me
mei wmi autofs4 ext4 crc16 mbcache jbd2 sd_mod sg crc_t10dif ses
crct10dif_generic enclosure crct10dif_pclmul crct10dif_common
crc32c_intel ehci_pci tg3 ehci_hcd xhci_hcd ptp pps_core megaraid_sas
libphy usbcore usb_common scsi_mod
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013865] CPU: 5 PID:
21091 Comm: drbd_a_r4 Tainted: G      D    O  3.16.0-4-amd64 #1 Debian
3.16.36-1+deb8u2
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013870] Hardware name:
LENOVO System x3650 M5: -[8871AC1]-/01GR174, BIOS -[TCE126O-2.20]-
09/07/2016
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013873]
0000000000000000 ffffffff815123b5 ffff88107f145c98 0000000000000009
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013875]
ffffffff81068817 ffff88107810fc00 ffff88107f145ce8 ffff88107f145d88
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013877]
0000000000000000 ffff88107f145ef8 ffffffff8106887c ffffffff8171d440
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013881] Call Trace:
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013913]  <NMI>
[<ffffffff815123b5>] ? dump_stack+0x5d/0x78
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013932]
[<ffffffff81068817>] ? warn_slowpath_common+0x77/0x90
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013935]
[<ffffffff8106887c>] ? warn_slowpath_fmt+0x4c/0x50
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013953]
[<ffffffff8108d7a6>] ? atomic_notifier_call_chain+0x16/0x20
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013990]
[<ffffffff8137df2c>] ? vt_console_print+0x23c/0x3f0
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.013996]
[<ffffffff810feeb8>] ? watchdog_overflow_callback+0x98/0xc0
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014049]
[<ffffffff81136557>] ? __perf_event_overflow+0x87/0x230
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014076]
[<ffffffff810309da>] ? intel_pmu_handle_irq+0x1ea/0x400
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014107]
[<ffffffff810285d6>] ? perf_event_nmi_handler+0x26/0x40
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014114]
[<ffffffff8101797b>] ? nmi_handle+0x8b/0x130
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014116]
[<ffffffff81017f2d>] ? default_do_nmi+0xbd/0x110
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014118]
[<ffffffff81018074>] ? do_nmi+0xf4/0x170
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014122]
[<ffffffff8151a98f>] ? end_repeat_nmi+0x1a/0x1e
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014128]
[<ffffffff8151808c>] ? _raw_spin_lock_irqsave+0x3c/0x50
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014131]
[<ffffffff8151808c>] ? _raw_spin_lock_irqsave+0x3c/0x50
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014134]
[<ffffffff8151808c>] ? _raw_spin_lock_irqsave+0x3c/0x50
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014188]  <<EOE>>
[<ffffffffa052575a>] ? put_actlog+0x3a/0x120 [drbd]
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014229]
[<ffffffffa0516ba4>] ? got_peer_ack+0x194/0x370 [drbd]
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014242]
[<ffffffffa051fcc4>] ? drbd_ack_receiver+0x194/0x580 [drbd]
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014253]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014267]
[<ffffffffa052a5ff>] ? drbd_thread_setup+0x5f/0x110 [drbd]
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014305]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014334]
[<ffffffff810894bd>] ? kthread+0xbd/0xe0
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014337]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014341]
[<ffffffff815184d8>] ? ret_from_fork+0x58/0x90
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014344]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 00:14:08 hufot-s-apl0004 kernel: [643380.014345] ---[ end trace
b15348e3853c93f6 ]---

The NMIs repeated many times during the night.

This task (drbd_a_r4) stayed with us until reboot, and kept 1 core at
100% CPU Load:

Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.871069] INFO: rcu_bh
detected stalls on CPUs/tasks: { 5} (detected by 15, t=4779280
jiffies, g=2146, c=2145, q=28)
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.879837] sending NMI to all CPUs:
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.888342] NMI backtrace for cpu 5
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.896579] CPU: 5 PID:
21091 Comm: drbd_a_r4 Tainted: G      D W  O  3.16.0-4-amd64 #1 Debian
3.16.36-1+deb8u2
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.905006] Hardware name:
LENOVO System x3650 M5: -[8871AC1]-/01GR174, BIOS -[TCE126O-2.20]-
09/07/2016
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.913440] task:
ffff881034630350 ti: ffff88103d6b0000 task.ti: ffff88103d6b0000
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.921915] RIP:
0010:[<ffffffff81518089>]  [<ffffffff81518089>]
_raw_spin_lock_irqsave+0x39/0x50
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.930637] RSP:
0018:ffff88103d6b3d70  EFLAGS: 00000002
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.939339] RAX:
000000000000e572 RBX: 000000000003dc6d RCX: 000000000000e572
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.948135] RDX:
000000000000e571 RSI: 0000000000000246 RDI: ffff88103fb79f20
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.956991] RBP:
0000000000000000 R08: ffff88107808f800 R09: 0000000000000000
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.965731] R10:
0000000000000000 R11: 0000000000001ee3 R12: ffff88103fb79c00
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.974445] R13:
000000000003dc6d R14: 0000000000000001 R15: ffff88103fb79f20
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.983031] FS:
0000000000000000(0000) GS:ffff88107f140000(0000)
knlGS:0000000000000000
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679825.991756] CS:  0010 DS:
0000 ES: 0000 CR0: 0000000080050033
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.000538] CR2:
00000000009f8018 CR3: 0000000001813000 CR4: 00000000003407e0
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.009349] DR0:
0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.017999] DR3:
0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.026569] Stack:
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.035029]
ffffffffa052575a ffff88103fb79c08 ffff88103a6089f0 ffff88103fb79c00
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.043686]
ffff88103d6b3d98 0000000000000003 ffff88103d6b3db8 00000000ffffffff
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.052293]
ffffffffa0516ba4 ffff88103a608a10 ffff88103a608a10 0000000000000020
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.060911] Call Trace:
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.069431]
[<ffffffffa052575a>] ? put_actlog+0x3a/0x120 [drbd]
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.078108]
[<ffffffffa0516ba4>] ? got_peer_ack+0x194/0x370 [drbd]
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.086639]
[<ffffffffa051fcc4>] ? drbd_ack_receiver+0x194/0x580 [drbd]
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.095024]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.103277]
[<ffffffffa052a5ff>] ? drbd_thread_setup+0x5f/0x110 [drbd]
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.111537]
[<ffffffffa052a5a0>] ? w_complete+0x20/0x20 [drbd]
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.119760]
[<ffffffff810894bd>] ? kthread+0xbd/0xe0
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.127920]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.136149]
[<ffffffff815184d8>] ? ret_from_fork+0x58/0x90
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.144249]
[<ffffffff81089400>] ? kthread_create_on_node+0x180/0x180
Nov 24 10:21:05 hufot-s-apl0004 kernel: [679826.152327] Code: c6 fa 66
0f 1f 44 00 00 ba 00 00 01 00 f0 0f c1 17 89 d1 c1 e9 10 66 39 d1 89
c8 75 04 48 89 f0 c3 0f b7 17 66 39 d1 74 f4 f3 90 <0f> b7 17 66 39 d0
75 f6 eb e8 66 66 66 66 2e 0f 1f 84 00 00 00

The secondary was available on the network, but it was sluggish to work with.

When I tried to reboot next morning, the reboot stalled, it couldn't
deconfigure the network, so I had to reboot using the power button...



More information about the drbd-user mailing list