Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

any suggestions how to proceed? My DRBD setup (2x Supermicro X10DRH, 4x Intel P3700, CentOS 7) reproducibly crashes the remote node when I run a zfs test on DRBD-mirrored NVMe volumes.

The test does not crash the remote node if:
* the remote DRBD resource is disconnected (and later reconnected)
* the DRBD resource is on SCSI disks
* the test runs on an NVMe volume without DRBD

Since the blk_mq code is quite new, is it possible that DRBD triggers a bug in blk_mq? Should I try a newer kernel?

Gerald

[ 390.030908] ------------[ cut here ]------------
[ 390.030938] kernel BUG at drivers/nvme/host/pci.c:467!
[ 390.030961] invalid opcode: 0000 [#1] SMP
[ 390.031550] CPU: 4 PID: 4105 Comm: drbd_r_test- Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[ 390.031591] Hardware name: Supermicro X10DRH/X10DRH-IT, BIOS 2.0a 06/30/2016
[ 390.031619] task: ffff883fdb66af10 ti: ffff881fe4e40000 task.ti: ffff881fe4e40000
[ 390.031649] RIP: 0010:[<ffffffffa0370fd8>] [<ffffffffa0370fd8>] nvme_queue_rq+0xa58/0xa70 [nvme]
[ 390.031693] RSP: 0018:ffff881fe4e43b38 EFLAGS: 00010286
[ 390.031715] RAX: 0000000000000000 RBX: 00000000ffffe800 RCX: 0000000000002600
[ 390.031743] RDX: 0000003ff3faa200 RSI: ffff883ff3faa200 RDI: 0000000000000246
[ 390.031770] RBP: ffff881fe4e43c10 R08: 0000000000001000 R09: 0000001ffdc33000
[ 390.031798] R10: 00000000fffff800 R11: ffff881ff17ed980 R12: ffff883ff3faa200
[ 390.031826] R13: 0000000000000001 R14: 0000000000001000 R15: 0000000000002600
[ 390.031854] FS: 0000000000000000(0000) GS:ffff881fffb00000(0000) knlGS:0000000000000000
[ 390.031885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 390.031908] CR2: 00007f5c8ae99e00 CR3: 00000000019ba000 CR4: 00000000001407e0
[ 390.031936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 390.031964] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 390.031991] Stack:
[ 390.032002]  ffff883ffd423c00 ffff883ffa9d6b80 ffff88407ff02300 ffff881ff17eda20
[ 390.032037]  ffff883ff8821d40 ffff88407feb2100 0000001ffdc32000 ffff882000003600
[ 390.032072]  ffff883ffa8e63c0 0000000000001000 ffff881f00000200 ffff884000000001
[ 390.032107] Call Trace:
[ 390.032125]  [<ffffffff812f8a5a>] __blk_mq_run_hw_queue+0x1fa/0x3c0
[ 390.032153]  [<ffffffff812f8835>] blk_mq_run_hw_queue+0xa5/0xd0
[ 390.032178]  [<ffffffff812f9b3b>] blk_mq_insert_requests+0xcb/0x160
[ 390.032203]  [<ffffffff812fa89b>] blk_mq_flush_plug_list+0x13b/0x160
[ 390.032230]  [<ffffffff812f0059>] blk_flush_plug_list+0xc9/0x230
[ 390.032255]  [<ffffffff812f0574>] blk_finish_plug+0x14/0x40
[ 390.032288]  [<ffffffffa02dc158>] drbd_unplug_all_devices+0x38/0x50 [drbd]
[ 390.032320]  [<ffffffffa02dc58f>] receive_UnplugRemote+0x4f/0x70 [drbd]
[ 390.032352]  [<ffffffffa02eb210>] drbd_receiver+0x150/0x350 [drbd]
[ 390.032384]  [<ffffffffa02f6500>] ? drbd_destroy_connection+0x160/0x160 [drbd]
[ 390.032417]  [<ffffffffa02f651d>] drbd_thread_setup+0x1d/0x110 [drbd]
[ 390.032448]  [<ffffffffa02f6500>] ? drbd_destroy_connection+0x160/0x160 [drbd]
[ 390.032479]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 390.032501]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 390.032530]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 390.032553]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 390.032579] Code: 1f 44 00 00 e9 32 f9 ff ff 8b 73 28 f0 66 83 43 28 02 f6 43 2a 01 74 e5 48 8b 7d 80 e8 d5 41 00 00 eb da 41 bd 02 00 00 00 eb c8 <0f> 0b 4c 8b 0d 7f 38 66 e1 e9 33 ff ff ff 66 2e 0f 1f 84 00 00
[ 390.032746] RIP  [<ffffffffa0370fd8>] nvme_queue_rq+0xa58/0xa70 [nvme]
[ 390.032779]  RSP <ffff881fe4e43b38>

On 2017-04-05 18:46, Gerald Hochegger wrote:
> Hello,
>
> I (and some others) have problems with DRBD over NVMe devices (Intel
> P3700), at least under CentOS 7 (DRBD 8.4.9 and 9.0.6).
>
> The remote kernel crashes with:
> kernel BUG at drivers/nvme/host/pci.c:467
>
> It seems this bug is related to DRBD - running DRBD over
> SCSI disks does not trigger this crash.
>
> Details here:
> https://bugs.centos.org/view.php?id=13063
>
> Could you please look at this report.
>
> Gerald
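For reference, a minimal sketch of the disconnect/reconnect check described above, assuming a placeholder resource name "r0" (the real resource name is only visible in truncated form via the "drbd_r_test-" thread name in the oops):

  # "r0" is a placeholder resource name, not the actual name from this report
  drbdadm disconnect r0     # on the remote node: drop the replication link
  # ... run the zfs test on the NVMe-backed DRBD device on the local node ...
  drbdadm connect r0        # reconnect; DRBD resyncs the blocks written in between

With the peer disconnected like this the test completes without crashing the remote node, which is what points the suspicion at the replicated (blk_mq/NVMe) write path rather than at the test itself.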