Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

any suggestions how to proceed? My DRBD setup (2x Supermicro X10DRH, 4x Intel P3700, CentOS 7) reproducibly crashes the remote node when I run a zfs test on DRBD-mirrored NVMe volumes.

The test does not crash the remote node if:
* the remote DRBD resource is disconnected (and later reconnected)
* the DRBD resource is on SCSI disks
* the test runs on an NVMe volume without DRBD

Since the blk_mq code is quite new, is it possible that DRBD triggers a bug in blk_mq? Should I try a newer kernel?

Gerald

[ 390.030908] ------------[ cut here ]------------
[ 390.030938] kernel BUG at drivers/nvme/host/pci.c:467!
[ 390.030961] invalid opcode: 0000 [#1] SMP
[ 390.031550] CPU: 4 PID: 4105 Comm: drbd_r_test- Tainted: G OE ------------ 3.10.0-514.10.2.el7.x86_64 #1
[ 390.031591] Hardware name: Supermicro X10DRH/X10DRH-IT, BIOS 2.0a 06/30/2016
[ 390.031619] task: ffff883fdb66af10 ti: ffff881fe4e40000 task.ti: ffff881fe4e40000
[ 390.031649] RIP: 0010:[<ffffffffa0370fd8>] [<ffffffffa0370fd8>] nvme_queue_rq+0xa58/0xa70 [nvme]
[ 390.031693] RSP: 0018:ffff881fe4e43b38 EFLAGS: 00010286
[ 390.031715] RAX: 0000000000000000 RBX: 00000000ffffe800 RCX: 0000000000002600
[ 390.031743] RDX: 0000003ff3faa200 RSI: ffff883ff3faa200 RDI: 0000000000000246
[ 390.031770] RBP: ffff881fe4e43c10 R08: 0000000000001000 R09: 0000001ffdc33000
[ 390.031798] R10: 00000000fffff800 R11: ffff881ff17ed980 R12: ffff883ff3faa200
[ 390.031826] R13: 0000000000000001 R14: 0000000000001000 R15: 0000000000002600
[ 390.031854] FS: 0000000000000000(0000) GS:ffff881fffb00000(0000) knlGS:0000000000000000
[ 390.031885] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 390.031908] CR2: 00007f5c8ae99e00 CR3: 00000000019ba000 CR4: 00000000001407e0
[ 390.031936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 390.031964] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 390.031991] Stack:
[ 390.032002]  ffff883ffd423c00 ffff883ffa9d6b80 ffff88407ff02300 ffff881ff17eda20
[ 390.032037]  ffff883ff8821d40 ffff88407feb2100 0000001ffdc32000 ffff882000003600
[ 390.032072]  ffff883ffa8e63c0 0000000000001000 ffff881f00000200 ffff884000000001
[ 390.032107] Call Trace:
[ 390.032125]  [<ffffffff812f8a5a>] __blk_mq_run_hw_queue+0x1fa/0x3c0
[ 390.032153]  [<ffffffff812f8835>] blk_mq_run_hw_queue+0xa5/0xd0
[ 390.032178]  [<ffffffff812f9b3b>] blk_mq_insert_requests+0xcb/0x160
[ 390.032203]  [<ffffffff812fa89b>] blk_mq_flush_plug_list+0x13b/0x160
[ 390.032230]  [<ffffffff812f0059>] blk_flush_plug_list+0xc9/0x230
[ 390.032255]  [<ffffffff812f0574>] blk_finish_plug+0x14/0x40
[ 390.032288]  [<ffffffffa02dc158>] drbd_unplug_all_devices+0x38/0x50 [drbd]
[ 390.032320]  [<ffffffffa02dc58f>] receive_UnplugRemote+0x4f/0x70 [drbd]
[ 390.032352]  [<ffffffffa02eb210>] drbd_receiver+0x150/0x350 [drbd]
[ 390.032384]  [<ffffffffa02f6500>] ? drbd_destroy_connection+0x160/0x160 [drbd]
[ 390.032417]  [<ffffffffa02f651d>] drbd_thread_setup+0x1d/0x110 [drbd]
[ 390.032448]  [<ffffffffa02f6500>] ? drbd_destroy_connection+0x160/0x160 [drbd]
[ 390.032479]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 390.032501]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 390.032530]  [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[ 390.032553]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 390.032579] Code: 1f 44 00 00 e9 32 f9 ff ff 8b 73 28 f0 66 83 43 28 02 f6 43 2a 01 74 e5 48 8b 7d 80 e8 d5 41 00 00 eb da 41 bd 02 00 00 00 eb c8 <0f> 0b 4c 8b 0d 7f 38 66 e1 e9 33 ff ff ff 66 2e 0f 1f 84 00 00
[ 390.032746] RIP  [<ffffffffa0370fd8>] nvme_queue_rq+0xa58/0xa70 [nvme]
[ 390.032779]  RSP <ffff881fe4e43b38>

On 2017-04-05 18:46, Gerald Hochegger wrote:
> Hello,
>
> I (and some others) have problems with DRBD over NVMe devices (Intel
> P3700), at least under CentOS 7 (DRBD 8.4.9 and 9.0.6).
>
> The remote kernel crashes with:
> kernel BUG at drivers/nvme/host/pci.c:467
>
> It seems this bug is related to DRBD - running DRBD over
> SCSI disks does not trigger this crash.
>
> Details here:
> https://bugs.centos.org/view.php?id=13063
>
> Could you please look at this report.
>
> Gerald
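For reference, a minimal sketch of the disconnect/reconnect check described above, assuming a placeholder resource name "r0" (the real resource name is only visible in truncated form via the "drbd_r_test-" thread name in the oops):

  # "r0" is a placeholder resource name, not the actual name from this report
  drbdadm disconnect r0     # on the remote node: drop the replication link
  # ... run the zfs test on the NVMe-backed DRBD device on the local node ...
  drbdadm connect r0        # reconnect; DRBD resyncs the blocks written in between

With the peer disconnected like this the test completes without crashing the remote node, which is what points the suspicion at the replicated (blk_mq/NVMe) write path rather than at the test itself.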