Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed, Apr 19, 2017 at 07:00:50AM +0000, Hochegger, Gerald wrote:
> Hello,
>
> any suggestions how to proceed ?
>
> My DRBD setup
> (2x Supermicro X10DRH, 4x p3700, CentOS 7)
> crashes the remote node reproducibly when I run a
> zfs test on DRBD mirrored NVMe volumes.
>
> The test does not crash the remote node if:
> * the remote DRBD resource is disconnected
>   (and later reconnected)
> * the DRBD resource is on SCSI disks
> * the test is running on an NVMe volume without DRBD
>
> Since the blk_mq code is quite new, is it possible
> that drbd triggers a bug in blk_mq ?
>
> Should I try a newer kernel ?

This:
> [ 390.030938] kernel BUG at drivers/nvme/host/pci.c:467!
is an explicit "BUG" statement.

The only BUG statement in pci.c upstream (I did not yet check the RHEL
sources) is a BUG_ON(dma_len < 0) within nvme_setup_prps().
The driver does not like the alignment or size of some segment once it
is dispatched.

There is an upstream fix touching pci.c which may or may not be relevant:

    NVMe: default to 4k device page size
    ...
    This eventually trips the BUG_ON in nvme_setup_prps(), as we have
    a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000).
    ...

Maybe that fix is missing from your version of the driver, maybe it has
been fixed differently, or the fix is incomplete, or it is something
unrelated. I'm not that fluent yet in the nvme driver source and history.

It may also be something we need to do differently in DRBD, but since it
"works with all other backends", and we have some NVMes in the lab that
have not shown any such problems either, I have nothing less vague to
offer at this point.

--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT

__
please don't Cc me, but send to list -- I'm subscribed
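
[Editorial note: for illustration only, a minimal userspace sketch of the PRP
length accounting that the BUG_ON guards. This is not the kernel's
nvme_setup_prps(); the page_size, dma_len and dma_addr values are assumptions
chosen to match the example from the commit message quoted above (an 8K device
page size and a 0xF000 segment). It only shows how a segment length that is a
multiple of 4K but not of the device page size leaves a negative remainder,
which is the condition BUG_ON(dma_len < 0) catches in the real driver.]

#include <stdio.h>

int main(void)
{
	int page_size = 8192;                         /* assumed device page size (8K) */
	int dma_len = 0xF000;                         /* 60 KiB segment: multiple of 4K, not 8K */
	unsigned long long dma_addr = 0x100000000ULL; /* arbitrary, page-aligned start address */

	/* walk the segment in device-page-sized PRP entries */
	while (dma_len > 0) {
		printf("PRP entry at 0x%llx, %d bytes of the segment left\n",
		       dma_addr, dma_len);
		dma_addr += page_size;
		dma_len -= page_size;
	}

	/* 0xF000 is not a multiple of 0x2000, so the last subtraction overshoots */
	if (dma_len < 0)
		printf("dma_len ended at %d -> the kernel's BUG_ON(dma_len < 0) would fire\n",
		       dma_len);
	return 0;
}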