[DRBD-user] DRBD over NVME - kernel BUG at drivers/nvme/host/pci.c:467

Hochegger, Gerald gerald.hochegger at aau.at
Tue May 2 22:32:39 CEST 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 2017-04-24 14:55, Lars Ellenberg wrote:
> On Wed, Apr 19, 2017 at 07:00:50AM +0000, Hochegger, Gerald wrote:
>> Hello,
>>
>> Any suggestions on how to proceed?
>>
>> My DRBD setup
>> (2x Supermicro X10DRH, 4x p3700, CentOS 7)
>> crashes the remote node reproducibly when I run a
>> ZFS test on DRBD-mirrored NVMe volumes.
>>
>> The test does not crash the remote node if:
>> * the remote DRBD resource is disconnected
>>   (and later reconnected)
>> * the DRBD resource is on SCSI disks
>> * the test runs on an NVMe volume without DRBD
>>
>> Since the blk_mq code is quite new, is it possible
>> that drbd triggers a bug in blk_mq?
>>
>> Should I try a newer kernel?
>
> This:
>> [  390.030938] kernel BUG at drivers/nvme/host/pci.c:467!
>
> is an explicit "BUG" statement.
>
> The only BUG statement in pci.c in upstream (I did not yet check the
> RHEL sources) is a BUG_ON(dma_len < 0) within nvme_setup_prps().
>
> The driver does not like the alignment or size of some segment once the request is dispatched.
>
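For what it's worth, the accounting that ends in that BUG_ON can be
replayed in userspace. The snippet below is not the driver code, only a
toy model of the PRP bookkeeping as I understand the upstream loop: one
PRP entry per device page, dma_len reduced by the device page size for
each entry, and a move to the next scatterlist segment only when dma_len
is exactly 0. The segment sizes and the 8K device page size are made up
for illustration.

#include <stdio.h>

int main(void)
{
    long seg_len[] = { 0xF000, 0x11000 };      /* hypothetical sg segments  */
    int  seg = 0;

    long page_size = 0x2000;                   /* assumed 8K device pages   */
    long length    = seg_len[0] + seg_len[1];  /* total I/O length          */
    long dma_len   = seg_len[seg];             /* bytes left in the segment */

    for (;;) {
        /* the real driver stores a PRP entry for the current address here */
        dma_len -= page_size;
        length  -= page_size;
        if (length <= 0)
            break;                             /* whole request mapped      */
        if (dma_len > 0)
            continue;                          /* stay in this segment      */
        if (dma_len < 0) {
            /* the condition the driver turns into BUG_ON(dma_len < 0)      */
            printf("segment %d leaves dma_len = %ld -> BUG\n", seg, dma_len);
            return 1;
        }
        dma_len = seg_len[++seg];              /* dma_len == 0: next segment */
    }
    printf("request mapped cleanly\n");
    return 0;
}

Run as-is it reports dma_len = -4096 for the first segment, because
0xF000 is not a multiple of the assumed 0x2000 device page size.
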
> There is an upstream fix touching pci.c, which may or may not be relevant
>   NVMe: default to 4k device page size
>   ...
>   This eventually trips the BUG_ON in nvme_setup_prps(), as we have a
>   'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000).
>   ...
>
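That matches the arithmetic: with an 8K device page size a 0xF000
segment leaves 0x1000 bytes after seven PRP entries and the next 8K step
drives dma_len negative, while with the patched 4K default the same
segment is consumed in fifteen exact steps (0xF000 is just the number
from the commit message, not something I have confirmed on the p3700):

  0xF000 - 7*0x2000  = 0x1000, next step -> -0x1000  -> BUG_ON
  0xF000 - 15*0x1000 = 0                             -> next segment
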
> Maybe that is missing from your version of the driver,
> maybe it has been fixed differently, or the fix is incomplete,
> or it is something unrelated.
> I'm not that fluent yet in the nvme driver source and history.
>
> It may also be something we need to do differently in DRBD,
> but since it "works with all other backends", and we have some
> NVMe devices in the lab that have not shown any such problem yet either,
> I have nothing less vague to offer at this point.
>

Thanks for the answer - I'll try newer kernels, but I do not
have access to those systems at the moment - they are already in
production (we mirror inside the VMs instead of using DRBD in the backend).

Will test again in the next maintenance window.

Gerald


