[DRBD-user] DRBD 8.4 two-node primary locks up, did not send a P_BARRIER
Sven-Erik Neve
sven-erik.neve at solactive.com
Tue Nov 23 17:49:41 CET 2021
Hi all,
at work my team and I are facing a DRBD 8.4 two-node cluster where the
primary node seemingly randomly locks up, thereby preventing access to
its data.
When this happens dmesg shows entries such as this one coming from DRBD:
We did not send a P_BARRIER for 5118944ms > ko-count (7) * timeout (60 *
0.1s); drbd kernel thread blocked?
At the same time DRBD commands such as 'drbdadm secondary' no longer
return and simply hang, and reading '/proc/drbd' hangs as well.
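In case it helps narrow things down: next time this happens we plan to
grab the kernel stacks of the DRBD threads roughly as follows (we
haven't verified this on the affected node yet, so treat it as a
sketch rather than something we've already captured):

    # any DRBD kernel threads stuck in uninterruptible sleep (D state)?
    ps axo pid,stat,comm | grep drbd
    # dump the kernel stack of a suspect thread (as root)
    cat /proc/<pid>/stack
    # or log the stacks of all blocked tasks to the kernel log (as root)
    echo w > /proc/sysrq-trigger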
This is happening on a Debian 10 Xen virtual machine (via XCP-ng). The
installed 'drbd-utils' Debian package is version 9.5.0-1. The 'drbd.ko'
module is version 8.4.10. Kernel is 4.19.208-1 installed via package
'linux-image-4.19.0-18-amd64'. The config as shown via 'drbdadm dump' is
available at Pastebin: https://pastebin.com/raw/b122wQU9.
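For quick reference, the two values quoted in the log message map to
the usual net-section options; in 'drbdadm dump' syntax that would look
roughly like this (a sketch based on the log values, not a verbatim
excerpt of our dump):

    resource r0 {            # resource name is illustrative
        net {
            ko-count 7;
            timeout  60;     # in tenths of a second, i.e. 6 s
        }
    }

That would put the watchdog threshold at 7 * 6 s = 42 s, while the
message reports 5118944 ms, i.e. roughly 85 minutes without a P_BARRIER.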
The DRBD device serves as the backing block device for a ZFS zpool;
ZFS itself is version '2.0.3-9~bpo10+1'.
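The layering itself is minimal, essentially just the pool sitting
directly on the DRBD device (device and pool names here are
illustrative, not our actual ones):

    zpool create tank /dev/drbd0
    zpool status tank    # shows the drbd device as the pool's only vdev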
Systems monitoring suggests that the issue occurs when disk load,
measured as I/O wait time, is higher than usual. Since we've only seen
this situation twice so far, that's not much of a pattern yet. Despite
disk load seemingly being a factor, none of the other virtual machine
tenants on the same hypervisor and disk array are affected. The
underlying storage is an SSD-based RAID 10 array of four disks, none of
which are exhibiting suspicious behavior or metrics. Does anyone have
any pointers as to what might be going on here?
Google suggests RAM might be an issue; however, in both instances when
this happened the node in question had about 15 GiB of free RAM out of
a total of 48 GiB.
Just for fun, we're thinking about testing a Debian 11 backports kernel,
but we don't have any concrete direction to go in.
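Concretely, we'd probably just pull the newer kernel metapackage from
buster-backports, along these lines (assuming we stay on Debian 10 and
already have the backports repo in sources.list):

    apt -t buster-backports install linux-image-amd64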
Any and all hints are greatly appreciated, thanks!