[DRBD-user] DRBD 8.4 two-node primary locks up, did not send a P_BARRIER
Sven-Erik Neve
sven-erik.neve at solactive.com
Tue Nov 23 17:49:41 CET 2021
Hi all,
at work my team and I are facing a DRBD 8.4 two-node cluster where the
primary node seemingly randomly locks up, thereby preventing access to
its data.
When this happens dmesg shows entries such as this one coming from DRBD:
We did not send a P_BARRIER for 5118944ms > ko-count (7) * timeout (60 *
0.1s); drbd kernel thread blocked?
At the same time DRBD commands such as 'drbdadm secondary' no longer
return and simply hang, and reading '/proc/drbd' hangs as well.
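In case it helps narrow things down: next time this happens we plan to
grab the kernel stacks of the DRBD threads roughly as follows (we
haven't verified this on the affected node yet, so treat it as a
sketch rather than something we've already captured):

    # any DRBD kernel threads stuck in uninterruptible sleep (D state)?
    ps axo pid,stat,comm | grep drbd
    # dump the kernel stack of a suspect thread (as root)
    cat /proc/<pid>/stack
    # or log the stacks of all blocked tasks to the kernel log (as root)
    echo w > /proc/sysrq-trigger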
This is happening on a Debian 10 Xen virtual machine (via XCP-ng). The
installed 'drbd-utils' Debian package is version 9.5.0-1. The 'drbd.ko'
module is version 8.4.10. Kernel is 4.19.208-1 installed via package
'linux-image-4.19.0-18-amd64'. The config as shown via 'drbdadm dump' is
available at Pastebin: https://pastebin.com/raw/b122wQU9.
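For quick reference, the two values quoted in the log message map to
the usual net-section options; in 'drbdadm dump' syntax that would look
roughly like this (a sketch based on the log values, not a verbatim
excerpt of our dump):

    resource r0 {            # resource name is illustrative
        net {
            ko-count 7;
            timeout  60;     # in tenths of a second, i.e. 6 s
        }
    }

That would put the watchdog threshold at 7 * 6 s = 42 s, while the
message reports 5118944 ms, i.e. roughly 85 minutes without a P_BARRIER.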
The DRBD device serves as the backing block device for a ZFS zpool;
ZFS itself is version '2.0.3-9~bpo10+1'.
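The layering itself is minimal, essentially just the pool sitting
directly on the DRBD device (device and pool names here are
illustrative, not our actual ones):

    zpool create tank /dev/drbd0
    zpool status tank    # shows the drbd device as the pool's only vdev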
Systems monitoring suggests that the issue occurs when disk load,
measured as I/O wait time, is higher than usual. Since we've only seen
this situation twice so far, that's not much of a pattern yet. Despite
disk load seemingly being a factor, none of the other virtual machine
tenants on the same hypervisor and disk array are affected. The
underlying storage is an SSD-based RAID 10 array of four disks, none of
which are exhibiting suspicious behavior or metrics. Does anyone have
any pointers as to what might be going on here?
Google suggests RAM might be an issue; however, in both instances when
this happened the node in question had about 15 GiB of free RAM out of
a total of 48 GiB.
Just for fun, we're thinking about testing a Debian 11 backports kernel,
but we don't have any concrete direction to go in.
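Concretely, we'd probably just pull the newer kernel metapackage from
buster-backports, along these lines (assuming we stay on Debian 10 and
already have the backports repo in sources.list):

    apt -t buster-backports install linux-image-amd64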
Any and all hints are greatly appreciated, thanks!