[DRBD-user] drbd 9.1.1 whole cluster blocked

Thu May 27 13:13:03 CEST 2021

I'm running a Proxmox cluster with 3 disk nodes and 3 diskless nodes
on DRBD 9.1.1. The disk nodes have their storage on md RAID6 (8 SSDs
each) with a journal on an Optane device.

Yesterday, the whole cluster was severely impacted when one node had
write problems. There is no indication of any hardware problem, no
events whatsoever. Here is what happened, taken from the logs:

- One diskless node reports "sending time expired" for some devices on a
specific disk node. After 30 seconds, it disconnects those devices on
that node.
- The disk node logs a state change to outdated.
- After 80s, the disk node logs "task blocked for more than 120
seconds". These tasks are 8 drbd_r_xxx processes, but also md2_reclaim.
- There is no more logging after that.
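
For anyone who wants to check their own nodes for the same pattern, here
is a minimal sketch that counts hung-task warnings per task name in
kernel log output. It assumes the usual kernel wording ("INFO: task
NAME:PID blocked for more than N seconds."). The function name
hung_tasks is made up for illustration and is not part of any DRBD
tooling.

```python
import re
from collections import Counter

# Assumed kernel hung-task wording:
#   "INFO: task <comm>:<pid> blocked for more than <N> seconds."
# A timestamp or syslog prefix in front of it is ignored.
HUNG_TASK = re.compile(
    r"task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(lines):
    """Count hung-task warnings per task name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts
```

It accepts any iterable of lines, for example sys.stdin when fed from
"dmesg" or "journalctl -k". Many warnings for drbd_r_* tasks together
with an md thread would match the pattern described above.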

After that, the whole cluster was severely impacted and most VMs were
unresponsive. The node hosts were still accessible, but there was no
further kernel logging.

After analyzing the situation, and assuming that a single node was
blocking everything, I rebooted that node (a normal reboot was not
possible; it needed "echo b >/proc/sysrq-trigger"). This helped, and
everything went back to normal.

So apparently there are situations in which a backing storage problem
can block all DRBD processing in a way that prevents normal timeout
detection and the subsequent disconnection on the other nodes. Judging
from the 9.1.2 release notes, this does not seem to be addressed there.

Regards,
Andreas