[DRBD-user] drbd 9.1.1 whole cluster blocked

Rene Peinthor rene.peinthor at linbit.com
Thu May 27 13:37:44 CEST 2021


Could still be related to this fix:

 * fix timeout detection after idle periods and for configs with ko-count
   when a disk on a secondary stops delivering IO-completion events

So if you have a ko-count set, this should be fixed.
Or it is something completely different... ;)
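
For context on what ko-count controls: as described in the drbd.conf man
page, a peer that fails to complete a write request within ko-count times
the "timeout" net option is dropped by the primary, and "timeout" is given
in tenths of a second. A rough sketch of that arithmetic follows. The
function name and the sample values are made up for illustration, so check
the values in your own resource configuration rather than relying on these.

```python
from typing import Optional


def ko_count_window_seconds(ko_count: int, timeout_tenths: int) -> Optional[float]:
    """Approximate time before a primary drops a peer that stops completing writes.

    ko_count       -- value of the "ko-count" net option (0 disables the check)
    timeout_tenths -- value of the "timeout" net option, in tenths of a second

    Returns None when ko_count is 0, i.e. the mechanism is switched off.
    """
    if ko_count < 0 or timeout_tenths <= 0:
        raise ValueError("ko_count must be >= 0 and timeout_tenths must be > 0")
    if ko_count == 0:
        return None
    # "timeout" is expressed in tenths of a second, so divide by 10.
    return ko_count * timeout_tenths / 10.0


# Illustrative values only (assumed, not taken from the reported cluster).
print(ko_count_window_seconds(7, 60))
print(ko_count_window_seconds(0, 60))
```

This only estimates when the ko-count check is supposed to fire. It says
nothing about whether the check actually fired in the situation reported
below, which is what the fix quoted above is about.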

Cheers,
Rene

On Thu, May 27, 2021 at 1:25 PM Andreas Pflug <pgadmin at pse-consulting.de>
wrote:

> I'm running a Proxmox cluster with 3 disk nodes and 3 diskless nodes
> with drbd 9.1.1. The disk nodes have storage on md raid6 (8 ssds each)
> with a journal on an optane device.
>
> Yesterday, the whole cluster was severely impacted when one node had
> write problems. There is no indication of any hardware problem, no
> events whatsoever. What happened, taken from the logs:
>
> - one diskless node reports "sending time expired" for some devices on a
> specific disk node. After 30 seconds, it disconnects those devices on
> that node.
> - the disk node logs state change to outdated.
> - After 80s, the disk node logs "task blocked for more than 120
> seconds". These tasks are 8 drbd_r_xxx processes, but also md2_reclaim.
> - No more logging after that.
>
> After that, the whole cluster was severely impacted, most vms
> unresponsive. The node hosts were still accessible, with no more kernel
> logging.
>
> After analyzing the situation, assuming a single node would block
> everything, that node was rebooted (no normal reboot possible, needed
> "echo b >/proc/sysrq-trigger"). This did help, everything back to normal.
>
> So apparently there are situations when a backing storage problem might
> block all drbd processing in a way that prevents normal timeout
> detection and subsequent disconnection on other nodes. Reading the 9.1.2
> release notes, this doesn't seem to be addressed there.
>
> Regards,
> Andreas