[DRBD-user] drbd 9.1.1 whole cluster blocked

Andreas Pflug pgadmin at pse-consulting.de
Thu May 27 13:47:46 CEST 2021


No ko-count set, so apparently something different...


On 27.05.21 at 13:37, Rene Peinthor wrote:
> Could still be related to this fix:
> 
>  * fix timeout detection after idle periods and for configs with ko-count
>    when a disk on a secondary stops delivering IO-completion events
> 
> So if you have a ko-count set, this should be fixed.
> Or it is something completely different... ;)
> 
> Cheers,
> Rene
> 
> On Thu, May 27, 2021 at 1:25 PM Andreas Pflug <pgadmin at pse-consulting.de> wrote:
> 
>     I'm running a Proxmox cluster with 3 disk nodes and 3 diskless nodes
>     with DRBD 9.1.1. The disk nodes have storage on md RAID6 (8 SSDs each)
>     with a journal on an Optane device.
> 
>     Yesterday, the whole cluster was severely impacted when one node had
>     write problems. There is no indication of any hardware problem, no
>     events whatsoever. What happened, taken from the logs:
> 
>     - one diskless node reports "sending time expired" for some devices on a
>     specific disk node. After 30 seconds, it disconnects those devices on
>     that node.
>     - the disk node logs a state change to Outdated.
>     - After 80s, the disk node logs "task blocked for more than 120
>     seconds". These tasks are 8 drbd_r_xxx processes, but also md2_reclaim.
>     - No more logging after that.
> 
>     After that, the whole cluster was severely impacted, with most VMs
>     unresponsive. The node hosts were still accessible, with no more kernel
>     logging.
> 
>     After analyzing the situation, and assuming a single node was blocking
>     everything, that node was rebooted (no normal reboot was possible; it
>     needed "echo b >/proc/sysrq-trigger"). This helped, and everything
>     went back to normal.
> 
>     So apparently there are situations when a backing storage problem might
>     block all drbd processing in a way that prevents normal timeout
>     detection and subsequent disconnection on other nodes. Reading the 9.1.2
>     release notes, this doesn't seem to be addressed there.
> 
>     Regards,
>     Andreas
> 
>     _______________________________________________
>     Star us on GITHUB: https://github.com/LINBIT
>     drbd-user mailing list
>     drbd-user at lists.linbit.com
>     https://lists.linbit.com/mailman/listinfo/drbd-user
> 
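
The hung-task entries described in the quoted report ("task blocked for
more than 120 seconds" for the drbd_r_xxx processes and md2_reclaim) can be
picked out of a kernel log with a few lines of Python. This is only a rough
sketch: it assumes the usual kernel hung-task message wording
("INFO: task NAME:PID blocked for more than N seconds."), and the function
names and the "drbd_"/"md" prefix filter are made up for illustration, not
part of any DRBD tooling.

```python
import re

# Assumed wording of the kernel hung-task warning, e.g.
# "INFO: task md2_reclaim:987 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_lines):
    """Return (task name, pid, seconds) for every hung-task warning found."""
    hits = []
    for line in log_lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


def storage_related(hits):
    """Keep only tasks whose name starts with "drbd_" or "md".

    The prefixes are an illustrative guess based on the task names
    mentioned in the report; adjust them to the local setup.
    """
    return [hit for hit in hits if hit[0].startswith(("drbd_", "md"))]
```

Feeding it the lines of a saved kernel log (for example the output of
"dmesg" written to a file) and passing the result through storage_related()
gives a quick list of blocked DRBD receiver and md threads on a node.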


