[DRBD-user] DRBD Occasional Lockup

Mon Oct 29 20:33:38 CET 2018

Greetings!

I've been experiencing and troubleshooting this problem for several months
now with little success.

I'm using DRBD 8.4.11-1 in a 2-node, dual-primary cluster on CentOS
7.5.1804. This cluster is an HA virtualization solution based on KVM.
Randomly, maybe once a month or so, the DRBD service on node2 fails to
finish a write request from node1 (sock_sendmsg time expired). Node1 then
initiates fencing, which results in an IPMI reboot of node2. From what I
can tell, there is increased disk activity on node1 that node2 can't keep
up with. The nodes have identical hardware, and DRBD replication runs
over a dedicated, redundant 10G connection.

I'll start by including some basic, sanitized configs and log messages. I
can provide pretty detailed performance metrics from sysstat if necessary.
Any help in troubleshooting this mystery is greatly appreciated. Please let
me know if you need any other information. Thanks.

DRBD configuration: https://pastebin.com/aNB7uB4r
Node1 Logs: https://pastebin.com/aEhSWy1b
Node2 Logs: https://pastebin.com/sFU84BWZ
Hardware configuration: https://pastebin.com/jzRwxQeP
RAID configuration and info: https://pastebin.com/vnGsUkHW
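
To line up the timeout events in the logs above with sysstat samples, a
small script along these lines might help. It is only a sketch: the string
"sock_sendmsg time expired" is the message quoted above, but the syslog
timestamp layout ("Oct 29 20:33:38 ...") and the function names
find_timeouts and count_by_hour are assumptions for illustration, not
anything DRBD provides.

```python
import re
from collections import Counter

# Message text quoted in the post; everything else about the line is assumed.
MARKER = "sock_sendmsg time expired"

# Assumption: lines begin with a classic syslog timestamp, e.g. "Oct 29 20:33:38".
SYSLOG_TS = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}")


def find_timeouts(lines):
    """Return (line_number, line) pairs for every line containing MARKER."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if MARKER in line
    ]


def count_by_hour(hits):
    """Count hits per (month, day, hour), in the order first seen.

    Lines that do not start with a syslog-style timestamp are skipped.
    """
    counts = Counter()
    for _, line in hits:
        match = SYSLOG_TS.match(line)
        if match:
            key = (match.group(1), int(match.group(2)), int(match.group(3)))
            counts[key] += 1
    return counts
```

Passing an open log file to find_timeouts and the result to count_by_hour
gives per-hour counts that can be compared against the sar intervals for
the same hours on each node.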

-Chris H