<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Greetings!</div><div><br></div><div>I&#39;ve been experiencing and troubleshooting this problem for several months now with little success.</div><div><br></div><div>I&#39;m using DRBD 8.4.11-1 in a 2-node, dual-primary cluster on CentOS 7.5.1804. This cluster is an HA virtualization solution based on KVM. Randomly, maybe once a month or so, the DRBD service on node2 fails to finish a write request from node1 (sock_sendmsg time expired). Node1 then initiates fencing, which results in an IPMI reboot of node2. From what I can tell, there is increased disk activity on node1 that node2 can&#39;t keep up with. The hardware of the two nodes is identical, and DRBD replication runs over a dedicated, redundant 10G connection.<br></div><div><br></div><div>I&#39;ll start by including some basic, sanitized configs and log messages. I can provide pretty detailed performance metrics from sysstat if necessary. Any help in troubleshooting this mystery is greatly appreciated. Please let me know if you need any other information. Thanks.</div><div><br></div><div>DRBD configuration: <a href="https://pastebin.com/aNB7uB4r">https://pastebin.com/aNB7uB4r</a><br></div><div>Node1 Logs: <a href="https://pastebin.com/aEhSWy1b">https://pastebin.com/aEhSWy1b</a><br></div><div>Node2 Logs: <a href="https://pastebin.com/sFU84BWZ">https://pastebin.com/sFU84BWZ</a></div><div>Hardware configuration: <a href="https://pastebin.com/jzRwxQeP">https://pastebin.com/jzRwxQeP</a><br></div><div><div><div dir="ltr" class="gmail_signature"><div>RAID configuration and info: <a href="https://pastebin.com/vnGsUkHW">https://pastebin.com/vnGsUkHW</a></div><div><br></div>-Chris H<br></div></div></div></div></div></div></div></div></div></div></div>