Hello,

We are running DRBD 8.3.12 in a dual-primary system. On top of the 3 DRBD resources we run CLVM, with KVM virtual machines running from the clustered volumes. The cluster was set up following Alteeve's tutorial: https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

We have 5 virtual machines, 2 of which are Windows Server 2008 (one is SBS 2011); the others are Linux. All run fine, as far as I can tell, most of the time. The problem occurs when the SBS 2011 guest VM is restarted. This did not happen when the server was first installed, but the last few reboots have triggered it.

DRBD/KVM Host 1:

Apr 25 21:24:42 oberon kernel: block drbd2: sock was shut down by peer
Apr 25 21:24:42 oberon kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Apr 25 21:24:42 oberon kernel: block drbd2: short read expecting header on sock: r=0
Apr 25 21:24:42 oberon kernel: block drbd2: asender terminated
Apr 25 21:24:42 oberon kernel: block drbd2: Terminating asender thread

(Host 1 is STONITHed at this point.)

DRBD/Host 2:

Apr 25 21:24:42 titania kernel: block drbd2: PingAck did not arrive in time.
Apr 25 21:24:42 titania kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Apr 25 21:24:42 titania kernel: block drbd2: asender terminated
Apr 25 21:24:42 titania kernel: block drbd2: Terminating asender thread
Apr 25 21:24:42 titania kernel: block drbd2: Connection closed
Apr 25 21:24:42 titania kernel: block drbd2: conn( NetworkFailure -> Unconnected )

Host 2 continues and brings up the 2 VMs successfully.

I assume the PingAck not arriving in time at host 2 is what causes the socket to be shut down on host 1? The ping timeout is at the default of 5 tenths of a second. Why is it timing out when this guest VM is rebooted? The two host servers have a dedicated Intel 10 Gigabit AT2 adapter for DRBD.
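For reference, the keep-alive knobs in question live in the net section of the resource configuration. This is only a sketch of the DRBD 8.3 syntax with the default values spelled out; the resource name r2 is my guess for the drbd2 device in the logs:

# Hypothetical excerpt, e.g. /etc/drbd.d/r2.res (DRBD 8.3 syntax)
resource r2 {
  net {
    ping-int     10;   # seconds between keep-alive pings (default 10)
    ping-timeout  5;   # tenths of a second to wait for PingAck (default 5, i.e. 0.5 s)
    timeout      60;   # tenths of a second for general network timeout (default 60)
  }
  # ... disk, syncer and host sections as per the Alteeve tutorial ...
}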
I have a feeling this may have started after the guest Windows VM was assigned more memory, going from about 15 GB to 20 GB, and I wonder whether Windows writes a large memory dump when rebooting which pushes DRBD's replication too far. Simply raising the ping timeout seems like the wrong solution, but it is the only thing I can think of.

Any suggestions welcome.

Cheers,
Alastair Battrick
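As a back-of-envelope check on the memory-dump theory: the 20 GB figure and the 10 GbE line rate are the only numbers I have to go on, and real throughput will be bounded by the disks rather than the NIC, so treat this purely as an upper/lower bound sketch:

```python
# Rough bounds: how long would a 20 GB write burst take to replicate,
# and how many 0.5 s ping-timeout windows does that span?
dump_bytes = 20 * 1024**3           # assumed dump/pagefile burst size
link_bits_per_s = 10 * 10**9        # dedicated 10 GbE replication link

seconds = dump_bytes * 8 / link_bits_per_s
print(round(seconds, 1))            # ~17.2 s even at full line rate

ping_timeout_s = 0.5                # DRBD default ping-timeout of 5 tenths
print(seconds / ping_timeout_s)     # burst spans 30+ timeout windows
```

So even at line rate the burst lasts tens of ping-timeout windows; if the receiving side stalls on disk I/O for more than half a second during that burst, a PingAck could plausibly miss its deadline.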