Hello,

We are running DRBD 8.3.12 in a dual-primary system. On top of the 3 DRBD resources we run CLVM, with KVM virtual machines running from the clustered volumes. The cluster was set up following Alteeve's tutorial: https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial

We have 5 virtual machines, 2 of which are Windows Server 2008 (one is SBS 2011); the others are Linux. All run fine, as far as I can tell, most of the time. The problem occurs when the SBS 2011 guest VM is restarted. This did not happen when the server was first installed, but the last few reboots have triggered it.

DRBD/KVM Host 1:

Apr 25 21:24:42 oberon kernel: block drbd2: sock was shut down by peer
Apr 25 21:24:42 oberon kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Apr 25 21:24:42 oberon kernel: block drbd2: short read expecting header on sock: r=0
Apr 25 21:24:42 oberon kernel: block drbd2: asender terminated
Apr 25 21:24:42 oberon kernel: block drbd2: Terminating asender thread

(Host 1 is STONITHed at this point.)

DRBD/Host 2:

Apr 25 21:24:42 titania kernel: block drbd2: PingAck did not arrive in time.
Apr 25 21:24:42 titania kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Apr 25 21:24:42 titania kernel: block drbd2: asender terminated
Apr 25 21:24:42 titania kernel: block drbd2: Terminating asender thread
Apr 25 21:24:42 titania kernel: block drbd2: Connection closed
Apr 25 21:24:42 titania kernel: block drbd2: conn( NetworkFailure -> Unconnected )

Host 2 continues and brings up the 2 VMs successfully.

I assume the PingAck not arriving in time at host 2 is what causes the socket to be shut down on host 1? The ping timeout is at the default of 5 tenths of a second. Why is it timing out when this guest VM is rebooted? The two host servers have a dedicated Intel 10 Gigabit AT2 adapter for DRBD.
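For reference, the keep-alive knobs in question live in the net section of the resource configuration. This is only a sketch of the DRBD 8.3 syntax with the default values spelled out; the resource name r2 is my guess for the drbd2 device in the logs:

# Hypothetical excerpt, e.g. /etc/drbd.d/r2.res (DRBD 8.3 syntax)
resource r2 {
  net {
    ping-int     10;   # seconds between keep-alive pings (default 10)
    ping-timeout  5;   # tenths of a second to wait for PingAck (default 5, i.e. 0.5 s)
    timeout      60;   # tenths of a second for general network timeout (default 60)
  }
  # ... disk, syncer and host sections as per the Alteeve tutorial ...
}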
I have a feeling this may have started after the guest Windows VM was assigned more memory, going from about 15 GB to 20 GB, and I wonder whether Windows writes a large memory dump when rebooting which pushes DRBD's replication too far. Simply raising the ping timeout seems like the wrong solution, but it is the only thing I can think of.

Any suggestions welcome.

Cheers,
Alastair Battrick
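As a back-of-envelope check on the memory-dump theory: the 20 GB figure and the 10 GbE line rate are the only numbers I have to go on, and real throughput will be bounded by the disks rather than the NIC, so treat this purely as an upper/lower bound sketch:

```python
# Rough bounds: how long would a 20 GB write burst take to replicate,
# and how many 0.5 s ping-timeout windows does that span?
dump_bytes = 20 * 1024**3           # assumed dump/pagefile burst size
link_bits_per_s = 10 * 10**9        # dedicated 10 GbE replication link

seconds = dump_bytes * 8 / link_bits_per_s
print(round(seconds, 1))            # ~17.2 s even at full line rate

ping_timeout_s = 0.5                # DRBD default ping-timeout of 5 tenths
print(seconds / ping_timeout_s)     # burst spans 30+ timeout windows
```

So even at line rate the burst lasts tens of ping-timeout windows; if the receiving side stalls on disk I/O for more than half a second during that burst, a PingAck could plausibly miss its deadline.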