Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
On Thu, Feb 14, 2008 at 05:43:15PM +0000, Massimo Mongardini wrote:
> Hi, at my site we have an ha-nfs server with drbd+heartbeat that
> recently "failed to fail-over" to the secondary node. All the tests
> prior to production and after this error didn't have the same behaviour.
> What we guess had happened is that for some reason heartbeat
> detected and initiated the fail-over process before drbd could go into a
> WFConnection state.
> From what we've understood drbd should by default detect a failure
> within 6 to 16 seconds, but in our case it took around 30 seconds
> (16:43:11 -> 16:43:41; there are a few seconds of delay considering remote
> syslogging).

heartbeat's deadtime (and even warntime) should be larger than any of
drbd's net timeouts: timeout, ping-int, and probably also connect-int.
Or, looking at it from the other direction, the drbd timeouts need to
be smaller.

Also, there is a "retry loop" in the resource.d/drbddisk script.
You may want to increase its max-try count.

Finally, drbd 8 has the concept of a separate ping timeout: whenever
you ask a Secondary to become Primary and it thinks the other node is
still there, it will immediately send a drbd-ping with a short timeout,
and if that is not answered in time, it will try to reconnect and
proceed with going Primary. For short failover times, drbd 8 is more
suitable than drbd 7.

--
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting            sales at linbit.com  :
: LINBIT Information Technologies GmbH     Tel +43-1-8178292-0   :
: Vivenotgasse 48, A-1120 Vienna/Europe    Fax +43-1-8178292-82  :

__
please use the "List-Reply" function of your email client.
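[Editor's note: to make the timeout relationship above concrete, here is a minimal sketch of the two config files involved. The values are illustrative assumptions, not tuned recommendations; note that drbd's `timeout` and `ping-timeout` are given in tenths of a second, while `ping-int`, `connect-int`, and all of heartbeat's ha.cf timers are in whole seconds.]

```
# /etc/drbd.conf -- net section (illustrative values only)
resource r0 {
  net {
    timeout      60;   # 6.0 s   (unit is 0.1 s)
    connect-int  10;   # 10 s between connect retries
    ping-int     10;   # 10 s between peer pings
    ping-timeout  5;   # 0.5 s   (drbd 8 only; unit is 0.1 s)
  }
}

# /etc/ha.d/ha.cf -- heartbeat timers (seconds)
keepalive 2
warntime  12    # larger than any single drbd timeout above
deadtime  30    # comfortably larger than timeout, ping-int, connect-int
```

With settings along these lines, drbd should notice a dead peer and fall back to WFConnection well before heartbeat declares the node dead and tries to promote the survivor.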