[DRBD-user] drbd heartbeat "failed to fail-over"

Massimo Mongardini massimo.mongardini at gmail.com
Thu Feb 21 17:42:30 CET 2008


Lars Ellenberg wrote:
> On Thu, Feb 14, 2008 at 05:43:15PM +0000, Massimo Mongardini wrote:
>>    Hi, at my site we have an HA NFS server with drbd+heartbeat that 
>> recently "failed to fail-over" to the secondary node. None of the tests 
>> run before production, or after this error, reproduced the behaviour.
>>    What we guess happened is that, for some reason, heartbeat detected 
>> the failure and initiated the fail-over before drbd could enter the 
>> WFConnection state.
>>    From what we understand, drbd should by default detect a failure 
>> within 6 to 16 seconds, but in our case it took around 30 seconds 
>> (16:43:11 -> 16:43:41; allow a few seconds of skew for remote 
>> syslogging).
> Heartbeat's deadtime (and even warntime) should be larger than
> any of drbd's net timeout, ping-int, and probably also connect-int.
> Or, looking at it the other way around, the drbd timeouts need to
> be smaller.
> Also, there is a "retry loop" in the resource.d/drbddisk script;
> you may want to increase its max-try count.
> Finally, drbd 8 has the concept of a separate ping timeout:
> whenever you ask a secondary to become primary
> and it thinks the other node is still there,
> it will immediately send a drbd-ping with a short timeout,
> and if that is not answered in time, it will try to reconnect and
> then proceed with going primary.
> For short failover times, drbd 8 is more suitable than drbd 7.
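For readers unfamiliar with the retry loop mentioned above, a minimal sketch of the pattern follows. This is a simplified stand-in, not the actual resource.d/drbddisk source: the real script runs `drbdadm primary $RES`, for which `promote_cmd` below is a placeholder, and `MAXTRY` stands for the max-try count the post suggests raising.

```shell
#!/bin/sh
# Sketch of the retry loop in heartbeat's resource.d/drbddisk script
# (simplified; the real script calls `drbdadm primary $RES`).
# Raising MAXTRY gives DRBD more time to settle (e.g. to reach
# WFConnection) before heartbeat gives up on the fail-over.

MAXTRY=${MAXTRY:-10}       # max-try count; the knob to increase
SLEEPTIME=${SLEEPTIME:-1}  # seconds between attempts

retry_primary() {
    # Run "$@" (the promote command) until it succeeds or MAXTRY
    # attempts have failed. Returns 0 on success, 1 on exhaustion.
    try=0
    while [ "$try" -lt "$MAXTRY" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
        sleep "$SLEEPTIME"
    done
    return 1
}
```

Usage would look like `retry_primary drbdadm primary r0` (resource name `r0` is hypothetical here).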
Thanks for this. For the moment we'll try to fix the drbddisk script, 
and we'll look into migrating to drbd8.
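For reference, a minimal sketch of the timeout relationship Lars describes. The resource name "r0" is hypothetical and the values are illustrative (the drbd values shown are the defaults); note the mixed units: drbd's `timeout` and drbd 8's `ping-timeout` are in tenths of a second, while `ping-int` and `connect-int` are in seconds.

```
# /etc/drbd.conf (excerpt) -- resource name "r0" is hypothetical
resource r0 {
  net {
    timeout       60;   # 6.0 s (tenths of a second; default)
    ping-int      10;   # seconds between keep-alive pings (default)
    connect-int   10;   # seconds between connect retries (default)
    # drbd 8 only: short ping used when promoting while the peer
    # still looks present
    # ping-timeout 5;   # 0.5 s (tenths of a second; default)
  }
}
```

```
# /etc/ha.d/ha.cf (excerpt) -- keep these larger than the drbd
# timeouts above, so drbd notices the failure before heartbeat acts
warntime 20
deadtime 30
```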
