[DRBD-user] drbd heartbeat "failed to fail-over"

Thu Feb 21 17:42:30 CET 2008

Lars Ellenberg wrote:
> On Thu, Feb 14, 2008 at 05:43:15PM +0000, Massimo Mongardini wrote:
>   
>>    Hi, at my site we have an ha-nfs server with drbd+heartbeat that 
>> recently "failed to fail-over" to the secondary node. All the tests 
>> prior to production and after this error didn't have the same behaviour.
>>    Our guess is that, for some reason, heartbeat detected the failure 
>> and initiated the fail-over process before drbd could go into the 
>> WFConnection state.
>>    From what we've understood, drbd should by default detect a failure 
>> within 6 to 16 seconds, but in our case it took around 30 seconds 
>> (16:43:11 -> 16:43:41; allow a few seconds of delay for remote 
>> syslogging).
>>     
>
> heartbeat deadtime (even warntime) should be larger than
> any of drbd net timeout, ping-int, and probably also connect-int.
> or, looking in the other direction,
> drbd timeouts need to be smaller.
>
> also, there is a "retry loop" in the resource.d/drbddisk script.
> you may want to increase its max-try count.
>
>
> finally, drbd 8 has the concept of a separate ping timeout,
> so whenever you ask a secondary to become primary,
> and it thinks the other node is still there,
> it will immediately send a drbd-ping with a short timeout,
> and if that is not answered in time, try to reconnect and proceed
> with going primary.
> for short failover times, drbd 8 is more suitable than drbd 7.
>
>   
Thanks for this. For the moment we'll try to fix the drbddisk script, 
and we'll look into migrating to drbd 8.
cheers
Massimo
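
The rule of thumb quoted above (heartbeat deadtime larger than each of
the drbd net timeout, ping-int and connect-int) can be sketched as a
small sanity check. This is only an illustration, not part of drbd or
heartbeat: the function names are made up, and it assumes the usual
drbd.conf units (timeout in tenths of a second, ping-int and
connect-int in seconds) with defaults of 60, 10 and 10. Reading the
"6 to 16 seconds" figure as timeout up to timeout plus ping-int is
likewise an assumption.

```python
# Hypothetical helper, not shipped with drbd or heartbeat.
# Assumed units: timeout in tenths of a second; ping-int and
# connect-int in seconds (defaults assumed to be 60, 10, 10).

def drbd_detection_window(timeout_ds=60, ping_int_s=10):
    """Rough (best, worst) seconds for drbd to notice a dead peer.

    Assumes detection takes between one net timeout and one net
    timeout plus one ping interval.
    """
    timeout_s = timeout_ds / 10.0
    return (timeout_s, timeout_s + ping_int_s)


def settings_exceeding_deadtime(deadtime_s, timeout_ds=60,
                                ping_int_s=10, connect_int_s=10):
    """Return the drbd settings that heartbeat deadtime does not exceed."""
    limits = {
        "timeout": timeout_ds / 10.0,
        "ping-int": float(ping_int_s),
        "connect-int": float(connect_int_s),
    }
    return sorted(name for name, secs in limits.items()
                  if deadtime_s <= secs)


if __name__ == "__main__":
    print(drbd_detection_window())
    print(settings_exceeding_deadtime(8))
```

An empty list from settings_exceeding_deadtime() means the deadtime is
larger than all three drbd values; any names returned are the settings
to lower (or the reason to raise deadtime).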