Lars Ellenberg wrote:
> On Thu, Feb 14, 2008 at 05:43:15PM +0000, Massimo Mongardini wrote:
<pre wrap=""> Hi, at my site we have an ha-nfs server with drbd+heartbeat that
recently "failed to fail-over" to the secondary node. All the tests
prior to production and after this error didn't have the same behaviour.
What we guess had happened is that for some reason heartbeat
detected and initiated the fail-over process before drbd could go on a
WFconnection state.
From what we've understood drbd should by default detect a failure
within 6 to 16 seconds, but in our case it took around 30 seconds
(16:43:11 -> 16:43:41 there is some seconds of delay considering remote
syslogging).
</pre>
</blockquote>
<pre wrap=""><!---->
heartbeat deadtime (even warntime) should be larger than
any of drbd net timeout, ping-int, and probably also connect-int.
or, lookin in the other direction,
drbd timeouts need to be smaller.
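
For anyone else reading along, here is roughly how those values relate
across the two config files. The numbers are only illustrative (drbd 0.7
style syntax), not tuned recommendations:

    # /etc/ha.d/ha.cf (heartbeat) -- illustrative timings
    keepalive 2        # seconds between heartbeats
    warntime  10       # warn about late heartbeats
    deadtime  30       # declare the peer dead; keep above the drbd timeouts

    # /etc/drbd.conf, net section (drbd 0.7 style) -- illustrative values
    net {
        timeout      60;   # 6.0s (unit is 0.1s); keep well below deadtime
        connect-int  10;   # seconds between connection attempts
        ping-int     10;   # seconds between drbd keep-alive pings
    }

If we understand it right, those defaults (timeout 6s, ping-int 10s) are
where the 6-to-16-second detection window comes from.
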
>
> Also, there is a "retry loop" in the resource.d/drbddisk script;
> you may want to increase its max-try count.
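
That loop looks roughly like this. This is a paraphrase of the start
action in the resource.d/drbddisk script shipped with our heartbeat
package; the exact script may differ between versions:

    case "$CMD" in
        start)
            # Try several times: heartbeat may declare the peer dead
            # before drbd has noticed the link is gone, so the first
            # "drbdadm primary" attempts can fail.
            try=6                      # the max-try count to increase
            while true; do
                drbdadm primary "$RES" && break
                try=$(( try - 1 ))
                if [ "$try" -le 0 ]; then
                    echo "$RES: cannot become primary" >&2
                    exit 1
                fi
                sleep 1
            done
            ;;
    esac

Raising the count from 6 gives drbd more time to notice the dead peer
before the takeover is abandoned.
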
>
> Finally, drbd 8 has the concept of a separate ping timeout: whenever
> you ask a secondary to become primary and it thinks the other node is
> still there, it will immediately send a drbd-ping with a short timeout,
> and if that is not answered in time, it will try to reconnect and then
> proceed to go primary.
> For short failover times, drbd 8 is more suitable than drbd 7.
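
If I read the drbd 8 documentation right, that short timeout is the
ping-timeout option in the net section; something like this (values
illustrative; timeout and ping-timeout are in 0.1-second units):

    # /etc/drbd.conf, net section (drbd 8) -- illustrative values
    net {
        ping-timeout  5;   # 0.5s: how long the node asked to become
                           # primary waits for the answer to that
                           # immediate drbd-ping before reconnecting
        ping-int     10;   # seconds between periodic drbd pings
        timeout      60;   # 6.0s for regular network traffic
    }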

Thanks for this. For the moment we'll try to fix the drbddisk script,
and we'll look into migrating to drbd8.

cheers
Massimo