Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
On Thu, Feb 14, 2008 at 05:43:15PM +0000, Massimo Mongardini wrote:
> Hi, at my site we have an ha-nfs server with drbd+heartbeat that
> recently "failed to fail-over" to the secondary node. All the tests
> prior to production and after this error didn't have the same behaviour.
> What we guess had happened is that for some reason heartbeat
> detected and initiated the fail-over process before drbd could go into a
> WFConnection state.
> From what we've understood drbd should by default detect a failure
> within 6 to 16 seconds, but in our case it took around 30 seconds
> (16:43:11 -> 16:43:41; there are a few seconds of delay considering remote
> syslogging).

heartbeat's deadtime (and even warntime) should be larger than any of
drbd's net timeouts: timeout, ping-int, and probably also connect-int.
Or, looking at it from the other direction, the drbd timeouts need to
be smaller.

Also, there is a "retry loop" in the resource.d/drbddisk script.
You may want to increase its max-try count.

Finally, drbd 8 has the concept of a separate ping timeout: whenever
you ask a Secondary to become Primary and it thinks the other node is
still there, it will immediately send a drbd-ping with a short timeout,
and if that is not answered in time, it will try to reconnect and
proceed with going Primary. For short failover times, drbd 8 is more
suitable than drbd 7.

--
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting            sales at linbit.com  :
: LINBIT Information Technologies GmbH     Tel +43-1-8178292-0   :
: Vivenotgasse 48, A-1120 Vienna/Europe    Fax +43-1-8178292-82  :

__
please use the "List-Reply" function of your email client.
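[Editor's note: to make the timeout relationship above concrete, here is a minimal sketch of the two config files involved. The values are illustrative assumptions, not tuned recommendations; note that drbd's `timeout` and `ping-timeout` are given in tenths of a second, while `ping-int`, `connect-int`, and all of heartbeat's ha.cf timers are in whole seconds.]

```
# /etc/drbd.conf -- net section (illustrative values only)
resource r0 {
  net {
    timeout      60;   # 6.0 s   (unit is 0.1 s)
    connect-int  10;   # 10 s between connect retries
    ping-int     10;   # 10 s between peer pings
    ping-timeout  5;   # 0.5 s   (drbd 8 only; unit is 0.1 s)
  }
}

# /etc/ha.d/ha.cf -- heartbeat timers (seconds)
keepalive 2
warntime  12    # larger than any single drbd timeout above
deadtime  30    # comfortably larger than timeout, ping-int, connect-int
```

With settings along these lines, drbd should notice a dead peer and fall back to WFConnection well before heartbeat declares the node dead and tries to promote the survivor.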