Lars Ellenberg wrote:
> On Thu, Feb 14, 2008 at 05:43:15PM +0000, Massimo Mongardini wrote:
>
>> Hi, at my site we have an HA NFS server with drbd+heartbeat that
>> recently "failed to fail over" to the secondary node. None of the
>> tests run before going to production, or repeated after this error,
>> showed the same behaviour.
>> Our guess is that, for some reason, heartbeat detected the failure
>> and initiated the fail-over before drbd could reach the
>> WFConnection state.
>> From what we understand, drbd should by default detect a failure
>> within 6 to 16 seconds, but in our case it took around 30 seconds
>> (16:43:11 -> 16:43:41, allowing a few seconds of delay for remote
>> syslogging).
>
> heartbeat deadtime (and even warntime) should be larger than
> any of drbd's net timeout, ping-int, and probably also connect-int.
> Or, looking at it the other way around, the drbd timeouts need to
> be smaller.
>
> Also, there is a "retry loop" in the resource.d/drbddisk script;
> you may want to increase its max-try count.
>
> Finally, drbd 8 has the concept of a separate ping timeout:
> whenever you ask a secondary to become primary while it still
> thinks the other node is there, it immediately sends a drbd-ping
> with a short timeout, and if that is not answered in time, it
> tries to reconnect and proceeds with going primary.
> For short fail-over times, drbd 8 is more suitable than drbd 7.

Thanks for this. For the moment we'll fix the drbddisk script, and we'll look into migrating to drbd 8.

cheers
Massimo
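[Editor's note] The timeout relationship Lars describes can be sketched as a pair of config fragments. All values below are hypothetical illustrations, not recommendations; check the drbd.conf and ha.cf man pages for your versions (note that drbd's `timeout` is given in tenths of a second, while `ping-int` and `connect-int` are in seconds):

```
# /etc/drbd.conf -- illustrative values only
resource r0 {
  net {
    timeout      60;   # 6.0 seconds (unit is tenths of a second)
    connect-int  10;   # seconds between connect retries
    ping-int     10;   # seconds between keep-alive pings
  }
}

# /etc/ha.d/ha.cf -- illustrative values only
warntime  10   # warn well before declaring the peer dead
deadtime  30   # larger than any of the drbd timeouts above
```

The point of the rule is ordering: if heartbeat's deadtime fires first, it will try to promote the secondary while drbd still believes the old peer is alive, which is the situation Massimo hit.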
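[Editor's note] The separate ping timeout Lars refers to is the `ping-timeout` option in drbd 8's net section. A hedged sketch, with illustrative values only (in drbd 8 this value, like `timeout`, is expressed in tenths of a second; verify against the drbd.conf man page for your release):

```
# /etc/drbd.conf, drbd 8 -- illustrative values only
resource r0 {
  net {
    ping-timeout  5;   # 0.5s to answer the drbd-ping sent on promotion
    ping-int     10;   # seconds between keep-alive pings
    timeout      60;   # 6.0s general network timeout
  }
}
```

This is what makes drbd 8 faster here: promotion triggers an immediate short-deadline ping rather than waiting out the general timeout.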
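[Editor's note] The "retry loop" Lars mentions lives in heartbeat's resource.d/drbddisk script. The sketch below is a paraphrase of that pattern, not the shipped script; the function names and the mock command standing in for `drbdadm primary <res>` are assumptions for demonstration:

```shell
#!/bin/sh
# Retry-loop sketch in the style of resource.d/drbddisk (paraphrased).
# retry_primary keeps re-running a command until it succeeds or the
# max-try count is exhausted -- raising that count gives drbd more
# time to notice the dead peer before promotion gives up.

retry_primary() {
    cmd=$1   # command to run (e.g. "drbdadm primary r0")
    try=$2   # max attempts; drbddisk's equivalent is the count to raise
    while true; do
        $cmd && return 0
        try=$((try - 1))
        [ "$try" -gt 0 ] || return 1
        sleep 1
    done
}

# Mock of "drbdadm primary <res>" that succeeds on the 3rd call,
# standing in for a peer connection that needs a moment to time out.
attempts=0
mock_primary() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

retry_primary mock_primary 6 && echo "promoted after $attempts attempts"
```

With the mock above the loop prints "promoted after 3 attempts"; in the real script a too-small max-try count makes the promotion fail outright instead.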