Reviving an old thread ...

On Mon, Sep 12, 2005 at 11:16:33AM -0500, Dave Dykstra wrote:
> On Fri, Sep 09, 2005 at 12:06:31AM +0200, Lars Ellenberg wrote:
> > / 2005-09-08 11:39:28 -0500
> > \ Dave Dykstra:
> > > Alan Robertson wrote:
> > > > Regarding DRBD not taking over when it hasn't declared the other node
> > > > dead, I would think that a good solution might be to have DRBD wait up
> > > > to "drbd deadtime" seconds before giving up.
> > > >
> > > > Since Heartbeat happily has no clue about DRBD (or its internal
> > > > deadtime), it would seem to be best dealt with by DRBD.
> > > >
> > > > --
> > > > Alan Robertson <alanr at unix.sh>
> > >
> > > That sounds like a reasonable solution to me.
> > >
> > > In fact, I already see code to try a 'drbdadm primary' command 6
> > > times with a one second sleep between each try in the 'start' case of
> > > /etc/heartbeat/resource.d/drbddisk, and a comment saying that it is "in
> > > case heartbeat deadtime was smaller than drbd ping time". This is a
> > > different situation than the comment, in that heartbeat forcibly knocked
> > > down the remote side and immediately took over, so the timeout probably
> > > starts ticking when the first 'drbdadm primary' command is executed.
> > > I'm using the default drbd "timeout" time of six seconds, so presumably
> > > on the 7th or 8th try it would work.

This problem occurred during normal operation a couple months ago, and
one month ago I was about to leave the job where I set this up and I
didn't want to leave it in a state where this could happen to them so I
experimented with increasing the number of times that loop tried to go
primary. FYI I found that a count of 10 would usually work after
forcibly killing heartbeat processes on the active side, but one time it
didn't so I increased the count to 20 and it always worked after that.
It would be better for it to use a formula based on the deadtime but I
didn't try to do that. I did notice that the way the loop is written,
trying first before sleeping one second, that the number of elapsed
seconds is one less than the number of tries.

> > > I think that doing multiple tries in the drbddisk command is a hack,
> > > though, especially since it doesn't take into account any change in
> > > the "timeout" parameter that there may be in drbd.conf. I think the
> > > 'drbdsetup primary' command (possibly with a new option that drbddisk
> > > invokes) should try to contact the remote side and wait until there is
> > > either a positive response or a timeout before it exits with an error.
> >
> > what is there is a "hack".
> >
> > it is a misconfiguration, when heartbeat deadtime was
> > smaller than drbd ping time, though.
> >
> > still it could be desirable to have an option like that outlined above,
> > "drbdsetup /dev/drbd0 primary --I-think-peer-is-dead", and this option
> > would typically be used by the heartbeat resource script/agent.
>
> I think rather it should be something like --I-think-peer-may-be-dead
> because the heartbeat resource script would do the same thing no matter
> how it is coming up.
>
> > this will probably be implemented in 0.8 ...

I see that the latest 8.0 pre-release code in subversion is still using
a loop count of 6 in the drbddisk script and is not using an option like
one we discussed. If this is still quite low on the priority list, I
suggest that the loop count maximum in drbddisk be increased for now
because it's easy and it does work. Lars, what do you think?

- Dave
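
For illustration only, the try-first-then-sleep loop discussed above might be sketched in shell roughly as follows. This is not the actual drbddisk code: the names retry_cmd and tries_for_seconds, their arguments, and the $RES variable in the usage comment are hypothetical. It just shows why N tries with a one second sleep cover N-1 seconds, and how a try count could be derived from a wait time instead of being hard-coded.

```shell
#!/bin/sh
# Sketch under stated assumptions; retry_cmd and tries_for_seconds are
# hypothetical helpers, not part of drbddisk or DRBD.

# retry_cmd MAX_TRIES SLEEP_SECS COMMAND [ARGS...]
# Runs COMMAND up to MAX_TRIES times. It tries first and sleeps afterwards,
# so MAX_TRIES tries are separated by MAX_TRIES - 1 sleeps.
retry_cmd() {
    max_tries=$1
    sleep_secs=$2
    shift 2
    try=1
    while :; do
        "$@" && return 0                         # success: stop retrying
        [ "$try" -ge "$max_tries" ] && return 1  # out of tries: give up
        sleep "$sleep_secs"
        try=$((try + 1))
    done
}

# tries_for_seconds SECONDS
# With a one second sleep, waiting SECONDS seconds needs SECONDS + 1 tries.
tries_for_seconds() {
    echo $(( $1 + 1 ))
}

# Possible use in a resource script (commented out so the sketch has no
# side effects; 20 is the count that worked in the tests described above):
#   retry_cmd 20 1 drbdadm primary "$RES" || exit 1
```

Deriving the count from a configured wait time, for example retry_cmd "$(tries_for_seconds 19)" 1 ..., would keep the loop in step with a changed timeout, though where that time should come from is left open here.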