[DRBD-user] Recovery if active heartbeat dies before drbd

Dave Dykstra dwdha at drdykstra.us
Thu Jun 9 03:59:26 CEST 2005



On Tue, Jun 07, 2005 at 11:53:34PM +0200, Lars Ellenberg wrote:
> / 2005-06-07 15:26:33 -0500
> \ Dave Dykstra:
> > On Thu, May 19, 2005 at 09:14:17AM -0500, Dave Dykstra wrote:
> > > In order to test heartbeat's stonith, I have been doing kill -9 on the
> > > heartbeat processes on the active server.  What happens then is after
> > > heartbeat's timeout period, the standby server uses stonith to pull
> > > the power on the active server and immediately tries to bring up its
> > > drbd as primary.  That fails, I presume because its drbd still thinks
> > > the other side is primary.  I don't think heartbeat passes on any drbd
> > > error messages to /var/log/messages, so it is just a guess, but failover
> > > works if I just pull the power plug on the whole active server or kill
> > > heartbeat & drbd processes at the same time, so that must be the problem.
> > 
> > I've seen no response to this question.  Lars, what do you think?
> 
> first,
> from DRBD ChangeLog
> 
>      ...
> 
>     0.7.6 (api:77/proto:74)
>     -----
>      ...
>      * Improvements to the drbddisk script, to do the right thing
>        in case Heartbeat is configured with a smaller timeout than DRBD.
>      ...

I am running drbd 0.7.10.

> and second: heartbeat 1.2.3 still has a bug (which is already fixed in the cvs
> branch STABLE_1_2) that ignores failure of resource scripts, i.e.
> if one resource fails to start, it still continues and (tries to) start
> resources later in the list that (may) depend on the earlier, failed
> resource.

I am actually running a STABLE_1_2 CVS version of heartbeat, and I have
seen the behavior of backing off on a failed resource script: an
/etc/init.d script returned an error code when trying to start a
service that was already started, and both nodes ended up being shut
down, which was not an improvement!  In the case I'm talking about, the
remote end has just been stonith'ed, so backing off on a failed
resource is not going to help.

> > > Wouldn't it make sense for drbd, when told to become primary, to do a
> > > quick check, maybe one or two queries with one-second timeouts, to see
> > > if its peer is still alive and if not then go ahead and become primary?
> 
> I don't think so.
> but you are free to configure smaller timeouts for drbd.
> just be careful that you don't lose the connection because of
> too-short timeout periods.

No, I think you missed the point.  Remember that only heartbeat died
on the active server.  In the case I'm talking about, the active
server's drbd keeps running until the power is pulled, and heartbeat
then immediately tries to bring up the backup server's drbd.  I can't
make the drbd timeout short enough.  The only other alternative is to
make heartbeat wait after a stonith, and I'm not quite sure when & how
long it should wait.
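One way to approximate that wait, if nothing better turns up, would be
to wrap the primary promotion in a small retry loop.  This is only a
sketch: the resource name "r0" and the retry count are made up, and the
drbddisk improvements mentioned in the 0.7.6 ChangeLog may already do
something along these lines.

```shell
#!/bin/sh
# Sketch: retry a command a few times with a pause in between,
# e.g. to give our drbd time to notice that the stonith'ed peer
# is gone before we try to become primary.
retry() {
    tries=$1; shift
    n=0
    while [ "$n" -lt "$tries" ]; do
        "$@" && return 0        # success: stop retrying
        n=$((n + 1))
        sleep 1                 # wait a moment before the next attempt
    done
    return 1                    # all attempts failed
}

# hypothetical usage (resource name "r0" is an assumption):
# retry 5 drbdadm primary r0
```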

- Dave


