[DRBD-user] To stonith or not to stonith?

Tue Sep 6 23:54:16 CEST 2005

On Tue, Aug 30, 2005 at 11:24:29AM +0200, Lars Ellenberg wrote:
> / 2005-08-29 13:39:20 -0500
> \ Dave Dykstra:
> > (someone asked why one would use stonith if DRBD prevents corruption)
> > Drbd will prevent data corruption on its own, but stonith with drbd can
> > give you increased uptime because there are cases when a standby drbd or
> > heartbeat will refuse to take over until the formerly active one has been
> > proven to be shut down.
> 
> which are: ... ?
> 
> 
> 
> btw:
> we at LINBIT make sure that heartbeat has as many communication
> channels as possible, but try to avoid stonith in most deployments:
> we had cases where heartbeat would reboot one node, and might have
> stonithed the other during the same event -- not exactly heartbeat's fault,
> more "misbehaving resource agents", but still very annoying.
> 
> we feel better if we automate as little as possible,
> though obviously as much as necessary or convenient.
> 
> as far as I can see, stonith with drbd does not really buy you anything.

You know better than I do, Lars, about the states that DRBD can get into,
but I know that heartbeat tries very hard to avoid split brain and doesn't
distinguish between running with DRBD and running without it.  I initially
tried to get by without stonith but eventually came to the conclusion that
I needed it because failovers sometimes didn't happen properly.  Come to
think of it, that may be because if heartbeat dies on the active side but
DRBD doesn't, the takeover by heartbeat fails, and I had assumed that a
stonith would clean that up.  As it turns out, DRBD still won't take over
immediately after a stonith, only after it times out, and that continues
to be a thorny issue that I've raised on both mailing lists and for which
I do not yet have an answer.

Alan, can you comment?

- Dave