[DRBD-user] To stonith or not to stonith?

Wed Sep 7 00:47:53 CEST 2005

Dave Dykstra wrote:
> On Tue, Aug 30, 2005 at 11:24:29AM +0200, Lars Ellenberg wrote:
>>/ 2005-08-29 13:39:20 -0500
>>\ Dave Dykstra:
>>>(someone asked about why to use stonith if DRBD prevents corruption)
>>>Drbd will prevent data corruption on its own, but stonith with drbd can
>>>give you increased uptime because there are cases when a standby drbd or
>>>heartbeat will refuse to take over until the formerly active one has been
>>>proven to be shut down.
>>which are: ... ?
>>
>>btw:
>>we at LINBIT make sure that heartbeat has as many communication
>>channels as possible, but try to avoid stonith in most deployments:
>>we had cases where heartbeat would reboot one node, and might have
>>stonithed the other at the same event -- not exactly heartbeat's fault,
>>more "misbehaving resource agents", but still very annoying.
>>
>>we feel better if we automate as little as possible,
>>though obviously as much as necessary or convenient.
>>
>>as far as I can see, stonith with drbd does not really buy you anything.
> 
> You know better than I do, Lars, about the states that DRBD can get into,
> but I know that heartbeat tries very hard to avoid split brain and doesn't
> distinguish between whether it's using DRBD or not.   I initially tried
> to get by without stonith but eventually came to the conclusion that I
> needed it because failovers sometimes didn't happen properly.   Come to
> think of it, it may be because if heartbeat dies on the active side but
> DRBD doesn't, the takeover by heartbeat fails and I had assumed that a
> stonith would clean that up.  As it turns out, DRBD still won't take over
> immediately after a stonith, not until it times out, and that continues
> to be a thorny issue that I've raised on both mailing lists and do not
> yet have an answer for.

STONITH can keep both sides from becoming master at the same time.
That split-brain situation requires human intervention to recover from.

It's not a happy circumstance, and STONITH avoids it.

But with DRBD, both sides becoming master is not as serious as it would 
be with true shared storage, where all the online data is destroyed - 
an even less happy circumstance.

I don't see a huge problem caused by stonithing a node which has killed 
itself.  But, maybe I missed something.
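
To make the point about STONITH keeping both sides from becoming master 
concrete, here is a minimal sketch of the takeover decision in Python. 
It is not Heartbeat or DRBD code: PeerState, may_become_master and the 
fence_peer callback are hypothetical names invented for illustration, 
and fence_peer is assumed to return True only once the power-off has 
been confirmed.

```python
from enum import Enum
from typing import Callable


class PeerState(Enum):
    """What this node believes about the other node (hypothetical model)."""
    ALIVE = "alive"                    # peer still answers on some channel
    UNKNOWN = "unknown"                # contact lost, peer may still be running
    CONFIRMED_DOWN = "confirmed_down"  # peer is known to be shut down


def may_become_master(peer: PeerState, fence_peer: Callable[[], bool]) -> bool:
    """Decide whether this node may take over as master.

    fence_peer is an assumed callback that performs STONITH and returns
    True only if the peer was successfully powered off.
    """
    if peer is PeerState.ALIVE:
        # The peer is running, so taking over would give two masters.
        return False
    if peer is PeerState.CONFIRMED_DOWN:
        # Nothing left to fence; takeover is safe.
        return True
    # Peer state is unknown: fence first, and take over only if that worked.
    # If the peer had already killed itself, the fence is redundant but harmless.
    return fence_peer()
```

In this model a node never becomes master on a guess: an unknown peer is 
either fenced successfully or the takeover is refused.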

Regarding DRBD not taking over when it hasn't declared the other node 
dead, I would think that a good solution might be to have DRBD wait up 
to "drbd deadtime" seconds before giving up.
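
A rough sketch of that wait, in Python rather than anything DRBD 
actually ships: poll until the peer is declared dead, and give up only 
once the deadtime has elapsed. The function name wait_for_peer_dead, the 
peer_is_dead callback and the poll_interval parameter are all 
assumptions made for illustration; the clock and sleep arguments exist 
only so the logic can be exercised without real delays.

```python
import time
from typing import Callable


def wait_for_peer_dead(
    peer_is_dead: Callable[[], bool],
    deadtime: float,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait up to `deadtime` seconds for the peer to be declared dead.

    Returns True as soon as peer_is_dead() reports True, or False if the
    deadtime elapses first (the caller then gives up on the takeover).
    """
    deadline = clock() + deadtime
    while True:
        if peer_is_dead():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

A takeover script could call this after a stonith and attempt to become 
primary only if it returns True, instead of failing on the first try.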

Since Heartbeat happily has no clue about DRBD (or its internal 
deadtime), it would seem to be best dealt with by DRBD.

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce