[DRBD-user] Split Brain due to 'diskless' state with pacemaker/heartbeat

Fri Jun 1 18:22:10 CEST 2012

On Fri, Jun 01, 2012 at 02:20:31PM +0200, Florian Haas wrote:
> On 06/01/12 11:16, Philip Gaw wrote:
> > 1. Secondary goes into diskless state due to a broken array (as
> > expected) while primary is being written to
> > 2. Primary then dies (power failure)
> > 3. Secondary gets rebooted, or dies and comes back online etc.
> > 
> > The secondary will become primary - and as secondary was 'diskless'
> > before, contains out of date stale data.
> > 
> > When the 'real' primary comes back online we then have our split brain.
> > 
> > 
> > What I think needs to happen is a way to mark 'diskless' state as
> > outdated so that pacemaker will not attempt to bring this node into
> > primary.
> 
> That's a catch-22. The "outdated" state is stored locally in the DRBD
> metadata, which we don't have access to if the resource is Diskless.
> 
> > As this disk is diskless with internal metadata, this cannot be stored
> > in the drbd metadata.
> 
> Which presently rules out the Outdated state. So you figured that part
> out yourself.

There is one improvement we could make in DRBD:
call the fence-peer handler not only for connection loss,
but also for peer disk failure.

> > Alternatively, a constraint in pacemaker on diskless state until a
> > re-sync has been completed.
> 
> You could actually do that with using the crm-fence-peer.sh handler as
> your local-io-error handler, albeit with two drawbacks:
> 
> 1. The local-io-error has an exit code convention that is different from
> the fence-peer one (so you'd need to use a wrapper).

The exit code of the local-io-error handler is ignored.

> 2. In order to actually mask the I/O error from your upper layers, you'd
> now have to call "drbdadm detach" from the local-io-error handler, and
> iirc calling drbdadm from a drbdadm handler is a bad idea.

The local-io-error handler is called after the device has already been
detached; it is just an additional action.

> > Any Suggestions?
> 
> Lars: would it make sense for a Secondary that detaches (either by user
> intervention or after an I/O error) to at least _try_ to outdate itself
> in the metadata?

I think it does.

There is a related scenario:
 Alice crashes.

 Bob was primary already, or it took over, does not matter.
 Bob continues to modify data.

 Bob goes down (cleanly or uncleanly, does not matter).

 Alice comes back.

 Now what?
   What should a single node (in a two node cluster) do after startup?
   It does not know if it has good or bad data.
   Even if Bob had placed a constraint,
   in this scenario that constraint cannot make it to Alice.

   So there you have your policy decision.
   If you do not know for sure,
     Do you want to stay down just in case,
     risking downtime for no reason?

     Do you want to go online, despite your doubts,
     risking going online with stale data?

With multiple failures, you will always be able to construct
a scenario where you end up at the above policy decision.

In any case, if you configure fencing resource-and-stonith,
DRBD comes up as "Consistent" only (not UpToDate),
so it needs to fence the peer, or promotion will fail.
If the peer is unreachable (exit code 5) and DRBD is only Consistent,
DRBD counts that as a failure and will refuse promotion.
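
As a rough illustration of that rule (not DRBD's actual code), the
decision for a node that is only Consistent could be modelled like this.
The function name and FENCED_CODES are made up for the sketch, and the
default list of exit codes that count as "peer fenced" is an assumption;
the only code taken from the text above is 5.

```shell
#!/bin/sh
# Rough model of the rule described above; NOT DRBD's actual logic.
# It only covers a node whose disk state is Consistent (not UpToDate).
# FENCED_CODES is an assumption of this sketch: the fence-peer handler
# exit codes that are taken to mean "the peer is safely out of the way".
FENCED_CODES="${FENCED_CODES:-4 7}"

# consistent_may_promote FENCE_RC
# Returns 0 if promotion would be allowed in this model, 1 if refused.
consistent_may_promote() {
    cmp_rc=$1
    for cmp_code in $FENCED_CODES; do
        # Handler reported the peer as fenced: allow promotion.
        [ "$cmp_rc" = "$cmp_code" ] && return 0
    done
    # Everything else, including 5 (peer unreachable), counts as a
    # failed fencing attempt, so promotion is refused.
    return 1
}
```

Calling consistent_may_promote 5 returns non-zero in this model, which
corresponds to the refused promotion.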

To make it "do the right thing" more often, you can add a third node
(real quorum, someone for the cib to sync your constraints to, ...),
use DRBD fencing resource-and-stonith, and possibly make some
"local improvements" to the handler script so that it exits 5 more often,
depending on your paranoia level.
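
One way such a "local improvement" might look is a small wrapper around
the existing fence-peer handler. This is an untested sketch:
REAL_HANDLER, TRUSTED_CODES and paranoid_code are hypothetical names,
and which exit codes you trust is exactly the paranoia-level decision.
Any result that is not explicitly trusted is reported as 5.

```shell
#!/bin/sh
# Hypothetical wrapper for a fence-peer handler; a sketch, not a tested script.
# REAL_HANDLER and TRUSTED_CODES are names invented for this example.
REAL_HANDLER="${REAL_HANDLER:-}"      # path to the handler being wrapped
TRUSTED_CODES="${TRUSTED_CODES:-4}"   # exit codes passed through unchanged

# paranoid_code RC
# Prints RC if it is in TRUSTED_CODES, otherwise prints 5.
paranoid_code() {
    pc_rc=$1
    for pc_ok in $TRUSTED_CODES; do
        if [ "$pc_rc" = "$pc_ok" ]; then
            echo "$pc_rc"
            return 0
        fi
    done
    echo 5
}

# Only act when a real handler has been configured.
if [ -n "$REAL_HANDLER" ]; then
    "$REAL_HANDLER" "$@"
    handler_rc=$?
    exit "$(paranoid_code "$handler_rc")"
fi
```

The more codes you leave out of TRUSTED_CODES, the more often a node
that is only Consistent will stay down rather than go online with
possibly stale data.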

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed