[DRBD-user] Split Brain due to 'diskless' state with pacemaker/heartbeat

Florian Haas florian at hastexo.com
Fri Jun 1 14:20:31 CEST 2012



On 06/01/12 11:16, Philip Gaw wrote:
> 1. Secondary goes into diskless state due to a broken array (as
> expected) while primary is being written to
> 2. Primary then dies (power failure)
> 3. Secondary gets rebooted, or dies and comes back online etc.
> 
> The secondary will become primary - and as secondary was 'diskless'
> before, contains out of date stale data.
> 
> When the 'real' primary comes back online we then have our split brain.
> 
> 
> What I think needs to happen is a way to mark 'diskless' state as
> outdated so that pacemaker will not attempt to bring this node into
> primary.

That's a catch-22. The "outdated" state is stored locally in the DRBD
metadata, which we don't have access to if the resource is Diskless.

> As this resource is diskless with internal metadata, this cannot be stored
> in the drbd metadata.

Which presently rules out the Outdated state. So you figured that part
out yourself.

> Alternatively, a constraint in pacemaker on diskless state until a
> re-sync has been completed.

You could actually do that by using the crm-fence-peer.sh handler as
your local-io-error handler, albeit with two drawbacks:

1. The local-io-error handler has an exit code convention that is
different from the fence-peer one (so you'd need to use a wrapper).

2. In order to actually mask the I/O error from your upper layers, you'd
now have to call "drbdadm detach" from the local-io-error handler, and
iirc calling drbdadm from a drbdadm handler is a bad idea.
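
For illustration, the configuration side of that idea might look roughly
like the fragment below. This is only a sketch: the resource name r0 and
the wrapper path are made up, and the wrapper itself (which would call
crm-fence-peer.sh and translate its exit code) is not shown, because the
exact mapping between the two exit code conventions would have to be
checked against the DRBD version in use.

```
resource r0 {                      # hypothetical resource name
  disk {
    # Invoke the local-io-error handler on a lower-level I/O error
    # instead of detaching automatically.
    on-io-error call-local-io-error;
  }
  handlers {
    # Hypothetical wrapper script: it would run crm-fence-peer.sh and
    # translate its exit code for the local-io-error convention.
    local-io-error "/usr/local/sbin/io-error-fence-wrapper.sh";
  }
}
```

Note that with call-local-io-error DRBD no longer detaches on its own,
which is exactly where drawback 2 comes from.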

> Any Suggestions?

Lars: would it make sense for a Secondary that detaches (either by user
intervention or after an I/O error) to at least _try_ to outdate itself
in the metadata? Granted, if there is an actual I/O problem that also
affects the metadata area this would fail, and if you've got an I/O
tarpit it might actually exacerbate the problem, but at least DRBD could
try. Or does it do that already?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now


