[DRBD-user] DRBD 8.0, how to manage a split-brain on Master Master

Tue Feb 13 11:30:04 CET 2007

> What is the best way to manage a split-brain on a Master<>Master setup on
> DRBD 8.0 ?
Personally, I'd say "Manually" or "With extreme prejudice".  Anything  
else is likely to cause difficulty somewhere - and that's a generic  
thing about split brain, not drbd specific.

> DRBD will see by itself if a node is up, from what I understood,
DRBD will see if the node is up, and available via a network  
connection.  There's no magic "It's up, even though I can't talk to it  
over the defined network" option.  Support for a secondary network  
link would be nice, but it's probably not worth the extra effort.

> will it be automatically synced 1-1 both ways ?
If they are both still in "primary" mode, then this isn't going to  
work.  There's no way to do a 2 way resync without understanding of  
the higher level data - and even then you're likely to get conflicts.

Take the simple case of a single EXT3 partition that got mounted on  
both as a result of a split brain - you could use something like rsync  
to resync the filesystems, but that wouldn't necessarily be the right  
thing to do anyway:
1) Edit file A on node 1 (A->A1), edit file B on node 2 (B->B2)
2) Now wait a while, and edit B on node 1 (B->B1), and A on node 2 (A->A2).
3) remerge, and rsync.
4) You are left with A2 & B1, which means that you've not got the  
correct data from either mirror, and probably nothing that makes any  
sense (think about them as config files, you've got half the config  
from each node).
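
To make the example above concrete, here is a minimal Python sketch.  It
assumes the resync is a "newest copy of each file wins" merge, which is
roughly what running rsync --update in both directions would give you.
The timestamps and the newest_wins helper are made up for illustration.

```python
# Each node's view of the filesystem after the split:
# filename -> (modification time, content).  Times are illustrative only.
node1 = {"A": (1, "A1"), "B": (3, "B1")}  # A edited first, B edited later
node2 = {"A": (4, "A2"), "B": (2, "B2")}  # B edited first, A edited later


def newest_wins(left, right):
    """Merge two file trees, keeping whichever copy of each file is newer."""
    merged = {}
    for name in sorted(left.keys() | right.keys()):
        candidates = [v for v in (left.get(name), right.get(name)) if v is not None]
        merged[name] = max(candidates, key=lambda v: v[0])
    return merged


merged = newest_wins(node1, node2)
result = {name: content for name, (_mtime, content) in merged.items()}
print(result)  # {'A': 'A2', 'B': 'B1'}
```

The merged tree matches neither node's view: node 1 had A1 & B1, node 2
had A2 & B2, and the result is A2 & B1.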

> Some People claim that when node-2 came back online it needs to resync all
> the data from node-01. Is this true, or is it smart enough to only sync
> the new files ?
Files are at a different level to the drbd device.  Without teaching  
the sync utility about all the available file systems (ext2, ext3,  
reiserfs, jfs, xfs, gfs, ocfs, the list goes on) this couldn't  
happen.  And even then (see example above) it is probably not what you  
want.
You could possibly get away with re-syncing any blocks that have  
changed on either node to the secondary you pick (e.g. add together  
the change list for both nodes, and push that) but I'm not sure that  
I'd want to do it that way, even if in theory it would work... I'm too  
scared that my data would be corrupted (although if that is what this  
does, I'm happy to trust these guys - it's my code I don't trust).
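
As a rough illustration of the "add together the change list" idea, here
is a small Python sketch.  It is not DRBD's code, and I'm not claiming it
is how DRBD does it: the devices are plain lists of blocks, the change
lists are sets of block numbers, and resync_from is a hypothetical helper.

```python
def resync_from(source, target, dirty_source, dirty_target):
    """Overwrite on target every block that changed on either node.

    Blocks changed on the source must be copied because the target lacks
    them; blocks changed on the target must be copied because the target's
    version is being thrown away.
    """
    to_copy = dirty_source | dirty_target
    for block in sorted(to_copy):
        target[block] = source[block]
    return to_copy


# Both nodes start from the same six blocks, then diverge during the split.
common = ["a", "b", "c", "d", "e", "f"]
survivor = list(common)  # the node whose data we decide to keep
victim = list(common)    # the node that is told it is wrong

survivor[1] = "b-new"
dirty_survivor = {1}
victim[4] = "e-lost"
dirty_victim = {4}

copied = resync_from(survivor, victim, dirty_survivor, dirty_victim)
print(sorted(copied), victim == survivor)
```

Only two of the six blocks move over the wire, and the victim's change to
block 4 is gone for good, which is exactly the part that makes me nervous.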

> I think DRBD 8.0 has almost everything in this case you need, the only
> thing is a split-brain that you have to manage well.
With split brain and drbd, one of the two nodes is about to be told  
that it's wrong.  That its data is wrong, and that all that it thinks  
it knows about the drbd device is wrong, and this tends to get ugly.

The only safe way (without lots of hooks into lots of applications) to  
do this is to kill anything that's talking to the device, refresh the  
device from the copy you want to use (or just make it available during  
the resync), and then let things access it again.  You can't pause and  
re-allow, since that way the app could easily have cached data - it  
has to be a kill.  Personally my belief is that the best option is to  
reboot the secondary node - that way you guarantee that everything is  
reset to a known good state - but I certainly accept that this is a  
little heavy for some uses.  I think it's configurable within the drbd  
config file - mine is set to disconnect instead, and wait for me to  
deal with it - as I said, I think manual intervention is the way to go  
at this point.

To put it another way, the only way to really deal with split brain is  
to not let it develop in the first place - and this is something  
that's been causing grief for clusters for a long time.  In a 2 node  
environment you basically have three options (that I can think of) for  
how to avoid it:
1) Have a 3rd "thing" that you just use as a votekeeper.
    Advantage:    This means that you've got 3 votes available, and therefore
                  you can never have a 50/50 split.
    Disadvantage: You have extra complexity, and dependence on a 3rd device
2) Weight one of the nodes as "more important".
    Advantage:    Very simple to do, very easy to configure, "just works".
    Disadvantage: The other node cannot operate if the more important one is
                  not available, without manual intervention
3) STONITH (Shoot The Other Node In The Head)
    Advantage:    It means that one node will be down if you ever end up in a
                  split brain situation.
    Disadvantage: It kills one of the machines (fsck, etc) - and normally needs
                  human intervention to bring it back.  You can also, if you are
                  unlucky, end up with both nodes dead (Have had this happen with
                  Sun Cluster 2.1).  Which is great for data consistency, but is
                  a bit silly.

With any even number of "votes", you've got the possibility of a 50/50  
split.  With 4 or more you've got other options that would work  
reasonably well - two nodes is often treated as a "special case".
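
The vote counting can be checked with a few lines of Python.  This is only
a sketch of a simple majority rule (a side may carry on only if it sees
more than half of all votes); has_quorum and can_deadlock are names I've
made up, not anything from drbd or a cluster manager.

```python
def has_quorum(votes_seen, total_votes):
    """A side has quorum only with a strict majority of all votes."""
    return 2 * votes_seen > total_votes


def can_deadlock(total_votes):
    """True if some two-way split leaves neither side with a majority."""
    for side_a in range(total_votes + 1):
        side_b = total_votes - side_a
        if not has_quorum(side_a, total_votes) and not has_quorum(side_b, total_votes):
            return True
    return False


for votes in (2, 3, 4, 5):
    print(votes, "votes: 50/50 split possible =", can_deadlock(votes))
```

With 2 or 4 votes an even split leaves nobody in charge; with 3 or 5 one
side always wins, which is the whole point of the 3rd "thing" in option 1.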

Graham