[DRBD-user] Need help with automatic split-brain recovery

Tue Jan 27 12:22:07 CET 2009

On Mon, Jan 26, 2009 at 01:11:39PM +0100, Tobias Appel wrote:
> On Mon, 2009-01-26 at 09:47 +0100, Lars Ellenberg wrote:
> 
> Hi Lars,
> 
> thanks for your help. 

welcome.
please keep list mails to the list,
I very nearly missed this one...

> > but the important point here is that,
> > if the only thing you did was "hit the reset button on the Primary",
> > it should only be a normal failover, reboot of the "failed" (reset) box,
> > rejoin, resync, done.
> 
> you were right - it seemed there was an error in my heartbeat
> configuration, I did the exact same thing today and it worked perfectly.
> 
> I tried another thing with heartbeat though. I pulled the crossover
> cable connecting the 2 nodes. So heartbeat on each node thought the other
> node was dead and promoted DRBD to master on both nodes. I'm not sure if this
> is how it is supposed to be, but if I missed some part of the heartbeat
> configuration please let me know then I can ask at the heartbeat mailing
> list.
> Anyway, after I did plug in the cable again, a split-brain was detected.

well, if you cut all cluster communications,
yes, that is a split brain.
and it can only be detected once communication is reestablished.

> heartbeat turned off all the resources on one node and had only one node
> running, however DRBD was in standalone mode on both nodes.

the data sets diverged.
and during DRBD handshake, there probably still were two primaries,
so no auto-recovery strategy could be used, and both nodes dropped the connection.
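
as a rough illustration of why the number of primaries matters: the net
section of drbd.conf has one auto-recovery setting per case (after-sb-0pri,
after-sb-1pri, after-sb-2pri), and which one is consulted depends on how many
nodes are Primary when the split brain is detected. the Python sketch below
only models that selection. the function names, the role strings and the
dictionary are made up for illustration and are not part of DRBD, and the
policy values assume everything is left at the default of "disconnect".

```python
# Illustrative model only; the names here are hypothetical, not DRBD code.
# Keys are the drbd.conf net options; values assume the default policy.
POLICIES = {
    "after-sb-0pri": "disconnect",
    "after-sb-1pri": "disconnect",
    "after-sb-2pri": "disconnect",
}


def applicable_option(local_role, peer_role):
    """Return the option name consulted for this combination of roles."""
    primaries = [local_role, peer_role].count("Primary")
    return "after-sb-%dpri" % primaries


def resolve(local_role, peer_role, policies=POLICIES):
    """Return (option, policy) for a split brain detected at handshake."""
    option = applicable_option(local_role, peer_role)
    return option, policies[option]


if __name__ == "__main__":
    # Both nodes were promoted while the link was down.
    print(resolve("Primary", "Primary"))
```

with both nodes Primary the two-primaries case applies, and with that left
at "disconnect" the nodes simply drop the connection, as in your test.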

> I had
> to type in on both nodes: 'drbdadm connect r0' and then it was connected
> again. Did this test go the way it was supposed to?

the outcome is as I would expect, yes.

to avoid the data divergence in the first place,
you may want to look into adding a third node, not to take any
resources, but to act as a tie breaker for quorum.
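
the idea behind the tie breaker is plain majority voting. the Python sketch
below is a generic illustration of that rule, not how heartbeat implements
quorum; the function name and its parameters are assumptions made for the
example.

```python
# Generic majority-quorum illustration; not heartbeat's implementation.

def has_quorum(reachable_nodes, cluster_size):
    """True if this partition holds a strict majority of the cluster.

    reachable_nodes counts the local node plus every node it can still talk to.
    """
    return 2 * reachable_nodes > cluster_size


if __name__ == "__main__":
    # Two-node cluster, link cut: each side sees only itself.
    print("2 nodes, isolated node:", has_quorum(1, 2))
    # Three-node cluster, one node cut off from the other two.
    print("3 nodes, isolated node:", has_quorum(1, 3))
    print("3 nodes, remaining pair:", has_quorum(2, 3))
```

in a two-node cluster neither side can ever hold a majority once the link is
cut, so a majority rule alone cannot pick a winner. with a third node, only
the side that still sees two nodes would be allowed to run resources.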

note that the heartbeat "quorumd" server is similar in concept,
but according to the heartbeat development lists it does not work properly,
so don't waste your time on that.

also look into the stonith "meatware" module,
which basically asks for operator confirmation
before taking over services after a node became unreachable
and has been declared dead.
maybe you can use that to your advantage.

> I even got the email
> now and it did work after the split-brain just with a small manual
> intervention (which is fine). 
> Or did I miss something with the automatic split-brain recovery?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
