[DRBD-user] Need help with automatic split-brain recovery

Mon Jan 26 09:47:44 CET 2009

On Fri, Jan 23, 2009 at 04:14:01PM +0100, Tobias Appel wrote:
> Hi everyone,
> 
> I'm running RHEL5 and installed the following drbd packages:
> 
> drbd.i386                                8.0.14-23.el5          
> drbd-kmdl-2.6.18-92.1.22.el5.i686        8.0.14-23.el5          
> drbd-kmdl-2.6.18-92.1.22.el5PAE.i686     8.0.14-23.el5          
> 
> I'm using heartbeat and configured DRBD as master/slave resource (but
> this is not really the important part). I hit the reset button on the
> primary node currently running DRBD, heartbeat did the failover
> correctly but when the node came back a split-brain was detected and
> when I do cat /proc/drbd now it says:
> 
>  cs:StandAlone st:Secondary/Unknown ds:UpToDate/DUnknown   r---
> 
> Weird thing is, I have configured automatic split-brain recovery in my
> drbd.conf and I thought this was ok. This is part of my drbd.conf:
> 
> net {
> timeout 60;
> connect-int 10;
> ping-int 10;
> max-epoch-size 2048;
> max-buffers 2048;
> after-sb-0pri discard-younger-primary;
> after-sb-1pri consensus;
> after-sb-2pri disconnect;
> }
> 
> I also have a handler configured like this:
> 
> handlers {
> split-brain "/usr/lib/drbd/notify.sh tappel at eso.org";
> }
> 
> I downloaded the notify.sh manually and put it there since it did not
> come with my rpm (from the dag repository for rhel5).
> I received two emails, but somehow it seems DRBD did not see this as a
> split-brain because the message says the following:
> 
> DRBD on nagios2 was configured to launch a notification handler
> for resource r0,
> but no specific notification event was set.
> This is most likely due to DRBD misconfiguration.
> Please check your configuration file (usually /etc/drbd.conf).
> 
> Now what did I do wrong in my configuration file?

that notify script expects to be symlinked with different base names,
and deduces the "specific action" from its base name.
call the symlink "notify-split-brain.sh", and reference that from your
drbd.conf.
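
To illustrate the mechanism (a sketch only, not the real notify.sh): a script can pick its action from the name it was invoked as, so the same file behaves differently when called through a symlink. The stub script, its messages, and the temp directory below are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical stand-in for notify.sh: it only mimics the idea of
# dispatching on the name the script was called as.
dir=$(mktemp -d)

cat > "$dir/notify.sh" <<'EOF'
#!/bin/sh
case "$(basename "$0")" in
  notify-split-brain.sh) echo "event: split-brain" ;;
  *)                     echo "event: unspecified" ;;
esac
EOF

# Same file, second name.
ln -s notify.sh "$dir/notify-split-brain.sh"

# Run via sh so the sketch does not depend on the temp dir allowing execution.
direct=$(sh "$dir/notify.sh")
linked=$(sh "$dir/notify-split-brain.sh")

echo "$direct"   # event: unspecified
echo "$linked"   # event: split-brain

rm -rf "$dir"
```

On the real system the equivalent would presumably be a symlink named notify-split-brain.sh next to /usr/lib/drbd/notify.sh, with the split-brain handler in drbd.conf pointing at the symlink instead of at notify.sh.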

but the important point here is that,
if the only thing you did was "hit the reset button on the Primary",
it should only be a normal failover, reboot of the "failed" (reset) box,
rejoin, resync, done.

the reset box should not be promoted to Primary before
reconnecting and resynchronising with the other node.
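
As a rough illustration of that rule, a check like the following could be run before anything promotes the rebooted node. It is a minimal sketch: the function name drbd_in_sync is hypothetical, it assumes a single resource, and it assumes /proc/drbd shows the cs: and ds: fields on one line, as in the status line quoted above.

```shell
#!/bin/sh
# Succeeds only if the status text shows a connected resource whose
# local and peer disks are both UpToDate.
# $1: file with /proc/drbd-style content (defaults to /proc/drbd).
# Assumption: single resource; with several, this matches if ANY is in sync.
drbd_in_sync() {
    grep -q 'cs:Connected .*ds:UpToDate/UpToDate' "${1:-/proc/drbd}"
}

# Example call, guarded so it is harmless on a box without DRBD loaded:
if [ -r /proc/drbd ]; then
    if drbd_in_sync; then
        echo "connected and in sync"
    else
        echo "not connected or not in sync: do not promote"
    fi
fi
```

A node showing cs:StandAlone, as in the original report, fails this check.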

first you should find out why you run into a split brain.
the kernel logs (and maybe heartbeat logs) should tell you.
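
One way to start digging, as a sketch: pull the DRBD-related lines out of the syslog file. The function name drbd_log_lines is hypothetical, and /var/log/messages is assumed as the default location (usual on RHEL; adjust for your syslog setup). The search pattern is a guess at useful keywords, not an exact DRBD message.

```shell
#!/bin/sh
# Print DRBD-related lines from a syslog file, prefixed with line numbers.
# $1: log file to search (defaults to /var/log/messages, an assumption).
drbd_log_lines() {
    grep -n -i -E 'drbd|split.?brain' "${1:-/var/log/messages}"
}

# Example call (typically needs root to read the log):
# drbd_log_lines /var/log/messages | tail -n 50
```

Comparing the timestamps of those lines on both nodes with the heartbeat log around the reset should show which side was promoted, and when.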

then you should fix that.

and only once you have fixed your setup to work properly for these common
failover scenarios should you worry about automatic split brain
recovery.

automatic split brain recovery is not a solution,
but a crude workaround for an insufficient setup.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed