[DRBD-user] Re: Suggestion to prevent split brain situation

Lars Ellenberg lars.ellenberg at linbit.com
Sun Dec 2 13:56:26 CET 2007

On Fri, Nov 30, 2007 at 06:53:32PM +0100, Bas van Schaik wrote:
> Hi all,
> 
> DRBD has been providing redundant storage to my Xen domains for a few
> months now, and it works like a charm! However, I can imagine a situation in which
> a split brain occurs, which is obviously not desirable. First some
> "definitions" concerning my configuration:
>  - I'm using DRBD 0.7, but I think this scenario is applicable to DRBD
> 0.8 too
>  - XenServerA runs x domains and is passive for y domains from XenServerB
>  - XenServerB runs y other domains and is passive for the x domains of
> XenServerA
>  - Both XenServers have some domains configured for autostart at boot
> time using /etc/xen/auto/*
> 
> Now:
>  1) Someone migrates XenDomain1 from XenServerA to XenServerB and
> forgets to update /etc/xen/auto/* on both machines.
>  2) Power failure in the datacenter; the network switch between XenServerA
> and XenServerB dies as a result of the power failure.
>  3) The UPS of the XenServers drops out after 15 minutes of power
> failure, both XenServers have time to issue a clean shutdown.
>  4) Power is restored, XenServers boot.
>  5) XenServerA tries to become primary for the DRBD resource of XenDomain1
> and succeeds: XenServerB cannot be reached, state of the DRBD resource
> becomes Primary/Unknown. However, the storage of XenServerA is out of
> date compared to the storage in XenServerB, but DRBD doesn't know!
>  6) Switch is replaced, XenServers connect and try to resync: split
> brain is discovered!

5)
	configure multiple heartbeat channels.
	if you do not have _independent_ heartbeat communication
	channels, you will experience split brain.

	configure resource level fencing in drbd.
	then, when the drbd replication link is down,
	a node will first need to outdate its peer before becoming primary.

	if you really want to, configure stonith.
	heartbeat would not activate a resource unless it confirmed a
	successful stonith of a supposedly dead node.
	you can also use a "meatware stonith" plugin, which just waits
	for the admin to confirm that the other node is in fact
	dead before it continues.  you increase failover time
	considerably here, though.
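A simplified model of the rule described above, in Python: with resource-level fencing, a node whose replication link is down may only promote itself once fencing its peer has succeeded. The function and helper names are hypothetical; this is a sketch of the decision as stated in this mail, not drbd's or heartbeat's actual code, and it ignores details such as the node's own disk state.

```python
from typing import Callable


def may_promote_while_disconnected(
    fencing_enabled: bool,
    fence_peer: Callable[[], bool],
) -> bool:
    """Hypothetical model: may a node whose replication link is down become primary?

    fence_peer stands in for whatever confirms that the peer can no longer
    write: outdating its disk over another channel (dopd/drbd-peer-outdater),
    a successful stonith, or an admin's "meatware" confirmation. It returns
    True only if that confirmation succeeded.
    """
    if not fencing_enabled:
        # no fencing: the node promotes blindly, which is the split-brain path
        return True
    # fencing enabled: promotion is allowed only after fencing the peer succeeds
    return fence_peer()


def unreachable_peer() -> bool:
    # peer cannot be reached over any channel, so it cannot be fenced
    return False


def reachable_peer() -> bool:
    # peer was reached over an independent channel and fenced
    return True


print(may_promote_while_disconnected(False, unreachable_peer))  # True
print(may_promote_while_disconnected(True, unreachable_peer))   # False
print(may_promote_while_disconnected(True, reachable_peer))     # True
```

The stonith and meatware variants fit the same shape: only the fence_peer callback changes, and promotion blocks until it returns, which is where the extra failover time comes from.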

> Although this is quite a "worst case" scenario,

right.
single heartbeat link.
there is no worse a case.

> one can imagine other
> comparable scenarios in which the split brain can easily occur. To
> prevent this from happening, I suggest that DRBD refuses to become
> primary of a device of which the other side has state "Unknown", unless
> the user passes the "--do-what-I-say" switch.

no.
configure resource level fencing in drbd.  use multiple heartbeat
channels, configure dopd (the drbd outdate-peer daemon) and use the
drbd-peer-outdater.
you should use recent drbd and heartbeat >= 2.1.2 for this.

> An additional feature would be to "remember" the last state of a DRBD
> device: if the last known state was something like
> "Secondary/Unknown", DRBD should really refuse to become primary of
> that device.

no. definitely not.
it would prevent a degraded cluster from coming back up after a reboot.
again, configure resource level fencing.

after configuring the devices, the drbd init script by default
waits for the connection to its peer.
it uses different timeouts depending on
whether it was connected before or not.
the default is to wait forever if it was connected before,
and to wait two minutes if it was not able to see its peer before.
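The timeout choice boils down to one branch. A minimal Python sketch, assuming the defaults just described; the function name is hypothetical, while wfc-timeout and degr-wfc-timeout are the drbd.conf startup options that hold the two values. The sketch treats 0 as "wait forever" for both settings.

```python
from typing import Optional

# Defaults as described above. In drbd.conf these two values live in the
# startup section as wfc-timeout and degr-wfc-timeout.
WFC_TIMEOUT = 0          # seconds; 0 means "wait forever"
DEGR_WFC_TIMEOUT = 120   # seconds, i.e. two minutes


def connect_wait_timeout(
    was_connected_before: bool,
    wfc_timeout: int = WFC_TIMEOUT,
    degr_wfc_timeout: int = DEGR_WFC_TIMEOUT,
) -> Optional[int]:
    """Hypothetical model of the init script's choice of wait timeout.

    Returns the number of seconds to wait for the peer, or None to wait
    forever. This sketch treats 0 as "forever" for both settings.
    """
    if was_connected_before:
        timeout = wfc_timeout       # node went down as part of a connected cluster
    else:
        timeout = degr_wfc_timeout  # node was already alone (degraded) before
    return None if timeout == 0 else timeout


print(connect_wait_timeout(True))   # None (wait forever)
print(connect_wait_timeout(False))  # 120
```

The short degraded timeout is what lets a cluster that was already running on one node come back up after a reboot, which is exactly what remembering "Secondary/Unknown" and refusing to promote would break.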

> This would at least prevent my Xen servers (using block-drbd script)
> from automatically starting domains and creating a split brain. Of
> course the migrated XenDomain1 will not be instantiated at boot time by
> either one of the Xen servers, but that's not as bad as creating a split
> brain!
> 
> Please let me know what you (especially the developers :P) think of my
> proposal.

think it through again.
and use what is available.

the main point being:
 use
 multiple
 independent
 communication channels.
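Why independent channels matter can be modelled in a few lines. A Python sketch, assuming the peer is declared dead only when no heartbeat has arrived on any channel for longer than a deadtime (heartbeat's ha.cf has a setting of that name); the class, function and channel names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Channel:
    name: str
    last_heard: float  # time the last heartbeat arrived on this channel


def peer_is_dead(channels: List[Channel], now: float, deadtime: float) -> bool:
    """Hypothetical model: the peer counts as dead only if every channel is silent.

    One failed switch silences only the channels routed through it; as long as
    any independent channel still delivers heartbeats, the peer stays alive
    and no takeover is triggered.
    """
    return all(now - ch.last_heard > deadtime for ch in channels)


now, deadtime = 100.0, 30.0
switch_lan = Channel("eth0 via switch", last_heard=20.0)  # silent for 80 s
crossover = Channel("eth1 crossover cable", last_heard=99.0)
serial = Channel("serial null-modem", last_heard=98.5)

print(peer_is_dead([switch_lan], now, deadtime))                     # True
print(peer_is_dead([switch_lan, crossover, serial], now, deadtime))  # False
```

With only the switched link, a dead switch is indistinguishable from a dead peer; adding a crossover cable or a serial line turns the same failure into a harmless link outage.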

-- 
: commercial DRBD/HA support and consulting: sales at linbit.com :
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


