[DRBD-user] Avoid split brain in a dual primary configuration with intelligent switches

Tue Mar 17 13:35:38 CET 2009

I am replying to myself. Re-reading the DRBD documentation, I finally found
the write quorum explanation.
And I discovered why, using dual primary, I always get a split brain
after a disconnection.

Now I do not understand two things:

- why single primary mode (master slave) does not need a write quorum;
- how dual primary works. I imagine it like this (protocol C):

Good communication:
server A receives an order to write a disk block;
server A writes it to disk;
A sends it to B;
B writes it and acks;
A receives the ack and tells the upper layer that the write is good.

Bad communication:
server A receives an order to write a disk block;
server A writes it to disk;
A sends it to B;
COMMUNICATION FAILURE
B does not receive anything, nor can it reply;
A times out and reports a write error to the upper layer.
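
To make the two scenarios above concrete, here is a toy model in Python. It
only encodes my guess at the behaviour; `Node`, `replicated_write` and the
`link_up` flag are names I made up for illustration, and nothing here is taken
from the DRBD source or confirmed by the documentation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Toy stand-in for one server; 'disk' maps block number -> data."""
    name: str
    disk: Dict[int, bytes] = field(default_factory=dict)


def replicated_write(local: Node, peer: Node, block: int, data: bytes,
                     link_up: bool) -> bool:
    """Hypothetical model of the steps listed above (not real DRBD code).

    Returns True if the write is reported as good to the upper layer,
    False if a write error is reported.
    """
    local.disk[block] = data      # A writes the block on its own disk
    if not link_up:               # COMMUNICATION FAILURE: nothing reaches B
        return False              # A times out waiting for the ack
    peer.disk[block] = data       # B writes the block
    return True                   # B acks; A reports success


a, b = Node("A"), Node("B")
print(replicated_write(a, b, 7, b"x", link_up=True))   # True
print(replicated_write(a, b, 8, b"y", link_up=False))  # False
print(8 in a.disk, 8 in b.disk)                        # True False
```

Note that in the failure case block 8 is already on A's disk but not on B's,
so the two copies differ even though the upper layer saw an error.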

I suppose now that A and B are blocked (they cannot complete writes) and I
(or the cluster manager) can decide to shut down one server.

My question is: I still do not understand what DRBD really does in this
situation. Is it like the above, or is it different?
The other question is: why does single primary mode (master slave) not need a
write quorum?

Thanks again for your help!

2009/2/10 Mario Giammarco <mgiammarco at gmail.com>

> Hello,
> I am trying to build an iSCSI SAN using DRBD in a dual primary
> configuration.
>
> I have read the DRBD documentation but I have not fully understood how it
> handles split brain.
>
> My hardware is set up like this: two identical servers with RAID 6. Each one
> has 4 Ethernet cards, configured as two trunks.
> Each trunk is connected to a hardware switch. The two switches are
> "intelligent", so each has its own IP address.
>
> My idea (correct me if I am wrong) is this: when one primary finds that it
> cannot talk to the other primary, it tries to ping the switches.
>
> If it cannot ping the switches, it means that all its Ethernet cards or all
> the switches are broken, so it shuts itself down.
> If it can ping the switches, it means that the other primary is broken, so it
> tries to STONITH it.
>
> After reconnection it is clear that the primary that could not ping the
> switches must resync with the other one.
>
> Can you tell me if I can customize the behaviour of DRBD to follow these
> rules? (dopd? peer-outdater?)
> Can you tell me if my rules are enough?
> Can you tell me if DRBD already implements a better strategy and so my rules
> are stupid?
>
> Thank you in advance for any reply!
>
> Mario
>
>
>
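
For reference, the rules from my original message quoted above can be written
down as a small decision function. This is only a sketch in Python: `Action`,
`decide` and the `ping` callable are hypothetical names, the addresses are
placeholders, and none of this is a DRBD, dopd or heartbeat API. I also assume
that "can ping switches" means at least one switch answers.

```python
from enum import Enum
from typing import Callable, Iterable


class Action(Enum):
    CONTINUE = "continue"            # peer is reachable, nothing to do
    SHUTDOWN_SELF = "shutdown self"  # we look isolated from the network
    STONITH_PEER = "stonith peer"    # network looks fine, peer looks dead


def decide(peer_reachable: bool, switch_ips: Iterable[str],
           ping: Callable[[str], bool]) -> Action:
    """Apply the proposed rules; 'ping' is injected so this can be tested."""
    if peer_reachable:
        return Action.CONTINUE
    # Assumption: one answering switch is enough to say "I can ping switches".
    if any(ping(ip) for ip in switch_ips):
        return Action.STONITH_PEER
    return Action.SHUTDOWN_SELF


# Example with placeholder addresses and a fake ping that always fails.
switches = ["192.0.2.1", "192.0.2.2"]
print(decide(False, switches, lambda ip: False))  # Action.SHUTDOWN_SELF
```

One weakness this makes visible: if only the link between the two servers
fails while both can still reach a switch, both nodes would choose to STONITH
the other at the same time.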