[DRBD-user] Default Split Brain Behaviour

Mon Jan 31 09:11:04 CET 2011

On 01/29/2011 02:43 AM, Lewis Shobbrook wrote:
> That's correct no sync has taken place & it is still un-synced.
> 
> ...
> 
> The resource nodes are still disconnected and no override has been used to force the situation.
> The only commands issued have been drbdadm connect all, drbdadm connect x2, drbdadm primary x2 (on the only node that has ever been primary) and drbd attach.
> I'm the only one with access to these machines, I can assure you sync has not been forced at any time.
> 
> The only log record against this resource in all  archived messages prior to the system restart is..
> Jan 11 11:30:31 emlsurit-v4 kernel: [7745016.672246] block drbd9: disk( UpToDate -> Diskless )
> I expect this is the point at which the drbdadm detach was issued, while the node was primary and active. 

Holy shit. Now this is a useful piece of information.

You made your Primary diskless 12 days before your alleged DRBD problem.
This of course means all writes were done on the Secondary only (you
were in a degraded state).

This is all fine, except that after the reboot you made your main node
(the one that had been diskless for those 12 days) Primary again before
the handshake with the peer took place. Hence split-brain.
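
To avoid this, look at the connection and disk state before promoting.
A minimal sketch, assuming the plain state strings that "drbdadm cstate"
and "drbdadm dstate" print on 8.3 (e.g. "Connected", "UpToDate/DUnknown").
safe_to_promote is a made-up helper, not part of DRBD, and x2 is only the
resource name taken from your mail:

```shell
# Hypothetical helper: succeed only if the node is connected to its peer
# and its own disk is UpToDate. Deliberately conservative: anything else
# (WFConnection, StandAlone, Diskless, sync in progress) is refused.
safe_to_promote() {
    cstate="$1"                 # e.g. "$(drbdadm cstate x2)"
    dstate="$2"                 # e.g. "$(drbdadm dstate x2)", local/peer
    local_disk="${dstate%%/*}"  # keep only the part before the slash
    if [ "$cstate" = "Connected" ] && [ "$local_disk" = "UpToDate" ]; then
        return 0
    fi
    return 1
}

# Possible use on a node (not executed here):
#   if safe_to_promote "$(drbdadm cstate x2)" "$(drbdadm dstate x2)"; then
#       drbdadm primary x2
#   else
#       echo "x2: not connected or local disk not UpToDate" >&2
#   fi
```

The check refuses more than strictly necessary, which is probably what
you want for a script that runs unattended after a reboot.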

> From the command history I can't determine which node the detach was issued from.

The one that went diskless. Detach is a local operation and doesn't
affect the peer.
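
Since the kernel logs every disk state change together with the host
name, the logs of both nodes will tell you where it happened. A rough
sketch that filters such lines from standard input; disk_transitions is
a made-up name, the pattern is modelled on the log line you quoted, and
the log file path is an assumption about your setup:

```shell
# Hypothetical filter: print only DRBD disk state transitions,
# e.g. "block drbd9: disk( UpToDate -> Diskless )".
disk_transitions() {
    grep -E 'block drbd[0-9]+: disk\( [A-Za-z]+ -> [A-Za-z]+ \)'
}

# Possible use on each node (path is an assumption, not executed here):
#   disk_transitions < /var/log/messages
```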

> Does it matter which node a drbdadm detach is issued from?

Yes, it's essential.

> On node A it details system start Jan 23 15:07:16,  
> The resource was later set primary before network connection between the nodes...
> Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.121108] block drbd9: role( Secondary -> Primary ) 
> Jan 23 15:53:01 emlsurit-v4 kernel: [ 2756.122546] block drbd9: Creating new current UUID
> A minute later we can see the KVM instance start up and libvirt access the resource...

As stated above, going primary whilst unconnected is potentially harmful.

> Perhaps what has been confusing the matter is my initial post associating split-brain with the data loss.
> The node was primary and active prior to any split brain, and it seems to me that the roll back/loss of data had occurred prior to split-brain. 
> The only conceivable possibility to me, is still that NodeA has rolled back or discarded changes in its activity log following the restart.

It has not. During split-brain, no data is synced until you explicitly
allow it.

> As far as I can determine this occurred prior to the split-brain, while the resource nodes were still disconnected (prior to restoration of network connectivity).

Outright impossible.

> Just to be thorough, I'll export the KVM instance XML and start it up to investigate the other node, but do not expect to find the data that's missing there.

You should. In any case, I hope you haven't made the guest on the main
node operative in the meantime. Because you will really want to declare
that node the split-brain victim.
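
For reference, manual recovery usually looks roughly like this. It is a
sketch assuming DRBD 8.3 command syntax (on 8.4 the option goes after
the subcommand: "drbdadm connect --discard-my-data <resource>"). The two
function names are made up for illustration, and the resource name is
whatever yours is called. Stop the guest on the victim first, and only
do this once you are sure which node holds the data you want to keep:

```shell
# Run on the split-brain victim: its changes since the split are discarded
# and it resyncs from the peer.
discard_local_changes() {
    res="$1"
    drbdadm secondary "$res" &&
    drbdadm -- --discard-my-data connect "$res"
}

# Run on the survivor if it is StandAlone; its data is kept.
reconnect_survivor() {
    res="$1"
    drbdadm connect "$res"
}

# Possible use (not executed here):
#   victim:    discard_local_changes x2
#   survivor:  reconnect_survivor x2
```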

HTH,
Felix