[DRBD-user] Default Split Brain Behaviour

Fri Jan 28 01:01:51 CET 2011

Thanks for the reply, Lars,

> > Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905963] block drbd9: drbd_sync_handshake:
> > Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905967] block drbd9: self 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143432 flags:0
> > Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905971] block drbd9: peer 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 flags:0
> 
> There. Both nodes have changes the other node did not see (yet).
> That's where DRBD can detect that there previously has been data
> divergence, usually caused by cluster split brain.

I'm struggling to see how the secondary node could have write changes as it has never been primary.
The resource had originally been in sync, then was manually switched to a detached state for roughly 8 days prior to the data rollback.
The primary node, as mentioned, was a KVM instance; this instance does not exist (and never has) on the secondary node.
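
To check my own reading of the two identifier lines quoted above, I put together a rough Python sketch. It is not DRBD code and only a simplified approximation of the handshake comparison: the function names are made up, and the field order (current, bitmap, then two older identifiers) is my assumption from the User's Guide.

```python
# Rough sketch only: hypothetical helpers, not DRBD's actual handshake logic.
# Assumed field order per line: current:bitmap:older:older

SELF = "49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807"
PEER = "6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807"


def parse_ids(line):
    """Split one identifier line into current, bitmap and older fields."""
    current, bitmap, *older = line.strip().split(":")
    return {"current": current, "bitmap": bitmap, "older": older}


def looks_diverged(a, b):
    """True if the current identifiers differ and neither node's current
    identifier shows up among the other node's bitmap or older ones."""
    if a["current"] == b["current"]:
        return False
    a_known = {a["bitmap"], *a["older"]}
    b_known = {b["bitmap"], *b["older"]}
    return a["current"] not in b_known and b["current"] not in a_known


def shared_bitmap_id(a, b):
    """Return the bitmap identifier if both nodes report the same one."""
    return a["bitmap"] if a["bitmap"] == b["bitmap"] else None


self_ids = parse_ids(SELF)
peer_ids = parse_ids(PEER)
print("diverged:", looks_diverged(self_ids, peer_ids))
print("shared bitmap id:", shared_bitmap_id(self_ids, peer_ids))
```

With the values from the log, the current identifiers differ while the bitmap identifiers are the same on both sides, which is consistent with what Lars describes, even though I still can't explain how the secondary ended up with a new current identifier.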

> So... My guess is, that you still have two versions of your data.
> 
> From this log, there was no sync, because DRBD default behavior in that
> case is to disconnect. Therefore no rollback, and no data loss.
> But you certainly have diverging data sets, and my guess is they keep
> diverging still.

That's what I'd be happy for it to do, but the complete rollback of 8 days of work on a web site is pretty obvious and contradicts that.

> You have to figure out when they started to diverge, and why.
> And you have to sort it out, decide which to keep,
> and tell DRBD (see the User's Guide for details on this).

I'd kept the two separate and taken the KVM instance offline in the vain hope that I might be able to roll back the rollback.
I made dd images of each node's LVM volume associated with the resource just in case, but have now accepted my losses, so to speak, and begun the reconstruction.

I've been using DRBD since 2005, and although clearly having much to learn, I'd like to think I have a reasonable handle on the fundamentals.
What I've experienced with the data rollback is both unexpected and unintended.
I'm still unclear as to how this node came to discard 8 days' worth of data, but am very keen to find out.
If you good people are prepared to guide me further, I'm prepared to do what is necessary at my end to try to determine the cause of this.

Cheers,

Lew