[DRBD-user] Default Split Brain Behaviour

Lars Ellenberg lars.ellenberg at linbit.com
Fri Jan 28 11:29:03 CET 2011

On Fri, Jan 28, 2011 at 11:01:51AM +1100, Lewis Shobbrook wrote:
> Thanks for the reply Lars,
> > > Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905963] block drbd9:
> > > drbd_sync_handshake:
> > > Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905967] block drbd9: self
> > > 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807
> > > bits:143432 flags:0
> > > Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905971] block drbd9: peer
> > > 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807
> > > bits:336381 flags:0
> > 
> > There. Both nodes have changes the other node did not see (yet).
> > That's where DRBD can detect that there previously has been data
> > divergence, usually caused by cluster split brain.
> I'm struggling to see how the secondary node could have write changes as it has never been primary.

Well, I cannot tell you that.

> The resource had originally been in sync, then was manually switched to a detached state for roughly 8 days prior to the data rollback.
> The primary node as mentioned was a KVM instance, this instance does not exist (never has) on the secondary node.
> > So... My guess is, that you still have two versions of your data.
> > 
> > From this log, there was no sync, because DRBD default behavior in
> > that case is to disconnect. Therefore no rollback, and no data loss.
> > But you certainly have diverging data sets, and my guess is they keep
> > diverging still.
> That's what I'd be happy for it to do, but the complete rollback of 8 days of work on a web site is pretty obvious and contrasts.

Maybe the logs you posted do not match the incident described.

Or you attached to stale data, thinking a rollback had taken place,
but actually it is just stale data and the more recent data is still
on the other node.

But the logs you posted do not show any sync taking place; they even
clearly show that DRBD refused to sync because it detected data divergence.
There cannot have been a rollback, because there has been no sync,
again according to the logs you posted.
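
To make the reading of the handshake lines concrete, here is a minimal
sketch in Python. It assumes the four colon-separated values are, in
order, the current UUID, the bitmap UUID and two history UUIDs, and that
the lowest bit of each value is a flag to be ignored when comparing. The
function names are made up for illustration, and the check is a
simplification, not DRBD's actual decision logic.

```python
# Assumed field order of the UUID set printed by drbd_sync_handshake.
FIELDS = ("current", "bitmap", "history1", "history2")

# Assumption: the lowest bit is a flag and is masked out before comparing.
FLAG_MASK = ~1


def parse_uuids(text):
    """Turn 'AAAA:BBBB:CCCC:DDDD' (hex) into a dict keyed by FIELDS."""
    parts = text.strip().split(":")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d UUIDs, got %d" % (len(FIELDS), len(parts)))
    return {name: int(part, 16) & FLAG_MASK for name, part in zip(FIELDS, parts)}


def looks_diverged(mine, theirs):
    """Simplified check: same non-zero bitmap UUID, different current UUID.

    That pattern suggests both sides started from the same data generation
    and then each changed data the other side never saw.
    """
    return (
        mine["current"] != theirs["current"]
        and mine["bitmap"] == theirs["bitmap"]
        and mine["bitmap"] != 0
    )


# The two UUID sets from the log excerpt quoted earlier in this thread.
self_uuids = parse_uuids(
    "49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807"
)
peer_uuids = parse_uuids(
    "6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807"
)

print(looks_diverged(self_uuids, peer_uuids))
```

For the two sets in the posted log the current UUIDs differ while the
bitmap and history UUIDs are identical, which is the pattern described
above as data divergence.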

> > You have to figure out when they started to diverge, and why.
> > And you have to sort it out, decide which to keep,
> > and tell DRBD (see the User's Guide for details on this).
> I'd kept the two separate and taken the KVM instance offline in the vain hope that I may have been able to rollback the rollback.
> I made dd images of each nodes LVM associated with the resource just in case, but have now accepted my losses so to speak and begun the reconstruction.
> I've been using DRBD since 2005, and although clearly having much to learn, I'd like to think I have a reasonable handle on the fundamentals.
> What I've experienced with the data roll back is both unexpected and unintended.
> I'm still unclear as to how this node came to discard 8 days worth of data, but am very keen to do so.
> If you good people are prepared to guide me further, I'm prepared to do what is necessary at my end to try determine the cause of this.

Go back to your logs, and find the logs that match the incident.

What is the status of that pair of DRBD now?
Is it actually "cs:Connected, UpToDate/UpToDate" ?

Find out when it became so, and how.  Because, again, the logs you
showed previously state that DRBD refused to connect.
If it finally synced up and connected anyway, likely someone told it to
"--discard-my-data" on one of the nodes (or "invalidate" or something to
that effect).
And if that was the side with the data you lost,
well, then that someone told DRBD to throw it away.
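
As a rough aid for the status question above, the following Python
sketch pulls the connection state and disk states out of a /proc/drbd
device line. It assumes the 8.3-style layout "N: cs:... ro:.../...
ds:.../..."; the sample lines are hypothetical, not taken from the
poster's systems, and the function names are made up for illustration.

```python
import re

# Assumed /proc/drbd device line layout (DRBD 8.3 style).
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+ro:(?P<ro_self>[^/\s]+)/(?P<ro_peer>[^/\s]+)"
    r"\s+ds:(?P<ds_self>[^/\s]+)/(?P<ds_peer>[^/\s]+)"
)


def parse_status(line):
    """Return the named fields of one device line, or None if it does not match."""
    match = STATUS_RE.match(line)
    return match.groupdict() if match else None


def in_sync(status):
    """True only for cs:Connected with ds:UpToDate/UpToDate."""
    return (
        status["cs"] == "Connected"
        and status["ds_self"] == "UpToDate"
        and status["ds_peer"] == "UpToDate"
    )


# Hypothetical sample lines, for illustration only.
connected = parse_status(" 9: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----")
standalone = parse_status(" 9: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r----")

print(in_sync(connected), in_sync(standalone))
```

If the pair reports in sync now, the interesting part is still the log
entries from the moment it changed from disconnected to connected.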

: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
please don't Cc me, but send to list   --   I'm subscribed
