[DRBD-user] Default Split Brain Behaviour

Lars Ellenberg lars.ellenberg at linbit.com
Thu Jan 27 09:51:39 CET 2011


On Tue, Jan 25, 2011 at 10:36:03AM +1100, Lew wrote:
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905963] block drbd9: drbd_sync_handshake:
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905967] block drbd9: self 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143432 flags:0
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905971] block drbd9: peer 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 flags:0

There. Both nodes have changes the other node did not see (yet):
the first UUID differs, while the remaining ones are identical.
That's where DRBD can detect that the data has previously diverged,
usually because of a cluster split brain.
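
To make that comparison concrete, here is a minimal Python sketch. It is
not DRBD's uuid_compare() code, only an illustration. It assumes the four
colon-separated fields are, in order, the current UUID, the bitmap UUID and
two history UUIDs, and that the lowest bit of each UUID is a flag to be
ignored when comparing. The names parse_uuids and looks_diverged are
hypothetical.

```python
# Illustration only: a simplified reading of the two handshake lines,
# not the logic DRBD itself runs.
SELF = "49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807"
PEER = "6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807"


def parse_uuids(uuid_set):
    """Split a colon-separated UUID set into named integers (assumed field order)."""
    current, bitmap, hist_start, hist_end = (int(f, 16) for f in uuid_set.split(":"))
    return {"current": current, "bitmap": bitmap, "history": (hist_start, hist_end)}


def looks_diverged(mine, theirs):
    """Hypothetical simplification: different current UUIDs on top of the
    same non-zero bitmap UUID means both sides changed independently."""
    mask = ~1  # assumption: the lowest bit is a flag, not part of the identity
    same_current = (mine["current"] & mask) == (theirs["current"] & mask)
    same_bitmap = (mine["bitmap"] & mask) == (theirs["bitmap"] & mask)
    return (not same_current) and same_bitmap and mine["bitmap"] != 0


print(looks_diverged(parse_uuids(SELF), parse_uuids(PEER)))  # prints True
```
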

> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.905975] block drbd9: uuid_compare()=100 by rule 90
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.906273] block drbd9: helper command: /sbin/drbdadm split-brain minor-9
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.937925] block drbd9: conn( WFReportParams -> NetworkFailure ) 
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.937935] block drbd9: asender terminated
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.937938] block drbd9: Terminating asender thread
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.950821] block drbd9: helper command: /sbin/drbdadm split-brain minor-9 exit code 127 (0x7f00)
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.950827] block drbd9: conn( NetworkFailure -> Disconnecting ) 
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951122] block drbd9: Connection closed
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951129] block drbd9: conn( Disconnecting -> StandAlone ) 
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951149] block drbd9: receiver terminated
> Jan 23 22:19:35 emlsurit-v4 kernel: [25910.951151] block drbd9: Terminating receiver thread

Which is detected. DRBD cannot decide which version of your data you'd rather keep,
so the default behaviour is to drop the network connection, and no longer talk to the peer.

But 15 minutes later, you decide to try again to connect them,

> Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487616] block drbd9: conn( StandAlone -> Unconnected ) 
> Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487638] block drbd9: Starting receiver thread (from drbd9_worker [2126])
> Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487690] block drbd9: receiver (re)started
> Jan 23 22:34:37 emlsurit-v4 kernel: [26811.487696] block drbd9: conn( Unconnected -> WFConnection ) 
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.182513] block drbd9: Handshake successful: Agreed network protocol version 91
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.182522] block drbd9: conn( WFConnection -> WFReportParams ) 
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.182539] block drbd9: Starting asender thread (from drbd9_receiver [20045])
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183313] block drbd9: data-integrity-alg: <not-used>
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183340] block drbd9: drbd_sync_handshake:
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183345] block drbd9: self 49615ABF1622FC55:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143799 flags:0
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183349] block drbd9: peer 6116B0558277E470:643454BA1CA67140:5625CFAB3DDD24A2:EA5079D16F8C7807 bits:336381 flags:0


DRBD notices that you still have not decided which version to use,
and we can see that emlsurit-v4 is still being actively modified:
its "bits" count went from 143432 to 143799 since the first attempt
(we cannot be sure about the other node, though).
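
The growth can be read straight off the two "self" lines. A small Python
sketch, assuming the "bits:" field counts the blocks this node has marked
as changed; the helper name bits_of is hypothetical:

```python
import re

# The two "self" handshake lines from the log above, 15 minutes apart.
FIRST = ("block drbd9: self 49615ABF1622FC55:643454BA1CA67140:"
         "5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143432 flags:0")
SECOND = ("block drbd9: self 49615ABF1622FC55:643454BA1CA67140:"
          "5625CFAB3DDD24A2:EA5079D16F8C7807 bits:143799 flags:0")


def bits_of(line):
    """Extract the integer after 'bits:' from a drbd_sync_handshake line."""
    match = re.search(r"\bbits:(\d+)", line)
    if match is None:
        raise ValueError("no bits: field in line")
    return int(match.group(1))


# A positive difference means this node kept writing while StandAlone.
print(bits_of(SECOND) - bits_of(FIRST))  # prints 367
```
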

> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183353] block drbd9: uuid_compare()=100 by rule 90
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.183610] block drbd9: helper command: /sbin/drbdadm split-brain minor-9
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192301] block drbd9: conn( WFReportParams -> NetworkFailure ) 
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192309] block drbd9: asender terminated
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192311] block drbd9: Terminating asender thread
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192702] block drbd9: helper command: /sbin/drbdadm split-brain minor-9 exit code 127 (0x7f00)
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.192709] block drbd9: conn( NetworkFailure -> Disconnecting ) 
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193004] block drbd9: Connection closed
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193012] block drbd9: conn( Disconnecting -> StandAlone ) 

And again, the connection is dropped.

> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193027] block drbd9: receiver terminated
> Jan 23 22:35:04 emlsurit-v4 kernel: [26838.193029] block drbd9: Terminating receiver thread
> Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356300] block drbd9: conn( StandAlone -> Unconnected ) 
> Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356326] block drbd9: Starting receiver thread (from drbd9_worker [2126])
> Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356519] block drbd9: receiver (re)started
> Jan 23 22:35:58 emlsurit-v4 kernel: [26892.356527] block drbd9: conn( Unconnected -> WFConnection ) 


So... my guess is that you still have two versions of your data.

From this log, there was no sync, because DRBD's default behaviour in that
case is to disconnect. Therefore no rollback, and no data loss.
But you certainly have diverging data sets, and my guess is they are
still diverging.

You have to figure out when they started to diverge, and why.
And you have to sort it out, decide which to keep,
and tell DRBD (see the User's Guide for details on this).

Consider booking DRBD Training

	;-)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


