[DRBD-user] Not able to test Automatic split brain recovery policies

Shailesh Vaidya shailesh_vaidya at persistent.co.in
Fri Apr 12 13:24:47 CEST 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

OK. I guess fencing would be the best option to avoid split-brain. However, my current priority is to recover from split-brain automatically, without manual intervention.
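
For context, the automatic recovery policies in question live in the resource's net section. A minimal sketch of what such a policy can look like (the resource name r0 and the specific choices below are only illustrative, not a recommendation):

resource r0 {
  net {
    # neither node is Primary when they reconnect:
    # if only one side wrote data, sync from it, otherwise disconnect
    after-sb-0pri discard-zero-changes;
    # exactly one node is Primary: discard the Secondary's modifications
    after-sb-1pri discard-secondary;
    # both nodes are Primary: do not auto-resolve, just disconnect
    after-sb-2pri disconnect;
  }
}

After editing the configuration, running "drbdadm adjust r0" on both nodes applies the changed net options.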

During my test, when the Primary goes down and the Secondary takes over, DRBD senses the split-brain on reconnect but drops the connection:

Apr 12 07:06:31 localhost kernel: block drbd0: Starting receiver thread (from drbd0_worker [5181])
Apr 12 07:06:31 localhost kernel: block drbd0: receiver (re)started
Apr 12 07:06:31 localhost kernel: block drbd0: conn( Unconnected -> WFConnection )
Apr 12 07:06:31 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94
Apr 12 07:06:31 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams )
Apr 12 07:06:31 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [11349])
Apr 12 07:06:31 localhost kernel: block drbd0: data-integrity-alg: <not-used>
Apr 12 07:06:31 localhost kernel: block drbd0: drbd_sync_handshake:
Apr 12 07:06:31 localhost kernel: block drbd0: self EF24AB43D644F478:4538A0E8F6C64A68:EE8B2B59973F3C4F:CC7A9FF240AA59C1 bits:0 flags:0
Apr 12 07:06:31 localhost kernel: block drbd0: peer DA286F4E1F9A88B3:4538A0E8F6C64A69:EE8B2B59973F3C4F:CC7A9FF240AA59C1 bits:1 flags:0
Apr 12 07:06:31 localhost kernel: block drbd0: uuid_compare()=100 by rule 90
Apr 12 07:06:31 localhost kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
Apr 12 07:06:31 localhost kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Apr 12 07:06:31 localhost kernel: block drbd0: Split-Brain detected but unresolved, dropping connection!
Apr 12 07:06:31 localhost kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0
Apr 12 07:06:31 localhost kernel: block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
Apr 12 07:06:31 localhost kernel: block drbd0: conn( WFReportParams -> Disconnecting )
Apr 12 07:06:31 localhost kernel: block drbd0: error receiving ReportState, l: 4!
Apr 12 07:06:31 localhost kernel: block drbd0: asender terminated
Apr 12 07:06:31 localhost kernel: block drbd0: Terminating asender thread
Apr 12 07:06:31 localhost kernel: block drbd0: Connection closed
Apr 12 07:06:31 localhost kernel: block drbd0: conn( Disconnecting -> StandAlone )

However, if I reconnect manually, the split-brain is detected and resolved:

Apr 12 07:10:37 localhost kernel: block drbd0: Handshake successful: Agreed network protocol version 94
Apr 12 07:10:37 localhost kernel: block drbd0: conn( WFConnection -> WFReportParams )
Apr 12 07:10:37 localhost kernel: block drbd0: Starting asender thread (from drbd0_receiver [11369])
Apr 12 07:10:37 localhost kernel: block drbd0: data-integrity-alg: <not-used>
Apr 12 07:10:37 localhost kernel: block drbd0: drbd_sync_handshake:
Apr 12 07:10:37 localhost kernel: block drbd0: self EF24AB43D644F478:4538A0E8F6C64A68:EE8B2B59973F3C4F:CC7A9FF240AA59C1 bits:0 flags:0
Apr 12 07:10:37 localhost kernel: block drbd0: peer DA286F4E1F9A88B3:4538A0E8F6C64A69:EE8B2B59973F3C4F:CC7A9FF240AA59C1 bits:1 flags:0
Apr 12 07:10:37 localhost kernel: block drbd0: uuid_compare()=100 by rule 90
Apr 12 07:10:37 localhost kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0
Apr 12 07:10:37 localhost kernel: block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
Apr 12 07:10:37 localhost kernel: block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from peer node
Apr 12 07:10:37 localhost kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Apr 12 07:10:37 localhost kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
Apr 12 07:10:37 localhost kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Apr 12 07:10:37 localhost kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Apr 12 07:10:37 localhost kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Apr 12 07:10:37 localhost kernel: block drbd0: Began resync as SyncTarget (will sync 4 KB [1 bits set]).
Apr 12 07:10:37 localhost kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
Apr 12 07:10:37 localhost kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )

How can I handle this automatically?

Regards,
Shailesh Vaidya


-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Prater, James K.
Sent: Friday, April 12, 2013 3:33 PM
To: 'lists at alteeve.ca'; 'dbarker at visioncomm.net'
Cc: 'drbd-user at lists.linbit.com'
Subject: Re: [DRBD-user] Not able to test Automatic split brain recovery policies

I have found that it is better to fence and not allow the bad peer to rejoin after a reboot until the original issue is determined. Auto-rejoin can cause repeated flip-flopping by way of reboot cycles. Best is to fence, reboot, analyze, rejoin; of course, this requires human intervention.

James

----- Original Message -----
From: Digimer [mailto:lists at alteeve.ca]
Sent: Thursday, April 11, 2013 02:43 PM
To: Dan Barker <dbarker at visioncomm.net>
Cc: drbd-user at lists.linbit.com <drbd-user at lists.linbit.com>
Subject: Re: [DRBD-user] Not able to test Automatic split brain recovery policies

On 04/11/2013 08:27 AM, Dan Barker wrote:
>> -----Original Message-----
>> From: Shailesh Vaidya [mailto:shailesh_vaidya at persistent.co.in]
>> Sent: Thursday, April 11, 2013 1:50 AM
>> To: Digimer
>> Cc: Dan Barker; drbd-user at lists.linbit.com
>> Subject: RE: [DRBD-user] Not able to test Automatic split brain 
>> recovery policies
>>
>> Hi Digimer,
>>
>> Thanks for help and explanation. I will try it out fencing option.
>>
>> However, I would like to validate if what I am testing for 
>> split-brain is correct or not. Also what could be done for simple 
>> split-brain auto- recovery through configuration without fencing.
>>
>
> There is no "simple split-brain" recovery. Split Brain only occurs after an error of some sort causing two different nodes to write to the same resource while disconnected. Anything other than manual recovery of files or blocks will lose data. In many cases, it's not even possible to determine what data is being lost or how to recover it. You just have to pick the lesser of two evils and move forward, honoring the writes to one node and discarding the writes done on the other. Most applications and file systems react poorly to having writes of theirs discarded.
>
> Any effort spent automating split-brain recovery would be better spent identifying how your configuration allowed the split brain to happen in the first place, usually dual-primary without sufficient controls in place to prevent it.
>
> ymmv
>
> Dan

To build on Dan's comments;

Automatic split-brain recovery where both nodes were StandAlone and Primary is not possible. Consider this;

Say you want to recover by discarding the node with the least changes;

* Node 1 has an easily replaceable ISO written to it.
* Node 2 has accounting data written to it.

A human would obviously know to discard Node 1, but "least changes" would cause Node 2 to get overwritten.

Say you want to recover by discarding the oldest changes; repeat the above example, but say that the accounting data was recorded an hour before the ISO was written. No better.

The only safe way to recover from a split-brain is to bring up the node you want to invalidate in StandAlone, mount the DRBD-backed FS or VM, back up all the data somewhere else, invalidate the node, connect it to the still-UpToDate node so that resynchronization begins, and then manually merge the just-backed-up data into the now-resyncing DRBD-backed data.
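
As a rough sketch of the drbdadm side of that procedure, assuming a resource called r0 (8.3-style option ordering shown; 8.4 accepts "drbdadm connect --discard-my-data r0" instead):

# On the node whose data you have decided to throw away (the victim),
# after backing up anything you want to keep:
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# On the surviving, UpToDate node, only if it is sitting in StandAlone:
drbdadm connect r0

The victim then resyncs from the survivor, after which you can merge the backed-up data back in by hand.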

This procedure is clumsy, prone to human error, and might well be very difficult or impossible, depending on the type of data stored on the DRBD resource.

*By far* the better option is to do everything you can to avoid a split-brain in the first place.

To test that you have accomplished that;

Set up fencing and then repeat your tests where you break the network connection. You should then see one node get rebooted and the remaining node continue. Once the fenced node powers back up, it should rejoin the good node without complaining about a split-brain. So if the rebooted node automatically rejoins, you know your configuration is working properly.
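
As a sketch of the DRBD side of that, assuming a Pacemaker-managed cluster and a resource called r0 (the right fence-peer helper depends on your cluster manager, so treat this as an example rather than a drop-in config):

resource r0 {
  disk {
    # freeze I/O and call the fence-peer handler; resume only once
    # the peer has been fenced or outdated
    fencing resource-and-stonith;
  }
  handlers {
    # invoked when the peer becomes unreachable; constrains/outdates
    # the peer through the cluster manager
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # invoked after a successful resync; lifts that constraint again
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}

The cluster manager's stonith devices still have to be configured and tested separately; the handlers above only tie DRBD into them.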

Another good test is to crash each node using 'echo c > /proc/sysrq-trigger'. You should see that the healthy node reboots the other node. If you have used a delay against a node, you should be able to see the difference in recovery time doing this test as well.

digimer

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user



