[DRBD-user] Trying to resolve a Split-Brain on stacked resources

Tue Nov 8 06:42:01 CET 2016

Hi everybody,

Configuration:

DRBD 8.4.x, running on SuSE (SLES 11 SP3, SLE 11 SP3 HAE)

Node-1 and Node-2 share a DRBD resource at Site A; Node-1 is the Master.
Node-3 and Node-4 share a DRBD resource at Site B; Node-4 is the Master.

A stacked resource exists between the two sites, on top of the two
masters (typical off-site DR configuration), i.e. the stacked master
resource on Node-4 syncs to the stacked slave resource on Node-1.

Scenario: Replication between Node-4 and Node-3 broke down without any
notice, and enough time passed for the two to drift substantially out
of sync.  Replication between the two stacked resources remained up to
date.

A power hit followed.  Node-4 crashed and was unrecoverable (a
rebuild-level event); Node-3 went down, came back up, and assumed an
operative state, albeit with out-of-date content.  We only realized
that the DRBD resource was out of date after the VM using the device
came up and showed stale drive contents.  The stacked resource, now on
Node-3, attempted to synchronize to the remote site, and Node-1 went
into the StandAlone state.  The VM's content was restored from backup
media, and the StandAlone condition went unaddressed for days.

We are now trying to resolve the issue by invalidating the resource at
Site A (Nodes 1 and 2) and initiating synchronization again.  Following
the instructions in the DRBD 8.4 manual, we attempted to discard the
data of the stacked resource locally on Node-1
(drbdadm connect --discard-my-data --stacked shs0) and reconnect.  No
error messages were reported.  The procedure is listed here:

https://www.drbd.org/en/doc/users-guide-84/s-resolve-split-brain
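
For readers following along, the manual recovery steps from that
section can be sketched as a small dry-run helper.  This is only a
sketch: the function name split_brain_recovery_plan is made up for
illustration, the resource name shs0 and the option placement are taken
from the command quoted above, and nothing is executed, the commands
are only printed.

```python
from typing import Dict, List


def split_brain_recovery_plan(resource: str, stacked: bool = False) -> Dict[str, List[List[str]]]:
    """Build (but do not run) the drbdadm calls for manual split-brain recovery.

    Hypothetical helper for illustration; it only assembles argument lists.
    """
    extra = ["--stacked"] if stacked else []

    def cmd(*args: str) -> List[str]:
        # Same option placement as the command quoted in the post.
        return ["drbdadm", *args, *extra, resource]

    return {
        # Node whose changes are discarded (Node-1 in this thread).
        "victim": [
            cmd("disconnect"),
            cmd("secondary"),
            cmd("connect", "--discard-my-data"),
        ],
        # Node whose data is kept (Node-3 here); needed if it is StandAlone too.
        "survivor": [
            cmd("connect"),
        ],
    }


if __name__ == "__main__":
    plan = split_brain_recovery_plan("shs0", stacked=True)
    for role, commands in plan.items():
        for argv in commands:
            print(f"{role}: {' '.join(argv)}")
```

Building the argument lists first makes it easy to review exactly what
would run on each node before touching a production resource.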

When trying to connect from Site B, Node-3 goes into the WFConnection
state, waiting for us to initiate the connection on the stacked
resource on Node-1.  The syslog reports:

[code]
Oct 21 15:37:40 sannode-3 kernel: [4437224.524701] drbd drbdshs0: conn( StandAlone -> Unconnected )
Oct 21 15:37:40 sannode-3 kernel: [4437224.524726] drbd drbdshs0: Starting receiver thread (from drbd_w_drbdshs0 [6384])
Oct 21 15:37:40 sannode-3 kernel: [4437224.524759] drbd drbdshs0: receiver (re)started
Oct 21 15:37:40 sannode-3 kernel: [4437224.524768] drbd drbdshs0: conn( Unconnected -> WFConnection )
[/code]

Node-3 is now in the WFConnection state; I then start DRBD on the
remote slave (Node-1)...

[code]
Oct 21 15:37:44 sannode-3 kernel: [4437228.529771] drbd drbdshs0: Handshake successful: Agreed network protocol version 101
Oct 21 15:37:44 sannode-3 kernel: [4437228.529776] drbd drbdshs0: Agreed to support TRIM on protocol level
Oct 21 15:37:44 sannode-3 kernel: [4437228.529813] drbd drbdshs0: conn( WFConnection -> WFReportParams )
Oct 21 15:37:44 sannode-3 kernel: [4437228.529817] drbd drbdshs0: Starting asender thread (from drbd_r_drbdshs0 [31747])
Oct 21 15:37:44 sannode-3 kernel: [4437228.570124] drbd drbdshs0: meta connection shut down by peer.
Oct 21 15:37:44 sannode-3 kernel: [4437228.570136] drbd drbdshs0: conn( WFReportParams -> NetworkFailure )
Oct 21 15:37:44 sannode-3 kernel: [4437228.570139] drbd drbdshs0: asender terminated
Oct 21 15:37:44 sannode-3 kernel: [4437228.570142] drbd drbdshs0: Terminating drbd_a_drbdshs0
Oct 21 15:37:44 sannode-3 kernel: [4437228.596058] drbd drbdshs0: Connection closed
Oct 21 15:37:44 sannode-3 kernel: [4437228.596069] drbd drbdshs0: conn( NetworkFailure -> Unconnected )
Oct 21 15:37:44 sannode-3 kernel: [4437228.596071] drbd drbdshs0: receiver terminated
Oct 21 15:37:44 sannode-3 kernel: [4437228.596074] drbd drbdshs0: Restarting receiver thread
Oct 21 15:37:44 sannode-3 kernel: [4437228.596075] drbd drbdshs0: receiver (re)started
Oct 21 15:37:44 sannode-3 kernel: [4437228.596082] drbd drbdshs0: conn( Unconnected -> WFConnection )
[/code]
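
The connection-state changes in the Node-3 excerpts above are easier to
follow when pulled out of the surrounding kernel chatter.  A minimal
sketch, assuming the transitions always appear as "conn( Old -> New )"
as they do in these excerpts; the helper names and the SAMPLE constant
are illustrative only:

```python
import re
from typing import List, Tuple

# Matches the "conn( Old -> New )" fragments seen in the kernel log.
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")

# Transition lines copied from the Node-3 excerpt, wrapped as in the post.
SAMPLE = """\
Oct 21 15:37:44 sannode-3 kernel: [4437228.529813] drbd drbdshs0:
conn( WFConnection -> WFReportParams )
Oct 21 15:37:44 sannode-3 kernel: [4437228.570136] drbd drbdshs0:
conn( WFReportParams -> NetworkFailure )
Oct 21 15:37:44 sannode-3 kernel: [4437228.596069] drbd drbdshs0:
conn( NetworkFailure -> Unconnected )
Oct 21 15:37:44 sannode-3 kernel: [4437228.596082] drbd drbdshs0:
conn( Unconnected -> WFConnection )
"""


def conn_transitions(log_text: str) -> List[Tuple[str, str]]:
    """Return (old_state, new_state) pairs in the order they were logged."""
    return CONN_RE.findall(log_text)


def state_path(transitions: List[Tuple[str, str]]) -> List[str]:
    """Collapse transitions into the sequence of states visited."""
    if not transitions:
        return []
    path = [transitions[0][0]]
    for old, new in transitions:
        if old != path[-1]:
            # Gap in the log: record the state we evidently jumped to.
            path.append(old)
        path.append(new)
    return path


if __name__ == "__main__":
    print(" -> ".join(state_path(conn_transitions(SAMPLE))))
```

Because the pattern only looks at the "conn( ... )" fragment, it works
whether or not the mail client has wrapped the log lines.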

This is where communication halts.  Node-1 shows the following:

[code]
Oct 21 15:37:44 sannode-1 kernel: [32160259.504298] drbd drbdshs0: Handshake successful: Agreed network protocol version 101
Oct 21 15:37:44 sannode-1 kernel: [32160259.504301] drbd drbdshs0: Agreed to support TRIM on protocol level
Oct 21 15:37:44 sannode-1 kernel: [32160259.504323] drbd drbdshs0: conn( WFConnection -> WFReportParams )
Oct 21 15:37:44 sannode-1 kernel: [32160259.504326] drbd drbdshs0: Starting asender thread (from drbd_r_drbdshs0 [30891])
Oct 21 15:37:44 sannode-1 kernel: [32160259.544221] block drbd11: drbd_sync_handshake:
Oct 21 15:37:44 sannode-1 kernel: [32160259.544228] block drbd11: self 81C3555EEBB82B58:0000000000000000:2EBCE93D6234FA60:0880A6A6D6570EC8 bits:33806302 flags:0
Oct 21 15:37:44 sannode-1 kernel: [32160259.544233] block drbd11: peer 1AF1273B67FCF78D:9CB8F03A760C7A00:CD2C6818B3E08A20:CD2B6818B3E08A20 bits:33806302 flags:2
Oct 21 15:37:44 sannode-1 kernel: [32160259.544238] block drbd11: uuid_compare()=-1000 by rule 100
Oct 21 15:37:44 sannode-1 kernel: [32160259.544240] block drbd11: Unrelated data, aborting!
[/code]

Here the UUID comparison takes place, and unrelated data is
detected...

[code]
Oct 21 15:37:44 sannode-1 kernel: [32160259.544253] drbd drbdshs0: conn( WFReportParams -> Disconnecting )
Oct 21 15:37:44 sannode-1 kernel: [32160259.544256] drbd drbdshs0: error receiving ReportState, e: -5 l: 0!
Oct 21 15:37:44 sannode-1 kernel: [32160259.544267] drbd drbdshs0: asender terminated
Oct 21 15:37:44 sannode-1 kernel: [32160259.544271] drbd drbdshs0: Terminating drbd_a_drbdshs0
Oct 21 15:37:44 sannode-1 kernel: [32160259.544339] drbd drbdshs0: Connection closed
Oct 21 15:37:44 sannode-1 kernel: [32160259.544347] drbd drbdshs0: conn( Disconnecting -> StandAlone )
Oct 21 15:37:44 sannode-1 kernel: [32160259.544350] drbd drbdshs0: receiver terminated
Oct 21 15:37:44 sannode-1 kernel: [32160259.544352] drbd drbdshs0: Terminating drbd_r_drbdshs0
[/code]

...and Node-1 cycles back to StandAlone mode.
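
The "Unrelated data" verdict in the Node-1 log can be checked by hand:
none of the four UUIDs reported for "self" appears among the four
reported for "peer".  The sketch below does that comparison.  It is a
simplification, not DRBD's actual uuid_compare() logic, and it rests on
two assumptions stated in the comments: that the lowest bit of each
UUID is a flag rather than part of the identifier, and that an all-zero
slot is unused.  The function names are illustrative.

```python
import re
from typing import Dict, List, Set

# "self"/"peer" followed by four colon-separated 64-bit hex UUIDs.
UUID_LINE_RE = re.compile(r"\b(self|peer)\s+((?:[0-9A-F]{16}:){3}[0-9A-F]{16})")

# The two UUID lines from the Node-1 excerpt, wrapped as in the post.
SAMPLE = """\
Oct 21 15:37:44 sannode-1 kernel: [32160259.544228] block drbd11: self
81C3555EEBB82B58:0000000000000000:2EBCE93D6234FA60:0880A6A6D6570EC8
bits:33806302 flags:0
Oct 21 15:37:44 sannode-1 kernel: [32160259.544233] block drbd11: peer
1AF1273B67FCF78D:9CB8F03A760C7A00:CD2C6818B3E08A20:CD2B6818B3E08A20
bits:33806302 flags:2
"""


def parse_uuids(log_text: str) -> Dict[str, List[int]]:
    """Extract the UUID sets logged for 'self' and 'peer' as integers."""
    found: Dict[str, List[int]] = {}
    for who, uuids in UUID_LINE_RE.findall(log_text):
        found[who] = [int(u, 16) for u in uuids.split(":")]
    return found


def shared_generations(self_uuids: List[int], peer_uuids: List[int]) -> Set[int]:
    """Return UUIDs present on both sides.

    Assumptions: the lowest bit is a flag and is masked off before
    comparing; zero marks an unused slot and never counts as a match.
    """
    mask = 0xFFFFFFFFFFFFFFFE
    mine = {u & mask for u in self_uuids if u & mask}
    theirs = {u & mask for u in peer_uuids if u & mask}
    return mine & theirs


if __name__ == "__main__":
    uuids = parse_uuids(SAMPLE)
    common = shared_generations(uuids["self"], uuids["peer"])
    if common:
        print("shared:", ", ".join(f"{u:016X}" for u in sorted(common)))
    else:
        print("no UUID in common between self and peer")
```

With no generation identifier in common, the two sides have no shared
history to resync from, which is consistent with the handshake being
aborted before any discard-my-data decision is considered.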

Node-1 is apparently finding the split-brain, presumably on the
underlying DRBD resource, and terminating the resync, despite being
told to discard its data on the stacked resource.  We also attempted
invalidate-remote from Node-3 to Node-1, but that did not work either.
I'm assuming at this point that the only way to resolve this is to wipe
the first few hundred bits of the metadata stored on the stacked
resource on Node-1, wipe the metadata on the underlying resource
between Node-1 and Node-2 as well, and create new metadata from
scratch.  The operative word here is "assume": I'm looking for
confirmation and guidance on this assumption.  Please comment at your
earliest convenience.  If there are other approaches, I am happy to
entertain them.

Regards,
-- 

Elliott R. Scott
Scott Solutions LLC
IT Consulting & Support

P. O. Box 203
Liberty Hill, TX
(508) 451-8227
http://www.scottsolutions.us
Harness the Power of Scott Solutions