Hi everybody,

Configuration: DRBD 8.4.x, running on SuSE (SLES 11 SP3, SLE 11 SP3 HAE).

Node-1 and Node-2 share a DRBD resource at Site A; Node-1 is the Master.
Node-3 and Node-4 share a DRBD resource at Site B; Node-4 is the Master.
A stacked resource exists between the two sites on the two masters (a typical off-site DR configuration), i.e. the stacked master resource on Node-4 syncs to the stacked slave resource on Node-1.

Scenario:

Replication between Node-4 and Node-3 broke down without any notice, and enough time passed for the two to get substantially out of sync. Replication between the two stacked resources remained up to date.

We then experienced a power hit. Node-4 crashed and became unrecoverable (a rebuild-level event). Node-3 went down, came back up, and assumed an operative state, albeit with "out-of-date" content. We only realized that the DRBD resource was out of date after the VM using the device came up and showed its drives to be out of date.

The stacked resource, now on Node-3, attempted to synchronize to the remote site, and Node-1 went into a StandAlone state. Content for the VM was restored from backup media, and the StandAlone condition went unaddressed for days.

We are now trying to resolve the issue by invalidating the resource at Site A (Nodes 1 and 2) and initiating synchronization again. Following the instructions in the DRBD 8.4 manual, we attempted to invalidate the stacked resource (discard-my-data) locally on Node-1 (drbdadm connect --discard-my-data --stacked shs0) and reconnect. No error messages were reported. The procedure is listed here:

https://www.drbd.org/en/doc/users-guide-84/s-resolve-split-brain

When we try to connect from Site B, Node-3 goes into WFConnection mode, waiting for us to initiate the connection on the stacked resource on Node-1.

The syslog reports:

[code]
Oct 21 15:37:40 sannode-3 kernel: [4437224.524701] drbd drbdshs0: conn( StandAlone -> Unconnected )
Oct 21 15:37:40 sannode-3 kernel: [4437224.524726] drbd drbdshs0: Starting receiver thread (from drbd_w_drbdshs0 [6384])
Oct 21 15:37:40 sannode-3 kernel: [4437224.524759] drbd drbdshs0: receiver (re)started
Oct 21 15:37:40 sannode-3 kernel: [4437224.524768] drbd drbdshs0: conn( Unconnected -> WFConnection )
[/code]

Node-3 is now in WFConnection state; I then start up DRBD on the remote slave...

[code]
Oct 21 15:37:44 sannode-3 kernel: [4437228.529771] drbd drbdshs0: Handshake successful: Agreed network protocol version 101
Oct 21 15:37:44 sannode-3 kernel: [4437228.529776] drbd drbdshs0: Agreed to support TRIM on protocol level
Oct 21 15:37:44 sannode-3 kernel: [4437228.529813] drbd drbdshs0: conn( WFConnection -> WFReportParams )
Oct 21 15:37:44 sannode-3 kernel: [4437228.529817] drbd drbdshs0: Starting asender thread (from drbd_r_drbdshs0 [31747])
Oct 21 15:37:44 sannode-3 kernel: [4437228.570124] drbd drbdshs0: meta connection shut down by peer.
Oct 21 15:37:44 sannode-3 kernel: [4437228.570136] drbd drbdshs0: conn( WFReportParams -> NetworkFailure )
Oct 21 15:37:44 sannode-3 kernel: [4437228.570139] drbd drbdshs0: asender terminated
Oct 21 15:37:44 sannode-3 kernel: [4437228.570142] drbd drbdshs0: Terminating drbd_a_drbdshs0
Oct 21 15:37:44 sannode-3 kernel: [4437228.596058] drbd drbdshs0: Connection closed
Oct 21 15:37:44 sannode-3 kernel: [4437228.596069] drbd drbdshs0: conn( NetworkFailure -> Unconnected )
Oct 21 15:37:44 sannode-3 kernel: [4437228.596071] drbd drbdshs0: receiver terminated
Oct 21 15:37:44 sannode-3 kernel: [4437228.596074] drbd drbdshs0: Restarting receiver thread
Oct 21 15:37:44 sannode-3 kernel: [4437228.596075] drbd drbdshs0: receiver (re)started
Oct 21 15:37:44 sannode-3 kernel: [4437228.596082] drbd drbdshs0: conn( Unconnected -> WFConnection )
[/code]

This is where you can see the communication halt: the peer shuts down the meta connection, and Node-3 drops back to WFConnection.
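
To make the sequence in the Node-3 excerpt above easier to follow, here is a minimal Python sketch that pulls the `conn( A -> B )` transitions out of the syslog text and chains them into a single state path. The sample lines are copied from the excerpt; the function names `conn_transitions` and `state_path` are hypothetical helpers for illustration, not part of any DRBD tooling.

```python
import re

# Sample lines copied from the Node-3 syslog excerpt (only the conn() lines).
LOG = """\
Oct 21 15:37:40 sannode-3 kernel: [4437224.524701] drbd drbdshs0: conn( StandAlone -> Unconnected )
Oct 21 15:37:40 sannode-3 kernel: [4437224.524768] drbd drbdshs0: conn( Unconnected -> WFConnection )
Oct 21 15:37:44 sannode-3 kernel: [4437228.529813] drbd drbdshs0: conn( WFConnection -> WFReportParams )
Oct 21 15:37:44 sannode-3 kernel: [4437228.570136] drbd drbdshs0: conn( WFReportParams -> NetworkFailure )
Oct 21 15:37:44 sannode-3 kernel: [4437228.596069] drbd drbdshs0: conn( NetworkFailure -> Unconnected )
Oct 21 15:37:44 sannode-3 kernel: [4437228.596082] drbd drbdshs0: conn( Unconnected -> WFConnection )
"""

# Matches e.g. "drbd drbdshs0: conn( StandAlone -> Unconnected )".
CONN_RE = re.compile(r"drbd (\S+): conn\( (\w+) -> (\w+) \)")


def conn_transitions(text):
    """Return (resource, old_state, new_state) tuples in log order."""
    return [m.groups() for m in CONN_RE.finditer(text)]


def state_path(text):
    """Chain the transitions into one list of connection states."""
    transitions = conn_transitions(text)
    if not transitions:
        return []
    # Start with the first old state, then append every new state.
    return [transitions[0][1]] + [new for _, _, new in transitions]


if __name__ == "__main__":
    print(" -> ".join(state_path(LOG)))
```

Run against the excerpt, the path begins at StandAlone and ends back at WFConnection, which matches the retry loop described in the post.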

Node-1 shows the following:

[code]
Oct 21 15:37:44 sannode-1 kernel: [32160259.504298] drbd drbdshs0: Handshake successful: Agreed network protocol version 101
Oct 21 15:37:44 sannode-1 kernel: [32160259.504301] drbd drbdshs0: Agreed to support TRIM on protocol level
Oct 21 15:37:44 sannode-1 kernel: [32160259.504323] drbd drbdshs0: conn( WFConnection -> WFReportParams )
Oct 21 15:37:44 sannode-1 kernel: [32160259.504326] drbd drbdshs0: Starting asender thread (from drbd_r_drbdshs0 [30891])
Oct 21 15:37:44 sannode-1 kernel: [32160259.544221] block drbd11: drbd_sync_handshake:
Oct 21 15:37:44 sannode-1 kernel: [32160259.544228] block drbd11: self 81C3555EEBB82B58:0000000000000000:2EBCE93D6234FA60:0880A6A6D6570EC8 bits:33806302 flags:0
Oct 21 15:37:44 sannode-1 kernel: [32160259.544233] block drbd11: peer 1AF1273B67FCF78D:9CB8F03A760C7A00:CD2C6818B3E08A20:CD2B6818B3E08A20 bits:33806302 flags:2
Oct 21 15:37:44 sannode-1 kernel: [32160259.544238] block drbd11: uuid_compare()=-1000 by rule 100
Oct 21 15:37:44 sannode-1 kernel: [32160259.544240] block drbd11: Unrelated data, aborting!
[/code]

You can see here that the comparison check takes place and unrelated data is discovered...

[code]
Oct 21 15:37:44 sannode-1 kernel: [32160259.544253] drbd drbdshs0: conn( WFReportParams -> Disconnecting )
Oct 21 15:37:44 sannode-1 kernel: [32160259.544256] drbd drbdshs0: error receiving ReportState, e: -5 l: 0!
Oct 21 15:37:44 sannode-1 kernel: [32160259.544267] drbd drbdshs0: asender terminated
Oct 21 15:37:44 sannode-1 kernel: [32160259.544271] drbd drbdshs0: Terminating drbd_a_drbdshs0
Oct 21 15:37:44 sannode-1 kernel: [32160259.544339] drbd drbdshs0: Connection closed
Oct 21 15:37:44 sannode-1 kernel: [32160259.544347] drbd drbdshs0: conn( Disconnecting -> StandAlone )
Oct 21 15:37:44 sannode-1 kernel: [32160259.544350] drbd drbdshs0: receiver terminated
Oct 21 15:37:44 sannode-1 kernel: [32160259.544352] drbd drbdshs0: Terminating drbd_r_drbdshs0
[/code]

...and Node-1 cycles back to StandAlone mode.

Node-1 is apparently detecting the split-brain, presumably on the underlying DRBD resource, and aborting the resync, despite being told to discard its data on the stacked resource.

We also attempted an "Invalidate-Remote" from Node-3 to Node-1, but that did not work either.

I'm assuming at this point that the only way to resolve this is to wipe out the first few hundred bits of the metadata stored on the stacked resource on Node-1, wipe the metadata on the underlying resource between Node-1 and Node-2 as well, and create new metadata from scratch. The key word here is "assume".

I'm looking for confirmation and guidance on this assumption. Please comment at your earliest convenience. If there are other approaches, I am happy to entertain them.

Regards,

--
Elliott R. Scott
Scott Solutions LLC
IT Consulting & Support
P. O. Box 203
Liberty Hill, TX
(508) 451-8227
http://www.scottsolutions.us
Harness the Power of Scott Solutions
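
As a footnote to the `drbd_sync_handshake` lines in the Node-1 log, the following Python sketch parses the `self` and `peer` UUID sets and checks whether the two sides have any generation identifier in common. It is a simplified illustration, not DRBD's actual `uuid_compare()` logic. Ignoring the lowest bit of each UUID and skipping all-zero slots are assumptions on my part, and `parse_uuids` and `shared_uuids` are hypothetical helper names.

```python
import re

# The two UUID lines copied from the Node-1 syslog excerpt.
LOG = """\
block drbd11: self 81C3555EEBB82B58:0000000000000000:2EBCE93D6234FA60:0880A6A6D6570EC8 bits:33806302 flags:0
block drbd11: peer 1AF1273B67FCF78D:9CB8F03A760C7A00:CD2C6818B3E08A20:CD2B6818B3E08A20 bits:33806302 flags:2
"""

# "self" or "peer" followed by four colon-separated 64-bit hex values.
UUID_RE = re.compile(r"\b(self|peer) ((?:[0-9A-Fa-f]{16}:){3}[0-9A-Fa-f]{16})")

# Assumption: the lowest bit is treated as a flag and ignored when comparing.
LOW_BIT_MASK = 0xFFFFFFFFFFFFFFFE


def parse_uuids(text):
    """Return {'self': [...], 'peer': [...]} with UUIDs as integers."""
    result = {}
    for who, uuids in UUID_RE.findall(text):
        result[who] = [int(u, 16) for u in uuids.split(":")]
    return result


def shared_uuids(self_uuids, peer_uuids):
    """Return UUIDs (low bit masked) present on both sides, ignoring zero."""
    mine = {u & LOW_BIT_MASK for u in self_uuids} - {0}
    theirs = {u & LOW_BIT_MASK for u in peer_uuids} - {0}
    return mine & theirs


if __name__ == "__main__":
    uuids = parse_uuids(LOG)
    common = shared_uuids(uuids["self"], uuids["peer"])
    if common:
        print("shared:", ", ".join(f"{u:016X}" for u in sorted(common)))
    else:
        print("no UUID in common between self and peer")
```

For the two lines in the log, no value appears on both sides, which is consistent with the "Unrelated data, aborting!" message.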