[DRBD-user] Enabling two primaries break connection

Thomas Thrainer thomasth at google.com
Fri Apr 26 14:27:40 CEST 2013


Hi,

I've encountered a problem with DRBD 8.4.2 when I enable
--allow-two-primaries on the fly and immediately promote the secondary to
primary afterwards.
The problem does not always occur, and it seems more likely to happen when
there is more load on the device.

The exact command sequence is as follows:

Executed on the primary and secondary nodes simultaneously (the problem
also occurs if the commands are executed only on the secondary):

drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001 --protocol C \
    --after-sb-0pri discard-zero-changes --after-sb-1pri consensus \
    --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
drbdsetup primary 1

BTW, the only option that differs from the previously issued
drbdsetup connect command is --allow-two-primaries. The rest (protocol,
secret, etc.) are just repeated.

The outcome is that both nodes end up in the StandAlone state.
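
For reference, this is roughly how the resulting state can be checked on each
node. It is only a sketch: the helper name drbd_cstate is made up for
illustration, and it assumes the usual 8.4 /proc/drbd layout, where each
device line starts with "<minor>:" and carries a "cs:" field.

```shell
# Hypothetical helper: print the connection state (cs: field) of one minor.
# $1 = minor number, $2 = status file (defaults to /proc/drbd).
# Assumes device lines look like " 1: cs:StandAlone ro:Primary/Unknown ...".
drbd_cstate() {
    awk -v minor="$1" '
        $1 == (minor ":") {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cs:/) { sub(/^cs:/, "", $i); print $i }
        }
    ' "${2:-/proc/drbd}"
}
```

Calling drbd_cstate 1 on either node after the failure would be expected to
report StandAlone for the affected device.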

Their respective kernel log messages are:

(Old) primary:
Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer( Secondary -> Primary )
Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer( Secondary -> Primary )
Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock was shut down by peer
Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short read (expected size 16)
Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new current UUID DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1: asender terminated
Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1: Terminating asender thread
Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1: Connection closed
Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn( BrokenPipe -> Unconnected )
Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1: receiver terminated
Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1: Restarting receiver thread
Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1: receiver (re)started
Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn( Unconnected -> WFConnection )
Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1: Handshake successful: Agreed network protocol version 101
Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer authenticated using 16 bytes HMAC
Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn( WFConnection -> WFReportParams )
Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1: Starting asender thread (from drbd_r_resource [2039])
Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1: drbd_sync_handshake:
Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1: uuid_compare()=100 by rule 90
Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper command: /bin/true initial-split-brain minor-1
Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn( WFReportParams -> NetworkFailure )
Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1: asender terminated
Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1: Terminating asender thread
Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper command: /bin/true split-brain minor-1
Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn( NetworkFailure -> Disconnecting )
Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1: Connection closed
Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn( Disconnecting -> StandAlone )
Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1: receiver terminated
Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1: Terminating receiver thread
Apr 26 11:19:46 primary kernel: [181726.054479] br974: port 2(vif126.0) entering forwarding state
Apr 26 11:19:46 primary kernel: [181726.058824] br974: port 2(vif126.0) entering disabled state

(Old) secondary:
Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role( Secondary -> Primary )
Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role( Secondary -> Primary )
Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new current UUID 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1: asender terminated
Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1: Terminating asender thread
Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1: Connection closed
Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1: conn( ProtocolError -> Unconnected )
Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1: receiver terminated
Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1: Restarting receiver thread
Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1: receiver (re)started
Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1: conn( Unconnected -> WFConnection )
Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1: Handshake successful: Agreed network protocol version 101
Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1: Peer authenticated using 16 bytes HMAC
Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1: conn( WFConnection -> WFReportParams )
Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1: Starting asender thread (from drbd_r_resource [20607])
Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1: drbd_sync_handshake:
Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1: uuid_compare()=100 by rule 90
Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper command: /bin/true initial-split-brain minor-1
Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper command: /bin/true split-brain minor-1
Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1: conn( WFReportParams -> Disconnecting )
Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1: asender terminated
Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1: Terminating asender thread
Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1: Connection closed
Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1: conn( Disconnecting -> StandAlone )
Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1: receiver terminated
Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1: Terminating receiver thread
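
To make the logs above easier to compare, the connection-state transitions
can be pulled out on their own. The following is a rough sketch, not an
official tool: conn_transitions is a made-up helper name, the log file path
is whatever you saved the messages to, and it relies on grep -o, which GNU
and BSD grep provide.

```shell
# Hypothetical helper: list every "conn( A -> B )" transition in a saved
# kernel log, one per line, in order of appearance.
# $1 = path to the log file.
conn_transitions() {
    # Join all lines first so that entries wrapped by a mail client still
    # match, then extract and tidy each transition.
    tr '\n' ' ' < "$1" \
        | grep -o 'conn( *[A-Za-z]* *-> *[A-Za-z]* *)' \
        | sed 's/conn( *//; s/ *)$//; s/  */ /g'
}
```

Running it on each node's log gives one line per state change, which makes
it straightforward to see where the two nodes diverge (BrokenPipe on the old
primary, ProtocolError on the old secondary) before both reach StandAlone.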


What am I doing wrong? Is there a requirement to wait for a
sync/propagation of properties/random amount of time before promoting the
secondary to primary? Is this a bug?

Thanks,
Thomas

-- 
Thomas Thrainer | Software Engineer | thomasth at google.com |

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Katherine Stephens