[DRBD-user] Enabling two primaries break connection

Thomas Thrainer thomasth at google.com
Fri Apr 26 16:14:35 CEST 2013


Hi Luca,

(CC'd drbd-user, I guess that might be helpful for others as well)

We're not using drbdadm but drbdsetup directly.

I tried `drbdsetup net-options ipv4:<local_ip>:11001 ipv4:<remote_ip>:11001
--protocol C --allow-two-primaries=yes` (i.e. I stripped the repeated
options), but the result is still the same.
Note, however, that the problem occurs only intermittently, and
primarily when there is load on the disk(s).

BTW, I actually do set two disks to dual-primary mode at the same time
(using different connections/resources, though), and one disk normally works
while the other fails (it's not deterministic which disk fails).
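
For anyone scripting around this, here is a minimal sketch of how the
StandAlone outcome could be detected. It assumes the usual /proc/drbd layout
of DRBD 8.4, where each device line starts with the minor number followed by
cs:<state>. The function names connection_states and standalone_minors are
hypothetical and not part of any DRBD tooling.

```python
import re

# Assumed /proc/drbd device line layout (DRBD 8.4), for example:
#  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
_DEVICE_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")


def connection_states(proc_drbd_text):
    """Return {minor: connection state} parsed from /proc/drbd-style text."""
    states = {}
    for line in proc_drbd_text.splitlines():
        match = _DEVICE_LINE.match(line)
        if match:
            states[int(match.group(1))] = match.group(2)
    return states


def standalone_minors(proc_drbd_text):
    """Return the sorted minors whose connection state is StandAlone."""
    states = connection_states(proc_drbd_text)
    return sorted(minor for minor, cs in states.items() if cs == "StandAlone")


# Typical use on a node (hypothetical caller):
#   with open("/proc/drbd") as handle:
#       broken = standalone_minors(handle.read())
```

Calling standalone_minors() after the net-options/primary sequence on each
node would show which of the two disks lost its connection in a given run.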

Cheers,
Thomas


On Fri, Apr 26, 2013 at 3:58 PM, Luca Fornasari <luca.fornasari at gmail.com> wrote:

> Hi Thomas,
>
> Just execute the following on one node only:
>
> drbdadm net-options --protocol=C --allow-two-primaries r0
>
> I guess that the command you are issuing just tries to restart an already
> running resource.
>
> Cheers,
> Luca
>
>
> On Fri, Apr 26, 2013 at 2:27 PM, Thomas Thrainer <thomasth at google.com> wrote:
>
>> Hi,
>>
>> I've encountered a problem with DRBD 8.4.2 when I try to enable
>> --allow-two-primaries on the fly and immediately promote the secondary to
>> primary afterwards.
>> The problem doesn't always occur, and it seems more likely to
>> happen when there is more load on the device.
>>
>> The exact command sequence is as follows:
>>
>> Executed on primary and secondary node simultaneously (but also happens
>> if only executed on secondary):
>>
>> drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001 --protocol
>> C --after-sb-0pri discard-zero-changes --after-sb-1pri consensus
>> --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
>> drbdsetup primary 1
>>
>> BTW, the only option which differs from the previously issued
>> drbdsetup connect command is --allow-two-primaries. The rest (protocol,
>> secret, etc.) is just repeated.
>>
>> The outcome is that both nodes end up in the StandAlone state.
>>
>> Their respective kernel log messages are:
>>
>> (Old) primary:
>> Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer(
>> Secondary -> Primary )
>> Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer(
>> Secondary -> Primary )
>> Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock was
>> shut down by peer
>> Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer(
>> Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate ->
>> DUnknown )
>> Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short
>> read (expected size 16)
>> Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new current
>> UUID DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>> Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1: asender
>> terminated
>> Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1:
>> Terminating asender thread
>> Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1:
>> Connection closed
>> Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn(
>> BrokenPipe -> Unconnected )
>> Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1: receiver
>> terminated
>> Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1:
>> Restarting receiver thread
>> Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1: receiver
>> (re)started
>> Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn(
>> Unconnected -> WFConnection )
>> Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1:
>> Handshake successful: Agreed network protocol version 101
>> Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer
>> authenticated using 16 bytes HMAC
>> Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn(
>> WFConnection -> WFReportParams )
>> Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1: Starting
>> asender thread (from drbd_r_resource [2039])
>> Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1:
>> drbd_sync_handshake:
>> Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self
>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>> bits:3072 flags:0
>> Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer
>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>> flags:0
>> Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1:
>> uuid_compare()=100 by rule 90
>> Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper
>> command: /bin/true initial-split-brain minor-1
>> Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn(
>> WFReportParams -> NetworkFailure )
>> Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1: asender
>> terminated
>> Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1:
>> Terminating asender thread
>> Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper
>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>> Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper
>> command: /bin/true split-brain minor-1
>> Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper
>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>> Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn(
>> NetworkFailure -> Disconnecting )
>> Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1:
>> Connection closed
>> Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn(
>> Disconnecting -> StandAlone )
>> Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1: receiver
>> terminated
>> Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1:
>> Terminating receiver thread
>> Apr 26 11:19:46 primary kernel: [181726.054479] br974: port 2(vif126.0)
>> entering forwarding state
>> Apr 26 11:19:46 primary kernel: [181726.058824] br974: port 2(vif126.0)
>> entering disabled state
>>
>> (Old) secondary:
>> Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role(
>> Secondary -> Primary )
>> Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role(
>> Secondary -> Primary )
>> Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1: peer(
>> Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate ->
>> DUnknown )
>> Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new
>> current UUID
>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
>> Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1:
>> asender terminated
>> Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1:
>> Terminating asender thread
>> Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1:
>> Connection closed
>> Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1: conn(
>> ProtocolError -> Unconnected )
>> Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1:
>> receiver terminated
>> Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1:
>> Restarting receiver thread
>> Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1:
>> receiver (re)started
>> Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1: conn(
>> Unconnected -> WFConnection )
>> Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1:
>> Handshake successful: Agreed network protocol version 101
>> Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1: Peer
>> authenticated using 16 bytes HMAC
>> Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1: conn(
>> WFConnection -> WFReportParams )
>> Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1:
>> Starting asender thread (from drbd_r_resource [20607])
>> Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1:
>> drbd_sync_handshake:
>> Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self
>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>> flags:0
>> Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer
>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>> bits:3072 flags:0
>> Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1:
>> uuid_compare()=100 by rule 90
>> Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper
>> command: /bin/true initial-split-brain minor-1
>> Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper
>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>> Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper
>> command: /bin/true split-brain minor-1
>> Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper
>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>> Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1: conn(
>> WFReportParams -> Disconnecting )
>> Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1:
>> asender terminated
>> Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1:
>> Terminating asender thread
>> Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1:
>> Connection closed
>> Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1: conn(
>> Disconnecting -> StandAlone )
>> Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1:
>> receiver terminated
>> Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1:
>> Terminating receiver thread
>>
>>
>> What am I doing wrong? Is there a requirement to wait for a
>> sync/propagation of properties/random amount of time before promoting the
>> secondary to primary? Is this a bug?
>>
>> Thanks,
>> Thomas
>>
>> --
>> Thomas Thrainer | Software Engineer | thomasth at google.com |
>>
>>  Google Germany GmbH
>> Dienerstr. 12
>> 80331 München
>>
>> Registergericht und -nummer: Hamburg, HRB 86891
>> Sitz der Gesellschaft: Hamburg
>> Geschäftsführer: Graham Law, Katherine Stephens
>>

