[DRBD-user] Enabling two primaries break connection

Luca Fornasari luca.fornasari at gmail.com
Fri Apr 26 17:04:03 CEST 2013



Hi Thomas,

Inline replies below.

On Fri, Apr 26, 2013 at 4:14 PM, Thomas Thrainer <thomasth at google.com> wrote:

> Hi Luca,
>
> (CC'd drbd-user, I guess that might be helpful for others as well)
>

Just reply to the list; I'm subscribed ;)


> We're not using drbdadm but drbdsetup directly.
>
> I tried `drbdsetup net-options ipv4:<local_ip>:11001
> ipv4:<remote_ip>:11001 --protocol C --allow-two-primaries=yes`
> (i.e. I stripped the repeated options), but the result is
> still the same.
>

I'm not 100% sure, but I think that repeating ipv4:<local_ip>:local_port
ipv4:<remote_ip>:remote_port restarts the connection; under low load that
happens fast enough, while under high load it fails.
Just try "drbdadm net-options --allow-two-primaries=yes r0" on one node only.
Do you have a good reason to use drbdsetup directly?
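
A minimal sketch of that suggestion, using the drbdadm net-options form from my earlier message: change the option on one node only, then promote only once the connection reports Connected again. The resource name r0, the helper names wait_connected and enable_dual_primary, the POLL_INTERVAL variable and the 30-try limit are all assumptions for illustration, and I have not confirmed that waiting like this avoids the StandAlone outcome described in this thread.

```shell
#!/bin/sh
# Sketch only. Assumed (not from the thread): resource name r0, the helper
# names below, the POLL_INTERVAL variable and the default of 30 tries.

# Poll "drbdadm cstate" until the resource reports Connected,
# or give up after the given number of tries.
wait_connected() {
    res=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if [ "$(drbdadm cstate "$res" 2>/dev/null)" = "Connected" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-1}"
    done
    return 1
}

# Enable dual-primary on this node, wait for a stable connection,
# and only then promote the local side to primary.
enable_dual_primary() {
    res=$1
    drbdadm net-options --protocol=C --allow-two-primaries "$res" || return 1
    wait_connected "$res" || {
        echo "connection for $res did not come back; not promoting" >&2
        return 1
    }
    drbdadm primary "$res"
}
```

Run "enable_dual_primary r0" on the node that should become the second primary. If the connection never returns to Connected, the function stops before the promotion step, so the resource is not promoted.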

Cheers,
Luca


> Note, however, that the problem occurs only every now and then, and
> primarily when there is load on the disk(s).
>
> BTW, I actually do set two disks to dual-primary mode at the same time
> (using different connections/resources, though), and one disk normally works
> while the other fails (it's not deterministic which disk fails).
>
> Cheers,
> Thomas
>
>
> On Fri, Apr 26, 2013 at 3:58 PM, Luca Fornasari <luca.fornasari at gmail.com> wrote:
>
>> Hi Thomas,
>>
>> Just execute the following on one node only:
>>
>> drbdadm net-options --protocol=C --allow-two-primaries r0
>>
>> I guess that the command you are issuing just tries to restart an already
>> running resource.
>>
>> Cheers,
>> Luca
>>
>>
>> On Fri, Apr 26, 2013 at 2:27 PM, Thomas Thrainer <thomasth at google.com> wrote:
>>
>>>  Hi,
>>>
>>> I've encountered a problem with DRBD 8.4.2 when I try to enable
>>> --allow-two-primaries on the fly and immediately promote the secondary to
>>> primary afterwards.
>>> The problem doesn't always occur, and it seems more likely to
>>> happen when there is more load on the device.
>>>
>>> The exact command sequence is as follows:
>>>
>>> Executed on primary and secondary node simultaneously (but also happens
>>> if only executed on secondary):
>>>
>>> drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001 --protocol
>>> C --after-sb-0pri discard-zero-changes --after-sb-1pri consensus
>>> --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
>>> drbdsetup primary 1
>>>
>>> BTW, the only option that differs from the previously issued
>>> drbdsetup connect command is --allow-two-primaries. The rest (protocol,
>>> secret, etc.) are just repeated.
>>>
>>> The outcome is that both nodes end up in the StandAlone state.
>>>
>>> Their respective kernel log messages are:
>>>
>>> (Old) primary:
>>> Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer(
>>> Secondary -> Primary )
>>> Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer(
>>> Secondary -> Primary )
>>> Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock
>>> was shut down by peer
>>> Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer(
>>> Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate ->
>>> DUnknown )
>>> Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short
>>> read (expected size 16)
>>> Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new current
>>> UUID DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>> Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1: asender
>>> terminated
>>> Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1:
>>> Terminating asender thread
>>> Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1:
>>> Connection closed
>>> Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn(
>>> BrokenPipe -> Unconnected )
>>> Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1:
>>> receiver terminated
>>> Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1:
>>> Restarting receiver thread
>>> Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1:
>>> receiver (re)started
>>> Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn(
>>> Unconnected -> WFConnection )
>>> Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1:
>>> Handshake successful: Agreed network protocol version 101
>>> Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer
>>> authenticated using 16 bytes HMAC
>>> Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn(
>>> WFConnection -> WFReportParams )
>>> Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1:
>>> Starting asender thread (from drbd_r_resource [2039])
>>> Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1:
>>> drbd_sync_handshake:
>>> Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self
>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>> bits:3072 flags:0
>>> Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer
>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>>> flags:0
>>> Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1:
>>> uuid_compare()=100 by rule 90
>>> Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper
>>> command: /bin/true initial-split-brain minor-1
>>> Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn(
>>> WFReportParams -> NetworkFailure )
>>> Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1: asender
>>> terminated
>>> Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1:
>>> Terminating asender thread
>>> Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper
>>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper
>>> command: /bin/true split-brain minor-1
>>> Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper
>>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn(
>>> NetworkFailure -> Disconnecting )
>>> Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1:
>>> Connection closed
>>> Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn(
>>> Disconnecting -> StandAlone )
>>> Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1:
>>> receiver terminated
>>> Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1:
>>> Terminating receiver thread
>>> Apr 26 11:19:46 primary kernel: [181726.054479] br974: port 2(vif126.0)
>>> entering forwarding state
>>> Apr 26 11:19:46 primary kernel: [181726.058824] br974: port 2(vif126.0)
>>> entering disabled state
>>>
>>> (Old) secondary:
>>> Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role(
>>> Secondary -> Primary )
>>> Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role(
>>> Secondary -> Primary )
>>> Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1:
>>> peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk(
>>> UpToDate -> DUnknown )
>>> Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new
>>> current UUID
>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
>>> Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1:
>>> asender terminated
>>> Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1:
>>> Terminating asender thread
>>> Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1:
>>> Connection closed
>>> Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1:
>>> conn( ProtocolError -> Unconnected )
>>> Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1:
>>> receiver terminated
>>> Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1:
>>> Restarting receiver thread
>>> Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1:
>>> receiver (re)started
>>> Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1:
>>> conn( Unconnected -> WFConnection )
>>> Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1:
>>> Handshake successful: Agreed network protocol version 101
>>> Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1: Peer
>>> authenticated using 16 bytes HMAC
>>> Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1:
>>> conn( WFConnection -> WFReportParams )
>>> Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1:
>>> Starting asender thread (from drbd_r_resource [20607])
>>> Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1:
>>> drbd_sync_handshake:
>>> Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self
>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>>> flags:0
>>> Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer
>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>> bits:3072 flags:0
>>> Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1:
>>> uuid_compare()=100 by rule 90
>>> Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper
>>> command: /bin/true initial-split-brain minor-1
>>> Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper
>>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper
>>> command: /bin/true split-brain minor-1
>>> Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper
>>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1:
>>> conn( WFReportParams -> Disconnecting )
>>> Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1:
>>> asender terminated
>>> Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1:
>>> Terminating asender thread
>>> Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1:
>>> Connection closed
>>> Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1:
>>> conn( Disconnecting -> StandAlone )
>>> Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1:
>>> receiver terminated
>>> Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1:
>>> Terminating receiver thread
>>>
>>>
>>> What am I doing wrong? Is there a requirement to wait for a
>>> sync/propagation of properties/random amount of time before promoting the
>>> secondary to primary? Is this a bug?
>>>
>>> Thanks,
>>> Thomas
>>>
>>> --
>>> Thomas Thrainer | Software Engineer | thomasth at google.com |
>>>
>>>  Google Germany GmbH
>>> Dienerstr. 12
>>> 80331 München
>>>
>>> Registergericht und -nummer: Hamburg, HRB 86891
>>> Sitz der Gesellschaft: Hamburg
>>> Geschäftsführer: Graham Law, Katherine Stephens
>>>
>>>
>>>
>>
>
>