[DRBD-user] Enabling two primaries break connection

Luca Fornasari luca.fornasari at gmail.com
Tue Apr 30 11:45:06 CEST 2013

Hi Thomas,

From Ganeti's point of view it makes perfect sense to use drbdsetup directly
(I'm not familiar with using it this way), and you are right: drbdadm is "just" a
wrapper around drbdsetup. Anyway, you should give "drbdadm -d net-options
--allow-two-primaries=yes r0" a try. Please note the -d switch, which just does a
dry run and prints out the corresponding drbdsetup invocation.
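
Since Ganeti drives DRBD programmatically, the dry run could be wrapped in a few lines of Python. This is only a sketch under assumptions: the function names are made up for illustration, the resource name r0 is the one used in this thread, and the "net-options" subcommand is the one used with drbdadm elsewhere in this thread.

```python
import shutil
import subprocess


def build_dry_run_cmd(resource):
    """Return the argv list for a drbdadm dry run (hypothetical helper)."""
    return [
        "drbdadm",
        "-d",  # dry run: print the drbdsetup call instead of executing it
        "net-options",
        "--allow-two-primaries=yes",
        resource,
    ]


def show_drbdsetup_invocation(resource):
    """Run the dry run and return drbdadm's output, or None if drbdadm is not installed."""
    if shutil.which("drbdadm") is None:
        return None
    result = subprocess.run(
        build_dry_run_cmd(resource),
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
    return result.stdout.strip()
```

Calling show_drbdsetup_invocation("r0") on a node with DRBD installed should return whatever drbdsetup command line drbdadm prints, which can then be compared with the command Ganeti issues itself.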

But look at your log file:
Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer( Secondary -> Primary )
Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock was shut down by peer
Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short read (expected size 16)

Soon after resource1 changed its role from Secondary to Primary on the
remote node, the network socket was closed by the remote node itself.
What do the logs on the remote side say?
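
To compare both sides more easily, the state transitions could be pulled out of the kernel logs with a small script. The following is a minimal sketch, not an official DRBD tool: the function names are hypothetical, and the regular expression only assumes the "name( Old -> New )" notation visible in the log lines of this thread.

```python
import re

# Matches fragments such as "conn( Connected -> BrokenPipe )".
# \s* also spans the line breaks that mail clients insert into long log lines.
TRANSITION = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def state_transitions(log_text):
    """Return (field, old_state, new_state) tuples in order of appearance."""
    return TRANSITION.findall(log_text)


def first_connection_loss(log_text):
    """Return the first conn() transition that leaves Connected, or None."""
    for field, old, new in state_transitions(log_text):
        if field == "conn" and old == "Connected":
            return old, new
    return None


# Sample taken from the primary's log in this thread, with mail line wraps.
sample = """\
Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer(
Secondary -> Primary )
Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer(
Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate ->
DUnknown )
"""

for field, old, new in state_transitions(sample):
    print(f"{field}: {old} -> {new}")
```

Running the same function over the log of each node shows which side left the Connected state first and into which state it went.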

I'm not aware of any bug in 8.4.2 related to dual-primary mode, but I could be
wrong. Note, though, that 8.4.3 is out.

Cheers,
Luca



On Mon, Apr 29, 2013 at 9:42 AM, Thomas Thrainer <thomasth at google.com> wrote:

> Hi Luca,
>
> we (Ganeti) use drbdsetup directly because it's the more "programmatic"
> way of configuring DRBD. We don't want to manage the configuration file,
> but just reconfigure machines again as they join a cluster. You could think
> of Ganeti as managing the DRBD configuration in a different way.
>
> Anyway, drbdadm is just a wrapper around drbdsetup. In particular, "drbdadm
> net-options --allow-two-primaries=yes r0" just calls "drbdsetup net-options
> ipv4:<local_ip>:11001 ipv4:<remote_ip>:11001 --allow-two-primaries=yes"
> (according to
> http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=blob;f=user/drbdadm_main.c;h=8179625f7bef7172c07974dd63bf76ddb10b0d60;hb=HEAD#l1668).
> Can adding "--protocol=C" really make a difference? Especially if
> --allow-two-primaries only works with protocol C anyways?
> Additionally, the actual problem occurs as soon as I issue a "drbdsetup
> primary 0" (which is what "drbdadm primary r0" calls as well).
>
> As I stated, I tried to call the above command on one and on both sides
> simultaneously, with the same outcome.
>
> So, am I hitting a DRBD bug here? Or do you have other ideas of what I
> could do wrong?
>
> Cheers,
> Thomas
>
>
> On Fri, Apr 26, 2013 at 5:04 PM, Luca Fornasari <luca.fornasari at gmail.com> wrote:
>
>> Hi Thomas,
>>
>> Inline reply below
>>
>> On Fri, Apr 26, 2013 at 4:14 PM, Thomas Thrainer <thomasth at google.com> wrote:
>>
>>> Hi Luca,
>>>
>>> (CC'd drbd-user, I guess that might be helpful for others as well)
>>>
>>
>> Just reply to the list; I'm subscribed ;)
>>
>>
>>>  We're not using drbdadm but drbdsetup directly.
>>>
>>> I tried `drbdsetup net-options ipv4:<local_ip>:11001
>>> ipv4:<remote_ip>:11001 --protocol C --allow-two-primaries=yes`
>>> (i.e. I stripped the repeated options), but the result is
>>> still the same.
>>>
>>
>> I'm not 100% sure, but I think that repeating ipv4:<local_ip>:local_port
>> ipv4:<remote_ip>:remote_port restarts the connection; under low load
>> that happens fast enough, while under high load it fails.
>> Just try to use "drbdadm net-options --allow-two-primaries=yes r0" on one
>> node only.
>> Do you have a good reason to use drbdsetup directly?
>>
>> Cheers,
>> Luca
>>
>>
>>> Note however, that the problem occurs only every now and then, and
>>> primarily when there is load on the disk(s).
>>>
>>> BTW, I actually do set two disks to dual-primary mode at the same time
>>> (using different connections/resources though), and one disk normally works
>>> while the other fails (it's not deterministic which disk fails).
>>>
>>> Cheers,
>>> Thomas
>>>
>>>
>>> On Fri, Apr 26, 2013 at 3:58 PM, Luca Fornasari <
>>> luca.fornasari at gmail.com> wrote:
>>>
>>>> Hi Thomas,
>>>>
>>>> Just execute the following on one node only:
>>>>
>>>> drbdadm net-options --protocol=C --allow-two-primaries r0
>>>>
>>>> I guess that the command you are issuing just tries to restart an already
>>>> running resource.
>>>>
>>>> Cheers,
>>>> Luca
>>>>
>>>>
>>>> On Fri, Apr 26, 2013 at 2:27 PM, Thomas Thrainer <thomasth at google.com> wrote:
>>>>
>>>>>  Hi,
>>>>>
>>>>> I've encountered a problem with DRBD 8.4.2 when I try to enable
>>>>> --allow-two-primaries on the fly and immediately promote the secondary to
>>>>> primary afterwards.
>>>>> The problem doesn't always occur, and it seems more likely
>>>>> to happen when there is more load on the device.
>>>>>
>>>>> The exact command sequence is as follows:
>>>>>
>>>>> Executed on primary and secondary node simultaneously (but also
>>>>> happens if only executed on secondary):
>>>>>
>>>>> drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001
>>>>> --protocol C --after-sb-0pri discard-zero-changes --after-sb-1pri consensus
>>>>> --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
>>>>> drbdsetup primary 1
>>>>>
>>>>> BTW, the only option which differs from the previously issued
>>>>> drbdsetup connect command is --allow-two-primaries. The rest (protocol,
>>>>> secret, etc.) is just repeated.
>>>>>
>>>>> The outcome is that both nodes end up in the StandAlone state.
>>>>>
>>>>> Their respective kernel log messages are:
>>>>>
>>>>> (Old) primary:
>>>>> Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer(
>>>>> Secondary -> Primary )
>>>>> Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer(
>>>>> Secondary -> Primary )
>>>>> Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock
>>>>> was shut down by peer
>>>>> Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer(
>>>>> Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate ->
>>>>> DUnknown )
>>>>> Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short
>>>>> read (expected size 16)
>>>>> Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new
>>>>> current UUID
>>>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>>>> Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1:
>>>>> asender terminated
>>>>> Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1:
>>>>> Terminating asender thread
>>>>> Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1:
>>>>> Connection closed
>>>>> Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn(
>>>>> BrokenPipe -> Unconnected )
>>>>> Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1:
>>>>> receiver terminated
>>>>> Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1:
>>>>> Restarting receiver thread
>>>>> Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1:
>>>>> receiver (re)started
>>>>> Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn(
>>>>> Unconnected -> WFConnection )
>>>>> Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1:
>>>>> Handshake successful: Agreed network protocol version 101
>>>>> Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer
>>>>> authenticated using 16 bytes HMAC
>>>>> Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn(
>>>>> WFConnection -> WFReportParams )
>>>>> Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1:
>>>>> Starting asender thread (from drbd_r_resource [2039])
>>>>> Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1:
>>>>> drbd_sync_handshake:
>>>>> Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self
>>>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>>>> bits:3072 flags:0
>>>>> Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer
>>>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>>>>> flags:0
>>>>> Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1:
>>>>> uuid_compare()=100 by rule 90
>>>>> Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper
>>>>> command: /bin/true initial-split-brain minor-1
>>>>> Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn(
>>>>> WFReportParams -> NetworkFailure )
>>>>> Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1:
>>>>> asender terminated
>>>>> Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1:
>>>>> Terminating asender thread
>>>>> Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper
>>>>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>>>> Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper
>>>>> command: /bin/true split-brain minor-1
>>>>> Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper
>>>>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>>>> Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn(
>>>>> NetworkFailure -> Disconnecting )
>>>>> Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1:
>>>>> Connection closed
>>>>> Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn(
>>>>> Disconnecting -> StandAlone )
>>>>> Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1:
>>>>> receiver terminated
>>>>> Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1:
>>>>> Terminating receiver thread
>>>>> Apr 26 11:19:46 primary kernel: [181726.054479] br974: port
>>>>> 2(vif126.0) entering forwarding state
>>>>> Apr 26 11:19:46 primary kernel: [181726.058824] br974: port
>>>>> 2(vif126.0) entering disabled state
>>>>>
>>>>> (Old) secondary:
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role(
>>>>> Secondary -> Primary )
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role(
>>>>> Secondary -> Primary )
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1:
>>>>> peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk(
>>>>> UpToDate -> DUnknown )
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new
>>>>> current UUID
>>>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1:
>>>>> asender terminated
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1:
>>>>> Terminating asender thread
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1:
>>>>> Connection closed
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1:
>>>>> conn( ProtocolError -> Unconnected )
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1:
>>>>> receiver terminated
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1:
>>>>> Restarting receiver thread
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1:
>>>>> receiver (re)started
>>>>> Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1:
>>>>> conn( Unconnected -> WFConnection )
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1:
>>>>> Handshake successful: Agreed network protocol version 101
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1:
>>>>> Peer authenticated using 16 bytes HMAC
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1:
>>>>> conn( WFConnection -> WFReportParams )
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1:
>>>>> Starting asender thread (from drbd_r_resource [20607])
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1:
>>>>> drbd_sync_handshake:
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self
>>>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>>>>> flags:0
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer
>>>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>>>> bits:3072 flags:0
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1:
>>>>> uuid_compare()=100 by rule 90
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper
>>>>> command: /bin/true initial-split-brain minor-1
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper
>>>>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper
>>>>> command: /bin/true split-brain minor-1
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper
>>>>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1:
>>>>> conn( WFReportParams -> Disconnecting )
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1:
>>>>> asender terminated
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1:
>>>>> Terminating asender thread
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1:
>>>>> Connection closed
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1:
>>>>> conn( Disconnecting -> StandAlone )
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1:
>>>>> receiver terminated
>>>>> Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1:
>>>>> Terminating receiver thread
>>>>>
>>>>>
>>>>> What am I doing wrong? Is there a requirement to wait for a
>>>>> sync/propagation of properties/random amount of time before promoting the
>>>>> secondary to primary? Is this a bug?
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>> --
>>>>> Thomas Thrainer | Software Engineer | thomasth at google.com |
>>>>>
>>>>>  Google Germany GmbH
>>>>> Dienerstr. 12
>>>>> 80331 München
>>>>>
>>>>> Registration court and number: Hamburg, HRB 86891
>>>>> Registered office: Hamburg
>>>>> Managing directors: Graham Law, Katherine Stephens
>>>>>
>>>>> _______________________________________________
>>>>> drbd-user mailing list
>>>>> drbd-user at lists.linbit.com
>>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
>