Hi Thomas,

Inline reply below.

On Fri, Apr 26, 2013 at 4:14 PM, Thomas Thrainer <thomasth at google.com> wrote:

> Hi Luca,
>
> (CC'd drbd-user, I guess that might be helpful for others as well)
>

Just reply to the list; I'm subscribed ;)

> We're not using drbdadm but drbdsetup directly.
>
> I tried `drbdsetup net-options ipv4:<local_ip>:11001
> ipv4:<remote_ip>:11001 --protocol C --allow-two-primaries=yes` (i.e. I
> stripped the repeated options), but the result is still the same.
>

I'm not 100% sure, but I think that repeating
ipv4:<local_ip>:local_port ipv4:<remote_ip>:remote_port restarts the
connection; under low load that happens fast enough, but under high load
it fails.
Just try to use "drbdadm net-options --allow-two-primaries=yes r0" on one
node only.
Do you have a good reason to use drbdsetup directly?

Cheers,
Luca

> Note however, that the problem occurs only every now and then, and
> primarily when there is load on the disk(s).
>
> BTW, I actually do set two disks to dual-primary mode at the same time
> (using different connections/resources though), and one disk normally works
> while the other fails (it's not deterministic which disk fails).
>
> Cheers,
> Thomas
>
>
> On Fri, Apr 26, 2013 at 3:58 PM, Luca Fornasari <luca.fornasari at gmail.com> wrote:
>
>> Hi Thomas,
>>
>> Just execute the following on one node only:
>>
>> drbdadm net-options --protocol=C --allow-two-primaries r0
>>
>> I guess that the command you are issuing just tries to restart an already
>> running resource.
>>
>> Cheers,
>> Luca
>>
>>
>> On Fri, Apr 26, 2013 at 2:27 PM, Thomas Thrainer <thomasth at google.com> wrote:
>>
>>> Hi,
>>>
>>> I've encountered a problem with DRBD 8.4.2 when I try to enable
>>> --allow-two-primaries on the fly and immediately promote the secondary to
>>> primary afterwards.
>>> The problem doesn't always occur, and it seems more likely to
>>> happen when there is more load on the device.
>>>
>>> The exact command sequence is as follows:
>>>
>>> Executed on primary and secondary node simultaneously (but also happens
>>> if only executed on secondary):
>>>
>>> drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001 --protocol C --after-sb-0pri discard-zero-changes --after-sb-1pri consensus --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
>>> drbdsetup primary 1
>>>
>>> BTW, the only option that differs from the previously issued
>>> drbdsetup connect command is --allow-two-primaries. The rest (protocol,
>>> secret, etc.) are just repeated.
>>>
>>> The outcome is that both nodes end up in the StandAlone state.
>>>
>>> Their respective kernel log messages are:
>>>
>>> (Old) primary:
>>> Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer( Secondary -> Primary )
>>> Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer( Secondary -> Primary )
>>> Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock was shut down by peer
>>> Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
>>> Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short read (expected size 16)
>>> Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new current UUID DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>> Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1: asender terminated
>>> Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1: Terminating asender thread
>>> Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1: Connection closed
>>> Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn( BrokenPipe -> Unconnected )
>>> Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1: receiver terminated
>>> Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1: Restarting receiver thread
>>> Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1: receiver (re)started
>>> Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn( Unconnected -> WFConnection )
>>> Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1: Handshake successful: Agreed network protocol version 101
>>> Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer authenticated using 16 bytes HMAC
>>> Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn( WFConnection -> WFReportParams )
>>> Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1: Starting asender thread (from drbd_r_resource [2039])
>>> Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1: drbd_sync_handshake:
>>> Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
>>> Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
>>> Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1: uuid_compare()=100 by rule 90
>>> Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper command: /bin/true initial-split-brain minor-1
>>> Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn( WFReportParams -> NetworkFailure )
>>> Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1: asender terminated
>>> Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1: Terminating asender thread
>>> Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper command: /bin/true split-brain minor-1
>>> Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn( NetworkFailure -> Disconnecting )
>>> Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1: Connection closed
>>> Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn( Disconnecting -> StandAlone )
>>> Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1: receiver terminated
>>> Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1: Terminating receiver thread
>>> Apr 26 11:19:46 primary kernel: [181726.054479] br974: port 2(vif126.0) entering forwarding state
>>> Apr 26 11:19:46 primary kernel: [181726.058824] br974: port 2(vif126.0) entering disabled state
>>>
>>> (Old) secondary:
>>> Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role( Secondary -> Primary )
>>> Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role( Secondary -> Primary )
>>> Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
>>> Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new current UUID 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
>>> Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1: asender terminated
>>> Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1: Terminating asender thread
>>> Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1: Connection closed
>>> Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1: conn( ProtocolError -> Unconnected )
>>> Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1: receiver terminated
>>> Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1: Restarting receiver thread
>>> Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1: receiver (re)started
>>> Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1: conn( Unconnected -> WFConnection )
>>> Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1: Handshake successful: Agreed network protocol version 101
>>> Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1: Peer authenticated using 16 bytes HMAC
>>> Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1: conn( WFConnection -> WFReportParams )
>>> Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1: Starting asender thread (from drbd_r_resource [20607])
>>> Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1: drbd_sync_handshake:
>>> Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
>>> Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
>>> Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1: uuid_compare()=100 by rule 90
>>> Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper command: /bin/true initial-split-brain minor-1
>>> Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper command: /bin/true split-brain minor-1
>>> Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>> Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1: conn( WFReportParams -> Disconnecting )
>>> Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1: asender terminated
>>> Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1: Terminating asender thread
>>> Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1: Connection closed
>>> Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1: conn( Disconnecting -> StandAlone )
>>> Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1: receiver terminated
>>> Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1: Terminating receiver thread
>>>
>>>
>>> What am I doing wrong? Is there a requirement to wait for a
>>> sync/propagation of properties/random amount of time before promoting the
>>> secondary to primary? Is this a bug?
>>>
>>> Thanks,
>>> Thomas
>>>
>>> --
>>> Thomas Thrainer | Software Engineer | thomasth at google.com |
>>>
>>> Google Germany GmbH
>>> Dienerstr. 12
>>> 80331 München
>>>
>>> Registration court and number: Hamburg, HRB 86891
>>> Registered office: Hamburg
>>> Managing directors: Graham Law, Katherine Stephens
>>>
>>> _______________________________________________
>>> drbd-user mailing list
>>> drbd-user at lists.linbit.com
>>> http://lists.linbit.com/mailman/listinfo/drbd-user
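
For anyone comparing the two kernel logs quoted above, the connection-state changes are easier to follow once they are pulled out on their own. The following is only a small sketch: `conn_transitions` is a hypothetical helper name, and it assumes each log entry sits on a single line (i.e. the mail's line wrapping has been undone). The sample input is copied from the (old) primary's log.

```shell
#!/bin/sh
# conn_transitions (hypothetical helper): read kernel log lines on stdin and
# print "<kernel timestamp> <old state> -> <new state>" for every entry that
# contains a DRBD "conn( A -> B )" transition. Other entries are skipped.
# Assumes one log entry per line.
conn_transitions() {
    sed -n 's/.*\[\([0-9.]*\)\].*conn( \([A-Za-z]*\) -> \([A-Za-z]*\) ).*/\1 \2 -> \3/p'
}

# Three entries taken from the (old) primary's log in this thread.
conn_transitions <<'EOF'
Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short read (expected size 16)
Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn( Disconnecting -> StandAlone )
EOF
```

With the sample input this prints `181722.057872 Connected -> BrokenPipe` and `181722.586237 Disconnecting -> StandAlone`. Running both nodes' logs through the same filter makes it easy to see that the primary leaves Connected via BrokenPipe while the secondary leaves it via ProtocolError.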