Hi,

I've encountered a problem with DRBD 8.4.2 when I enable --allow-two-primaries on the fly and then immediately promote the secondary to primary. The problem doesn't always occur, and it seems more likely to happen when there is more load on the device.

The exact command sequence is as follows.

Executed on the primary and secondary nodes simultaneously (but it also happens if executed only on the secondary):

  drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001 --protocol C --after-sb-0pri discard-zero-changes --after-sb-1pri consensus --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
  drbdsetup primary 1

BTW, the only option that differs from the previously issued drbdsetup connect command is --allow-two-primaries. The rest (protocol, secret, etc.) are just repeated.

The outcome is that both nodes end up in the StandAlone state. Their respective kernel log messages are:

(Old) primary:

Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer( Secondary -> Primary )
Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer( Secondary -> Primary )
Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock was shut down by peer
Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short read (expected size 16)
Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new current UUID DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1: asender terminated
Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1: Terminating asender thread
Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1: Connection closed
Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn( BrokenPipe -> Unconnected )
Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1: receiver terminated
Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1: Restarting receiver thread
Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1: receiver (re)started
Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn( Unconnected -> WFConnection )
Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1: Handshake successful: Agreed network protocol version 101
Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer authenticated using 16 bytes HMAC
Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn( WFConnection -> WFReportParams )
Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1: Starting asender thread (from drbd_r_resource [2039])
Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1: drbd_sync_handshake:
Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1: uuid_compare()=100 by rule 90
Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper command: /bin/true initial-split-brain minor-1
Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn( WFReportParams -> NetworkFailure )
Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1: asender terminated
Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1: Terminating asender thread
Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper command: /bin/true split-brain minor-1
Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn( NetworkFailure -> Disconnecting )
Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1: Connection closed
Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn( Disconnecting -> StandAlone )
Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1: receiver terminated
Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1: Terminating receiver thread
Apr 26 11:19:46 primary kernel: [181726.054479] br974: port 2(vif126.0) entering forwarding state
Apr 26 11:19:46 primary kernel: [181726.058824] br974: port 2(vif126.0) entering disabled state

(Old) secondary:

Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role( Secondary -> Primary )
Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role( Secondary -> Primary )
Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new current UUID 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1: asender terminated
Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1: Terminating asender thread
Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1: Connection closed
Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1: conn( ProtocolError -> Unconnected )
Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1: receiver terminated
Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1: Restarting receiver thread
Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1: receiver (re)started
Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1: conn( Unconnected -> WFConnection )
Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1: Handshake successful: Agreed network protocol version 101
Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1: Peer authenticated using 16 bytes HMAC
Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1: conn( WFConnection -> WFReportParams )
Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1: Starting asender thread (from drbd_r_resource [20607])
Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1: drbd_sync_handshake:
Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1: uuid_compare()=100 by rule 90
Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper command: /bin/true initial-split-brain minor-1
Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper command: /bin/true split-brain minor-1
Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1: conn( WFReportParams -> Disconnecting )
Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1: asender terminated
Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1: Terminating asender thread
Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1: Connection closed
Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1: conn( Disconnecting -> StandAlone )
Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1: receiver terminated
Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1: Terminating receiver thread

What am I doing wrong? Is there a requirement to wait for a sync, for the propagation of properties, or for some amount of time before promoting the secondary to primary? Or is this a bug?

Thanks,
Thomas

-- 
Thomas Thrainer | Software Engineer | thomasth at google.com |
Google Germany GmbH
Dienerstr. 12
80331 München
Register court and number: Hamburg, HRB 86891
Registered office: Hamburg
Managing directors: Graham Law, Katherine Stephens
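
For anyone trying to follow the logs above, here is a minimal sketch (not part of the original report) that pulls the "field( Old -> New )" state transitions out of DRBD kernel log lines so the connection-state path on each node can be read at a glance. The regular expression, the function name `transitions`, and the `sample` string are illustrative assumptions; the sample lines are copied from the (old) primary's log with the timestamps trimmed.

```python
import re

# Matches DRBD state transitions such as "conn( Connected -> BrokenPipe )".
# The set of field names is taken from the log lines quoted in this message.
TRANSITION = re.compile(r"\b(role|peer|conn|disk|pdsk)\( (\w+) -> (\w+) \)")


def transitions(log_text, field="conn"):
    """Return (old, new) pairs for one state field, in order of appearance."""
    return [
        (old, new)
        for name, old, new in TRANSITION.findall(log_text)
        if name == field
    ]


# Lines from the (old) primary's log above, with timestamps trimmed.
sample = """\
d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
d-con resource1: conn( BrokenPipe -> Unconnected )
d-con resource1: conn( Unconnected -> WFConnection )
d-con resource1: conn( WFConnection -> WFReportParams )
d-con resource1: conn( WFReportParams -> NetworkFailure )
d-con resource1: conn( NetworkFailure -> Disconnecting )
d-con resource1: conn( Disconnecting -> StandAlone )
"""

path = transitions(sample)
# Print the chain of connection states: first "old" state, then every "new" state.
print(" -> ".join([path[0][0]] + [new for _, new in path]))
```

Running the same function over the secondary's log (or with `field="peer"`) makes it easier to compare the two nodes side by side; it only summarizes what the kernel already logged and does not diagnose the cause.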