<div dir="ltr">Hello lists,<br>I want to share my experience with DRBD 9 three-way replication and some ugly synchronization bugs.<br><br>First, I installed DRBD 9 on Ubuntu 16.04 with the latest drbd-dkms module:<br><br>    # uname -a<br>    Linux controller03 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux<br><br>    # cat /proc/drbd<br>    version: 9.0.6-1 (api:2/proto:86-112)<br>    GIT-hash: 08cda190c4f544a0c4e15ba792bbf47c69707b42 build by root@controller03, 2017-02-15 22:52:06<br>    Transports (api:15): tcp (1.0.0)<br><br>I didn&#39;t use the drbdmanage tool because it requires an additional volume group; I use drbdadm instead.<br>My idea is to create one replicated device across three primary nodes and put a shared cluster filesystem (gfs2) on it.<br>Here is my config:<br><br>    # cat /etc/drbd.d/r0.res<br>    resource r0 {<br>        net {<br>            cram-hmac-alg   sha1;<br>            shared-secret   &quot;hackme&quot;;<br>            allow-two-primaries yes;<br>            after-sb-0pri discard-zero-changes;<br>            after-sb-1pri discard-secondary;<br>            after-sb-2pri disconnect;<br>        }<br>        startup {<br>            wfc-timeout  15;<br>            degr-wfc-timeout 60;<br>            become-primary-on both;<br>        }<br>        volume 0 {<br>            device     /dev/drbd0 minor 0;<br>            disk       /dev/mapper/cl-drbd;<br>            meta-disk  internal;<br>        }<br>        on controller01 {<br>            node-id    0;<br>            address    192.168.101.11:7788;<br>        }<br>        on controller02 {<br>            node-id    1;<br>            address    192.168.101.12:7788;<br>        }<br>        on controller03 {<br>            node-id    2;<br>            address    192.168.101.13:7788;<br>        }<br>        connection-mesh {<br>            
hosts controller01 controller02 controller03;<br>            net {<br>                protocol C;<br>            }<br>        }<br>    }<br><br>After deployment and the initial synchronization, everything works fine:<br><br>    # drbdsetup status<br>    r0 role:Primary<br>      disk:UpToDate<br>      controller02 role:Primary<br>        peer-disk:UpToDate<br>      controller03 role:Primary<br>        peer-disk:UpToDate<br><br>But sometimes, for example after a forced reboot of one of the nodes, I end up in this incomprehensible situation:<br><br>    # ssh controller01 drbdsetup status<br>    r0 role:Primary<br>      disk:UpToDate<br>      controller02 role:Primary<br>        peer-disk:UpToDate<br>      controller03 connection:Connecting<br> <br>    # ssh controller02 drbdsetup status<br>    r0 role:Primary<br>      disk:UpToDate<br>      controller01 role:Primary<br>        peer-disk:UpToDate<br>      controller03 role:Primary<br>        peer-disk:UpToDate<br> <br>    # ssh controller03 drbdsetup status<br>    r0 role:Primary<br>      disk:UpToDate<br>      controller01 connection:Connecting<br>      controller02 role:Primary<br>        peer-disk:UpToDate<br><br><br>dmesg output from controller01:<br><br>    [  921.394274] drbd r0 controller03: conn( NetworkFailure -&gt; Unconnected )<br>    [  921.394295] drbd r0 controller03: Restarting receiver thread<br>    [  921.394328] drbd r0 controller03: conn( Unconnected -&gt; Connecting )<br>    [  921.885827] drbd r0 controller02: Preparing remote state change 642056233 (primary_nodes=1, weak_nodes=FFFFFFFFFFFFFFFC)<br>    [  921.885948] drbd r0 controller02: Aborting remote state change 642056233<br>    [  921.909449] drbd r0 controller03: Handshake to peer 2 successful: Agreed network protocol version 112<br>    [  921.909452] drbd r0 controller03: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.<br>    [  921.909578] drbd r0 controller03: Peer authenticated using 20 bytes HMAC<br>    [  921.909597] drbd r0 
controller03: Starting ack_recv thread (from drbd_r_r0 [2361])<br>    [  922.014142] drbd r0: Preparing cluster-wide state change 668430455 (0-&gt;2 499/145)<br>    [  922.014327] drbd r0: Aborting cluster-wide state change 668430455 (0ms) rv = -10<br>    [  922.014336] drbd r0 controller03: Failure to connect; retrying<br>    [  922.014344] drbd r0 controller03: conn( Connecting -&gt; NetworkFailure )<br>    [  922.014390] drbd r0 controller03: ack_receiver terminated<br>    [  922.014391] drbd r0 controller03: Terminating ack_recv thread<br>    [  922.038295] drbd r0 controller03: Connection closed<br><br><br>dmesg output from controller02:<br><br>    [  524.670133] drbd r0 controller01: conn( BrokenPipe -&gt; Unconnected )<br>    [  524.670145] drbd r0 controller01: Restarting receiver thread<br>    [  524.670155] drbd r0 controller01: conn( Unconnected -&gt; Connecting )<br>    [  525.170032] drbd r0 controller01: Handshake to peer 0 successful: Agreed network protocol version 112<br>    [  525.170034] drbd r0 controller01: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.<br>    [  525.170169] drbd r0 controller01: Peer authenticated using 20 bytes HMAC<br>    [  525.170176] drbd r0 controller01: Starting ack_recv thread (from drbd_r_r0 [1660])<br>    [  525.202747] drbd r0 controller01: Preparing remote state change 3822078331 (primary_nodes=0, weak_nodes=0)<br>    [  525.202918] drbd r0 controller01: Aborting remote state change 3822078331<br>    [  525.222734] drbd r0 controller01: sock was shut down by peer<br>    [  525.222742] drbd r0 controller01: conn( Connecting -&gt; BrokenPipe )<br>    [  525.222763] drbd r0 controller01: ack_receiver terminated<br>    [  525.222764] drbd r0 controller01: Terminating ack_recv thread<br>    [  525.246142] drbd r0 controller01: Connection closed<br><br>These actions also do not change anything:<br><br>    # drbdadm secondary r0<br>    # drbdadm disconnect r0<br>    # drbdadm -- 
--discard-my-data connect r0<br><br>Sometimes I see errors like these in the console output:<br><br>    [  686.816906] drbd r0: State change failed: Need a connection to start verify or resync<br>    [  730.143054] drbd r0 tcp:controller01: Closing unexpected connection from 192.168.101.12:7788<br><br>Sometimes not just one but both connections stop working.<br><br></div>