<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 2019-04-17 12:20 p.m., JCA wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFy1yb1dkiXh4wY6TYS00Cu1RPH860a8JTXsKNMUdTczzCkx0Q@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div dir="ltr">
          <div dir="ltr">
            <div dir="ltr">
              <div dir="ltr">I have a two-node cluster, consisting of
                two CentOS 7 VMs, as follows:
                <div><br>
                </div>
                <div>
                  <div>Cluster name: ClusterOne</div>
                  <div>Stack: corosync</div>
                  <div>Current DC: two (version
                    1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum</div>
                  <div>Last updated: Wed Apr 17 09:43:42 2019</div>
                  <div>Last change: Wed Apr 17 09:39:52 2019 by root via
                    cibadmin on one</div>
                  <div><br>
                  </div>
                  <div>2 nodes configured</div>
                  <div>4 resources configured</div>
                  <div><br>
                  </div>
                  <div>Online: [ one two ]</div>
                  <div><br>
                  </div>
                  <div>Full list of resources:</div>
                  <div><br>
                  </div>
                  <div> MyAppCluster<span style="white-space:pre">        </span>(ocf::myapps:MyApp):<span style="white-space:pre">        </span>Started
                    one</div>
                  <div> Master/Slave Set: DrbdDataClone [DrbdData]</div>
                  <div>     Masters: [ one ]</div>
                  <div>     Slaves: [ two ]</div>
                  <div> DrbdFS<span style="white-space:pre">        </span>(ocf::heartbeat:Filesystem):<span style="white-space:pre">        </span>Started
                    one</div>
                  <div><br>
                  </div>
                  <div>Daemon Status:</div>
                  <div>  corosync: active/enabled</div>
                  <div>  pacemaker: active/enabled</div>
                  <div>  pcsd: active/enabled</div>
                </div>
                <div><br>
                </div>
                <div>The DRBD software that I am using is the following:</div>
                <div><br>
                </div>
                <div>
                  <div>drbd84-utils.x86_64    9.6.0-1.el7.elrepo         @elrepo</div>
                  <div>kmod-drbd84.x86_64     8.4.11-1.1.el7_6.elrepo    @elrepo</div>
                </div>
                <div><br>
                </div>
                <div>The nodes have been configured to share an ext4
                  partition, DrbdFS has been configured to start before
                  MyAppCluster, and the ClusterOne cluster has been
                  configured to start automatically at boot time.</div>
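                <div><br>
                </div>
                <div>For reference, the ordering and colocation were set
                  up with pcs roughly along these lines (a sketch; the
                  exact commands may have differed slightly):</div>
                <pre># Promote DRBD before mounting the filesystem, and keep them together
pcs constraint order promote DrbdDataClone then start DrbdFS
pcs constraint colocation add DrbdFS with master DrbdDataClone INFINITY
# Start the application only after its filesystem is available
pcs constraint order DrbdFS then MyAppCluster
pcs constraint colocation add MyAppCluster with DrbdFS INFINITY</pre>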
                <div><br>
                </div>
                <div>The setup above works, in that node two takes over
                  from node one when the latter becomes unreachable, and
                  the DrbdFS filesystem automatically becomes available
                  to node two, at the correct mount point, in that
                  situation.</div>
                <div><br>
                </div>
                <div>Now when I reboot one and two, occasionally - but
                  often enough to make me feel uneasy - DrbdFS comes up
                  in a split-brain condition. What follows are the boot
                  time syslog traces I typically get in such a case:</div>
                <div><br>
                </div>
                <div>Apr 17 09:35:59 one pengine[3663]:  notice:  *
                  Start      ClusterOne    (           one )</div>
                <div>
                  <div>Apr 17 09:35:59 one pengine[3663]:  notice:  *
                    Start      DrbdFS           (          one )</div>
                  <div>Apr 17 09:35:59 one pengine[3663]:  notice:
                    Calculated transition 4, saving inputs in
                    /var/lib/pacemaker/pengine/pe-input-560.bz2</div>
                  <div>Apr 17 09:35:59 one crmd[3664]:  notice:
                    Initiating monitor operation DrbdData_monitor_30000
                    on two</div>
                  <div>Apr 17 09:35:59 one crmd[3664]:  notice:
                    Initiating start operation DrbdFS_start_0 locally on
                    one</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data:
                    Handshake successful: Agreed network protocol
                    version 101</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data:
                    Feature flags enabled on protocol level: 0xf TRIM
                    THIN_RESYNC WRITE_SAME WRITE_ZEROES.</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data:
                    conn( WFConnection -&gt; WFReportParams )</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data:
                    Starting ack_recv thread (from drbd_r_myapp-dat
                    [4406])</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1:
                    drbd_sync_handshake:</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1: self
                    002DDA8B166FC899:8DD977B102052FD2:</div>
                  <div>BE3891694D7BCD54:BE3791694D7BCD54 bits:0 flags:0</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1: peer
                    D12D1947C4ECF940:8DD977B102052FD2:</div>
                  <div>BE3891694D7BCD54:BE3791694D7BCD54 bits:32 flags:0</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1:
                    uuid_compare()=100 by rule 90</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1: helper
                    command: /sbin/drbdadm initial-split-brain minor-1</div>
                  <div>Apr 17 09:35:59 one Filesystem(DrbdFS)[4531]:
                    INFO: Running start for /dev/drbd1 on /var/lib/myapp</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1: helper
                    command: /sbin/drbdadm initial-split-brain minor-1
                    exit code 0 (0x0)</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1:
                    Split-Brain detected but unresolved, dropping
                    connection!</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1: helper
                    command: /sbin/drbdadm split-brain minor-1</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data: meta
                    connection shut down by peer.</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data:
                    conn( WFReportParams -&gt; NetworkFailure )</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data:
                    ack_receiver terminated</div>
                  <div>Apr 17 09:35:59 one kernel: drbd myapp-data:
                    Terminating drbd_a_myapp-dat</div>
                  <div>Apr 17 09:35:59 one kernel: block drbd1: helper
                    command: /sbin/drbdadm split-brain minor-1 exit code
                    0 (0x0)</div>
                </div>
                <div><br>
                </div>
                <div>   Fixing the problem by manual intervention, once
                  both nodes are up and running, is not difficult.
                  However, I would like to understand why the
                  split-brain condition sometimes arises at boot time
                  and, more importantly, how to prevent it from
                  happening, if at all possible.</div>
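                <div><br>
                </div>
                <div>(The manual fix is essentially the standard DRBD
                  8.4 split-brain recovery: pick a victim node whose
                  local changes are discarded, then reconnect. A
                  sketch, assuming the resource name myapp-data from
                  the logs above and node two as the victim:)</div>
                <pre># On the victim node (its unsynced changes are thrown away):
drbdadm secondary myapp-data
drbdadm connect --discard-my-data myapp-data

# On the surviving node, if it has also dropped the connection:
drbdadm connect myapp-data</pre>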
                <div><br>
                </div>
                <div>   Suggestions?</div>
              </div>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <p>Set up stonith in Pacemaker and, once it is tested, enable
      fencing in DRBD. This is exactly what fencing is for.</p>
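    <p>A minimal sketch of what that can look like with DRBD 8.4 and
      pcs (the stonith agent and device names below are examples for a
      libvirt/fence_xvm setup; adapt them to your environment):</p>
    <pre># /etc/drbd.d/myapp-data.res -- tell DRBD to fence through the cluster
resource myapp-data {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # ... existing disk, device and network sections ...
}

# Pacemaker side: one stonith device per VM, then enable stonith
pcs stonith create fence-one fence_xvm port="one" pcmk_host_list="one"
pcs stonith create fence-two fence_xvm port="two" pcmk_host_list="two"
pcs property set stonith-enabled=true</pre>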
    <p>digimer<br>
    </p>
  </body>
</html>