<div dir="ltr">Having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the resources consist of a cluster address, a DRBD device mirroring between the two cluster nodes, the file system, and the nfs-server resource. The resources all behave properly until an extended failover or outage.<div><br></div><div>I have tested failover in several ways (&quot;pcs cluster standby&quot;, &quot;pcs cluster stop&quot;, &quot;init 0&quot;, &quot;init 6&quot;, &quot;echo b &gt; /proc/sysrq-trigger&quot;, etc.) and the symptoms are that, until the killed node is brought back into the cluster, failover never seems to complete. The DRBD device appears on the remaining node to be in a &quot;Secondary/Unknown&quot; state, and the resources end up looking like:</div><div><br></div><div><div># pcs status</div><div>Cluster name: nfscluster</div><div>Last updated: Wed Mar 16 12:05:33 2016          Last change: Wed Mar 16 12:04:46 2016 by root via cibadmin on nfsnode01</div><div>Stack: corosync</div><div>Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum</div><div>2 nodes and 5 resources configured</div><div><br></div><div>Online: [ nfsnode01 ]</div><div>OFFLINE: [ nfsnode02 ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> nfsVIP      (ocf::heartbeat:IPaddr2):       Started nfsnode01</div><div> nfs-server     (systemd:nfs-server):   Stopped</div><div> Master/Slave Set: drbd_master [drbd_dev]</div><div>     Slaves: [ nfsnode01 ]</div><div>     Stopped: [ nfsnode02 ]</div><div> drbd_fs   (ocf::heartbeat:Filesystem):    Stopped</div><div><br></div><div>PCSD Status:</div><div>  nfsnode01: Online</div><div>  nfsnode02: Online</div><div><br></div><div>Daemon Status:</div><div>  corosync: active/enabled</div><div>  pacemaker: active/enabled</div><div>  pcsd: active/enabled</div></div><div><br></div><div>As soon as I bring the second node back online, the 
failover completes. But this is obviously not a good state, as an extended outage for any reason on one node essentially kills the cluster services. I&#39;ve clearly missed something in configuring the resources, but I haven&#39;t been able to pinpoint it yet.</div><div><br></div><div>Perusing the logs, it appears that, upon the initial failure, DRBD does in fact promote the drbd_master resource, but immediately after that, pengine calls for it to be demoted. I haven&#39;t been able to determine the reason yet, but it seems to be tied to the fencing configuration. I can see that the crm-fence-peer.sh script is called, but it almost seems like it&#39;s fencing the wrong node... Indeed, I do see that it adds a -INFINITY location constraint for the surviving node, which would explain the decision to demote the DRBD master.</div><div><br></div><div>My DRBD resource looks like this:</div><div><br></div><div><div># cat /etc/drbd.d/drbd0.res</div><div>resource drbd0 {</div><div><br></div><div>        protocol C;</div><div>        startup { wfc-timeout 0; degr-wfc-timeout 120; }</div><div><br></div><div>        disk {</div><div>            on-io-error detach;</div><div>            fencing resource-only;</div><div>        }</div><div><br></div><div>        handlers {</div><div>            fence-peer &quot;/usr/lib/drbd/crm-fence-peer.sh&quot;;</div><div>            after-resync-target &quot;/usr/lib/drbd/crm-unfence-peer.sh&quot;;</div><div>        }</div><div><br></div><div>        on nfsnode01 {</div><div>                device /dev/drbd0;</div><div>                disk /dev/vg_nfs/lv_drbd0;</div><div>                meta-disk internal;</div><div>                address 10.0.0.2:7788;</div><div>        }</div><div><br></div><div>        on nfsnode02 {</div><div>                device /dev/drbd0;</div><div>                disk /dev/vg_nfs/lv_drbd0;</div><div>                meta-disk internal;</div><div>                address 
10.0.0.3:7788;</div><div>        }</div><div>}</div></div><div><br></div><div>If I comment out the three lines having to do with fencing, the failover works properly. But I&#39;d prefer to keep the fencing in place on the off chance that we end up with a split brain instead of just a node outage...</div><div><br></div><div>And, here&#39;s &quot;pcs config --full&quot;:</div><div><br></div><div><div># pcs config --full</div><div>Cluster Name: nfscluster</div><div>Corosync Nodes:</div><div> nfsnode01 nfsnode02</div><div>Pacemaker Nodes:</div><div> nfsnode01 nfsnode02</div><div><br></div><div>Resources:</div><div> Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)</div><div>  Attributes: ip=10.0.0.1 cidr_netmask=24</div><div>  Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)</div><div>              stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)</div><div>              monitor interval=15s (nfsVIP-monitor-interval-15s)</div><div> Resource: nfs-server (class=systemd type=nfs-server)</div><div>  Operations: monitor interval=60s (nfs-server-monitor-interval-60s)</div><div> Master: drbd_master</div><div>  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true</div><div>  Resource: drbd_dev (class=ocf provider=linbit type=drbd)</div><div>   Attributes: drbd_resource=drbd0</div><div>   Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)</div><div>               promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)</div><div>               demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)</div><div>               stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)</div><div>               monitor interval=29s role=Master (drbd_dev-monitor-interval-29s)</div><div>               monitor interval=31s role=Slave (drbd_dev-monitor-interval-31s)</div><div> Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)</div><div>  Attributes: 
device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs</div><div>  Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)</div><div>              stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)</div><div>              monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)</div><div><br></div><div>Stonith Devices:</div><div>Fencing Levels:</div><div><br></div><div>Location Constraints:</div><div>Ordering Constraints:</div><div>  start nfsVIP then start nfs-server (kind:Mandatory) (id:order-nfsVIP-nfs-server-mandatory)</div><div>  start drbd_fs then start nfs-server (kind:Mandatory) (id:order-drbd_fs-nfs-server-mandatory)</div><div>  promote drbd_master then start drbd_fs (kind:Mandatory) (id:order-drbd_master-drbd_fs-mandatory)</div><div>Colocation Constraints:</div><div>  nfs-server with nfsVIP (score:INFINITY) (id:colocation-nfs-server-nfsVIP-INFINITY)</div><div>  nfs-server with drbd_fs (score:INFINITY) (id:colocation-nfs-server-drbd_fs-INFINITY)</div><div>  drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master) (id:colocation-drbd_fs-drbd_master-INFINITY)</div><div><br></div><div>Resources Defaults:</div><div> resource-stickiness: 100</div><div> failure-timeout: 60</div><div>Operations Defaults:</div><div> No defaults set</div></div><div><div><br></div><div>Cluster Properties:</div><div> cluster-infrastructure: corosync</div><div> cluster-name: nfscluster</div><div> dc-version: 1.1.13-10.el7_2.2-44eb2dd</div><div> have-watchdog: false</div><div> maintenance-mode: false</div><div> stonith-enabled: false</div><div><br></div></div></div>