Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,
I am in the process of testing a 3-node cluster (2 real nodes and 1 quorum node)
with Pacemaker 1.1.11 + Corosync 2.3.3 and DRBD 8.3.11 on Ubuntu 12.04. I have
backported most of these packages into this PPA:
https://launchpad.net/~xespackages/+archive/clustertesting
I have configured a single-primary DRBD resource and set it up to run on either
node (node0 or node1):
primitive p_drbd_drives ocf:linbit:drbd \
    params drbd_resource="r0" \
    op start interval="0" timeout="240" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" role="Master" timeout="90" \
    op monitor interval="20" role="Slave" timeout="60"
ms ms_drbd_drives p_drbd_drives \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
colocation c_drbd_fs_services inf: g_store ms_drbd_drives:Master
order o_drbd_fs_services inf: ms_drbd_drives:promote g_store:start
As you can see, it is colocated with a group of other resources (g_store), and the
order constraint ensures that the DRBD resource is promoted before the other
resources are started. Due to this bug, I am stuck at DRBD 8.3.11:
https://bugs.launchpad.net/ubuntu/+source/drbd8/+bug/1185756
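For reference, g_store itself is nothing exotic: a Filesystem on the DRBD device plus
the services that use it. The member names below are made up for illustration and are
not my exact configuration:
# member names are illustrative only
primitive p_fs_drives ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/srv/drives" fstype="ext4" \
    op monitor interval="20" timeout="40"
group g_store p_fs_drives p_exported_service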
However, the crm-fence-peer.sh shipped with DRBD 8.3.11 doesn't support newer
versions of Pacemaker, which no longer use ha="active" in the <node_state> tag:
http://lists.linbit.com/pipermail/drbd-user/2012-October/019204.html
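To illustrate what I mean (attribute sets reproduced from memory, so they may not be
exact; I believe the change came in around Pacemaker 1.1.8):
older Pacemaker, which the 8.3.11 script expects to see:
    <node_state id="..." uname="node0" ha="active" in_ccm="true" crmd="online" join="member" expected="member"/>
Pacemaker 1.1.8 and later, no "ha" attribute any more:
    <node_state id="..." uname="node0" in_ccm="true" crmd="online" join="member" expected="member"/>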
Therefore, I updated the copy of /usr/lib/drbd/crm-fence-peer.sh on all nodes to
use the latest version in the DRBD 8.3 series (2013-09-09):
http://git.linbit.com/gitweb.cgi?p=drbd-8.3.git;a=history;f=scripts/crm-fence-peer.sh;h=6c8c6a4eda870b506b175d9833fea94761237d20;hb=HEAD
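For completeness, the handler is wired into r0 in the standard way; paraphrased from
memory and trimmed to the relevant bits:
resource r0 {
    disk {
        fencing resource-only;   # fencing policy; resource-and-stonith is the other common choice
    }
    handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # devices, disks and addresses omitted
}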
During testing, I've tried shutting down the currently active node. When I do so,
the fence-peer handler inserts the constraint correctly, but it exits with
exit code 5:
INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-ms_drbd_drives'
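Typed from memory (IDs shortened, and "node1" stands for whichever node placed the
constraint, i.e. the survivor), the constraint it places looks roughly like this:
<rsc_location rsc="ms_drbd_drives" id="drbd-fence-by-handler-ms_drbd_drives">
  <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-rule-ms_drbd_drives">
    <!-- "node1" = the node that placed the constraint -->
    <expression attribute="#uname" operation="ne" value="node1"/>
  </rule>
</rsc_location>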
crm-fence-peer.sh exit codes:
http://www.drbd.org/users-guide-8.3/s-fence-peer.html
I can see this constraint in the CIB; however, the remaining (still secondary)
node fails to promote. Moreover, when the original node is powered back on, it
attempts to remove the constraint by calling crm-unfence-peer.sh, which exits
with code 0 and removes the constraint. However, it doesn't seem to recognize
this and keeps calling crm-unfence-peer.sh repeatedly.
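For what it's worth, this is roughly how I've been watching it (commands paraphrased;
the constraint id may differ slightly on your system):
# is the fencing constraint (still) present in the CIB?
cibadmin -Q --xpath "//rsc_location[@id='drbd-fence-by-handler-ms_drbd_drives']"
# watch the repeated handler invocations
grep crm-unfence-peer /var/log/syslog | tail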
How can I resolve these problems with crm-fence-peer.sh? Does exit code 5 indicate
a state in which it is acceptable for DRBD to promote the resource on the remaining
node? It would seem so, given that the constraint would prevent the DRBD resource
from being promoted on the failed node until it has rejoined the cluster.
Thanks,
Andrew