Hi folks,

I've got a Pacemaker cluster set up as follows:

# crm configure
property no-quorum-policy=ignore
property stonith-enabled="true"

primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" op monitor interval="15s" op stop timeout=300s op start timeout=300s
ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"

primitive qs_fs ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/r0" directory="/mnt/drbd1/" fstype="ext4" options="barrier=0,noatime,nouser_xattr,data=writeback" op monitor interval=30s OCF_CHECK_LEVEL=20 on-fail=standby meta failure-timeout="60"
primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155" nic="eth0:0" op monitor interval=60s on-fail=standby meta failure-timeout="60"
primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s on-fail=standby meta failure-timeout="60"
primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s op stop timeout=300s op monitor interval=60s on-fail=standby meta failure-timeout="60"
group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq

primitive qs1-stonith stonith:external/ipmi params hostname=qs1 ipaddr=172.16.10.134 userid=root passwd=blah interface=lan op start interval=0s timeout=20s requires=nothing op monitor interval=600s timeout=20s requires=nothing
primitive qs2-stonith stonith:external/ipmi params hostname=qs2 ipaddr=172.16.10.133 userid=root passwd=blah interface=lan op start interval=0s timeout=20s requires=nothing op monitor interval=600s timeout=20s requires=nothing

location l-st-qs1 qs1-stonith -inf: qs1
location l-st-qs2 qs2-stonith -inf: qs2
colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master

order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
order ip_after_fs inf: qs_fs qs_ip
order apache_after_ip inf: qs_ip qs_apache2
order rabbitmq_after_ip inf: qs_ip qs_rabbitmq

verify
commit

Under normal operations, this is what I expect the cluster to look like:

# crm status
============
Last updated: Tue Jan 26 11:55:50 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ qs1 qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs1 ]
     Slaves: [ qs2 ]
 qs1-stonith (stonith:external/ipmi): Started qs2
 qs2-stonith (stonith:external/ipmi): Started qs1
 Resource Group: queryserver
     qs_fs (ocf::heartbeat:Filesystem): Started qs1
     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
     qs_apache2 (ocf::bering:apache2): Started qs1
     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1

If, however, a failure occurs, my configuration instructs Pacemaker to put the node on which the failure occurred into standby for 60 seconds:

# killall -9 rabbit

# crm status
============
Last updated: Tue Jan 26 11:55:56 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Node qs1: standby (on-fail)
Online: [ qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs1 ]
     Slaves: [ qs2 ]
 qs1-stonith (stonith:external/ipmi): Started qs2
 qs2-stonith (stonith:external/ipmi): Started qs1
 Resource Group: queryserver
     qs_fs (ocf::heartbeat:Filesystem): Started qs1
     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
     qs_apache2 (ocf::bering:apache2): Started qs1
     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1 FAILED

Failed actions:
    qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete): not running

After the 60-second timeout I would expect the node to come back online and DRBD replication to resume; alas, this is what I get:

# crm status
============
Last updated: Tue Jan 26 11:58:36 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ qs1 qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs2 ]
     Stopped: [ drbd_qs:0 ]
 qs1-stonith (stonith:external/ipmi): Started qs2
 Resource Group: queryserver
     qs_fs (ocf::heartbeat:Filesystem): Started qs2
     qs_ip (ocf::heartbeat:IPaddr2): Started qs2
     qs_apache2 (ocf::bering:apache2): Started qs2
     qs_rabbitmq (ocf::bering:rabbitmq): Started qs2

DRBD fail-over works properly under certain conditions (e.g. if I stop corosync, power-cycle the box, or do a manual standby fail-over), but the case described above (one of the monitored services getting killed) leaves the cluster in the undesirable state of DRBD no longer replicating.
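
In case it helps narrow things down: my guess is that I could recover by hand with something along these lines (crmsh commands, resource/node names as above; I haven't verified that this is the minimal sequence), but obviously I'd rather the cluster did it on its own once the failure-timeout expires:

# crm_mon --failcounts
# crm resource cleanup qs_rabbitmq qs1
# crm resource cleanup ms_drbd_qs qs1

I've also been wondering whether the 60-second failure-timeout only gets re-evaluated on the cluster-recheck-interval (which I've left at its default), in which case I assume something like the following would at least make the expiry kick in sooner, though I haven't confirmed that:

# crm configure property cluster-recheck-interval="1m"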

Does anyone have any ideas on what would need to be changed in the pacemaker/corosync configuration for the failed node to come back online and for its DRBD instance to go from Stopped back to the Slave state?

Thanks,

--
Mark Steele
Director of Development
Bering Media Inc.