Hi folks,<br><br>I&#39;ve got a Pacemaker cluster set up as follows:<br><pre># crm configure<br>property no-quorum-policy=ignore<br>property stonith-enabled=&quot;true&quot;<br><br>primitive drbd_qs ocf:linbit:drbd params drbd_resource=&quot;r0&quot; op monitor interval=&quot;15s&quot; op stop timeout=300s op start timeout=300s<br>

ms ms_drbd_qs drbd_qs meta master-max=&quot;1&quot; master-node-max=&quot;1&quot; clone-max=&quot;2&quot; clone-node-max=&quot;1&quot; notify=&quot;true&quot; <br><br>primitive qs_fs ocf:heartbeat:Filesystem params device=&quot;/dev/drbd/by-res/r0&quot; directory=&quot;/mnt/drbd1/&quot; fstype=&quot;ext4&quot; options=&quot;barrier=0,noatime,nouser_xattr,data=writeback&quot; op monitor interval=30s OCF_CHECK_LEVEL=20 on-fail=standby meta failure-timeout=&quot;60&quot;<br>

primitive qs_ip ocf:heartbeat:IPaddr2 params ip=&quot;172.16.10.155&quot; nic=&quot;eth0:0&quot; op monitor interval=60s on-fail=standby meta failure-timeout=&quot;60&quot;<br>primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s on-fail=standby meta failure-timeout=&quot;60&quot;<br>

primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s op stop timeout=300s op monitor interval=60s on-fail=standby meta failure-timeout=&quot;60&quot;<br>group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq<br>
<br>
primitive qs1-stonith stonith:external/ipmi params hostname=qs1 ipaddr=172.16.10.134 userid=root passwd=blah interface=lan op start interval=0s timeout=20s requires=nothing op monitor interval=600s timeout=20s requires=nothing<br>

primitive qs2-stonith stonith:external/ipmi params hostname=qs2 ipaddr=172.16.10.133 userid=root passwd=blah interface=lan op start interval=0s timeout=20s requires=nothing op monitor interval=600s timeout=20s requires=nothing<br>

<br>location l-st-qs1 qs1-stonith -inf: qs1<br>location l-st-qs2 qs2-stonith -inf: qs2<br>colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master<br><br>order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start<br>

order ip_after_fs inf: qs_fs qs_ip<br>order apache_after_ip inf: qs_ip qs_apache2<br>order rabbitmq_after_ip inf: qs_ip qs_rabbitmq<br><br>verify<br>commit<br><br><font style="font-family: arial,helvetica,sans-serif;" size="2">Under normal operations, this is what I expect the cluster to look like:</font><br>

<br><br># crm status                                    <br>============                                               <br>Last updated: Tue Jan 26 11:55:50 2010                     <br>Current DC: qs1 - partition with quorum                    <br>

2 Nodes configured, 2 expected votes                       <br>4 Resources configured.                                    <br>============                                               <br><br>Online: [ qs1 qs2 ]<br><br>
 Master/Slave Set: ms_drbd_qs<br>
     Masters: [ qs1 ]        <br>     Slaves: [ qs2 ]         <br> qs1-stonith    (stonith:external/ipmi):        Started qs2<br> qs2-stonith    (stonith:external/ipmi):        Started qs1<br> Resource Group: queryserver                               <br>

     qs_fs      (ocf::heartbeat:Filesystem):    Started qs1<br>     qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1<br>     qs_apache2 (ocf::bering:apache2):  Started qs1        <br>     qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1<br>

<br><font style="font-family: arial,helvetica,sans-serif;" size="2">If, however, a failure occurs, my configuration instructs Pacemaker to put the node on which the failure occurred into standby for 60 seconds:</font><br><br>

# killall -9 rabbit<br><br># crm status                                    <br>============                                               <br>Last updated: Tue Jan 26 11:55:56 2010                     <br>Current DC: qs1 - partition with quorum                    <br>

2 Nodes configured, 2 expected votes                       <br>4 Resources configured.                                    <br>============                                               <br><br>Node qs1: standby (on-fail)<br>

Online: [ qs2 ]            <br><br> Master/Slave Set: ms_drbd_qs<br>     Masters: [ qs1 ]        <br>     Slaves: [ qs2 ]         <br> qs1-stonith    (stonith:external/ipmi):        Started qs2<br> qs2-stonith    (stonith:external/ipmi):        Started qs1<br>

 Resource Group: queryserver                               <br>     qs_fs      (ocf::heartbeat:Filesystem):    Started qs1<br>     qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1<br>     qs_apache2 (ocf::bering:apache2):  Started qs1        <br>

     qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1 FAILED<br><br>Failed actions:<br>    qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete): not running<br><br><font style="font-family: arial,helvetica,sans-serif;" size="2">After the 60-second timeout, I would expect the node to come back online and DRBD replication to resume; alas, this is what I get:</font><br>

<br><br># crm status<br>============<br>Last updated: Tue Jan 26 11:58:36 2010<br>Current DC: qs1 - partition with quorum<br>2 Nodes configured, 2 expected votes<br>4 Resources configured.<br>============<br><br>Online: [ qs1 qs2 ]<br>

<br> Master/Slave Set: ms_drbd_qs<br>     Masters: [ qs2 ]<br>     Stopped: [ drbd_qs:0 ]<br> qs1-stonith    (stonith:external/ipmi):        Started qs2<br> Resource Group: queryserver<br>     qs_fs      (ocf::heartbeat:Filesystem):    Started qs2<br>

     qs_ip      (ocf::heartbeat:IPaddr2):       Started qs2<br>     qs_apache2 (ocf::bering:apache2):  Started qs2<br>     qs_rabbitmq        (ocf::bering:rabbitmq): Started qs2<br></pre><br>DRBD fail-over works properly under certain conditions (e.g. stopping corosync, power-cycling the box, or a manual standby fail-over); however, the case described above (one of the monitored services being killed) leads to the undesirable state of DRBD no longer replicating.<br>
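<br>For reference, the "manual standby fail-over" case that does work is something like the following (crmsh commands from memory; exact syntax may vary with your crm shell version):<br><pre>
```shell
# Drain the current master: resources migrate to qs2,
# DRBD demotes on qs1 and promotes on qs2.
crm node standby qs1

# Once 'crm status' shows qs2 as Master, bring qs1 back;
# it rejoins as a DRBD slave and resynchronises on its own.
crm node online qs1
```
</pre><br>In that path qs1 ends up back in the Slave role automatically, which is exactly what does not happen after the on-fail=standby / failure-timeout cycle described above.<br>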

<br>Does anyone have any ideas on what would need to change in the Pacemaker/Corosync configuration for the node to come back online and for drbd_qs to go from the Stopped state back to Slave?<br><br>Thanks,<br><br>-- <br>Mark Steele<br>Director of Development<br>

Bering Media Inc.<br><br>