Hi Andrew,

I fixed it by removing on-fail=standby and using migration-threshold="1" instead. It now behaves as we expect it to.

Unfortunately I've since re-imaged the test boxes, so no dice for hb_report.
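For reference, the changed primitives now look roughly like this (typed from memory since the boxes are gone, so treat it as a sketch rather than the exact config):

  primitive qs_rabbitmq ocf:bering:rabbitmq \
    op start timeout=120s op stop timeout=300s \
    op monitor interval=60s \
    meta migration-threshold="1" failure-timeout="60"

With migration-threshold="1" a single monitor failure bans the resource from that node (until the failure-timeout expires) instead of putting the whole node into standby, which is the behaviour we were after.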
If you do want to try to reproduce, I was running on Gentoo and installed everything from source:

Corosync 1.1.2
openais 1.1.0
Cluster-Resource-Agents-4ac8bf7a64fe
Pacemaker-1-0-6695fd350a64
Reusable-Cluster-Components-2905a7843039
DRBD 8.3.6

Cheers,

Mark


On Fri, Jan 29, 2010 at 3:32 AM, Andrew Beekhof <andrew@beekhof.net> wrote:
> On Wed, Jan 27, 2010 at 12:45 AM, Mark Steele <msteele@beringmedia.com> wrote:
>> Hi folks,
>>
>> I've got a pacemaker cluster setup as follows:
>>
>> # crm configure
>> property no-quorum-policy=ignore
>> property stonith-enabled="true"
>>
>> primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" \
>>   op monitor interval="15s" op stop timeout=300s op start timeout=300s
>>
>> ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1" \
>>   clone-max="2" clone-node-max="1" notify="true"
>>
>> primitive qs_fs ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/r0" \
>>   directory="/mnt/drbd1/" fstype="ext4" \
>>   options="barrier=0,noatime,nouser_xattr,data=writeback" \
>>   op monitor interval=30s OCF_CHECK_LEVEL=20 on-fail=standby \
>>   meta failure-timeout="60"
>>
>> primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155" nic="eth0:0" \
>>   op monitor interval=60s on-fail=standby meta failure-timeout="60"
>>
>> primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s \
>>   on-fail=standby meta failure-timeout="60"
>>
>> primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s \
>>   op stop timeout=300s op monitor interval=60s on-fail=standby \
>>   meta failure-timeout="60"
>>
>> group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq
>>
>> primitive qs1-stonith stonith:external/ipmi params hostname=qs1 \
>>   ipaddr=172.16.10.134 userid=root passwd=blah interface=lan \
>>   op start interval=0s timeout=20s requires=nothing \
>>   op monitor interval=600s timeout=20s requires=nothing
>>
>> primitive qs2-stonith stonith:external/ipmi params hostname=qs2 \
>>   ipaddr=172.16.10.133 userid=root passwd=blah interface=lan \
>>   op start interval=0s timeout=20s requires=nothing \
>>   op monitor interval=600s timeout=20s requires=nothing
>>
>> location l-st-qs1 qs1-stonith -inf: qs1
>> location l-st-qs2 qs2-stonith -inf: qs2
>> colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
>> order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
>>
>> order ip_after_fs inf: qs_fs qs_ip
>> order apache_after_ip inf: qs_ip qs_apache2
>> order rabbitmq_after_ip inf: qs_ip qs_rabbitmq
>>
>> verify
>> commit
>>
>> Under normal operations, this is what I expect the cluster to look like:
>>
>> # crm status
>> ============
>> Last updated: Tue Jan 26 11:55:50 2010
>> Current DC: qs1 - partition with quorum
>> 2 Nodes configured, 2 expected votes
>> 4 Resources configured.
>> ============
>>
>> Online: [ qs1 qs2 ]
>>
>>  Master/Slave Set: ms_drbd_qs
>>      Masters: [ qs1 ]
>>      Slaves: [ qs2 ]
>>  qs1-stonith   (stonith:external/ipmi):      Started qs2
>>  qs2-stonith   (stonith:external/ipmi):      Started qs1
>>  Resource Group: queryserver
>>      qs_fs        (ocf::heartbeat:Filesystem):  Started qs1
>>      qs_ip        (ocf::heartbeat:IPaddr2):     Started qs1
>>      qs_apache2   (ocf::bering:apache2):        Started qs1
>>      qs_rabbitmq  (ocf::bering:rabbitmq):       Started qs1
>>
>> If, however, a failure occurs, my configuration instructs pacemaker to put
>> the node on which the failure occurs into standby for 60 seconds:
>>
>> # killall -9 rabbit
>>
>> # crm status
>> ============
>> Last updated: Tue Jan 26 11:55:56 2010
>> Current DC: qs1 - partition with quorum
>> 2 Nodes configured, 2 expected votes
>> 4 Resources configured.
>> ============
>>
>> Node qs1: standby (on-fail)
>> Online: [ qs2 ]
>>
>>  Master/Slave Set: ms_drbd_qs
>>      Masters: [ qs1 ]
>>      Slaves: [ qs2 ]
>>  qs1-stonith   (stonith:external/ipmi):      Started qs2
>>  qs2-stonith   (stonith:external/ipmi):      Started qs1
>>  Resource Group: queryserver
>>      qs_fs        (ocf::heartbeat:Filesystem):  Started qs1
>>      qs_ip        (ocf::heartbeat:IPaddr2):     Started qs1
>>      qs_apache2   (ocf::bering:apache2):        Started qs1
>>      qs_rabbitmq  (ocf::bering:rabbitmq):       Started qs1 FAILED
>>
>> Failed actions:
>>     qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete):
>>     not running
>
> This looks like the first problem.
> There shouldn't be anything running on qs1 at this point.
>
> Can you attach a hb_report archive for the interval covered by this test?
> That will contain everything I need to diagnose the problem.
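>
> Something along these lines should produce it (from memory, so see the
> hb_report man page if the options differ on your build); the -f/-t times
> just need to bracket the test window, and the last argument is whatever
> destination name you like:
>
>   hb_report -f "2010-01-26 11:50" -t "2010-01-26 12:10" /tmp/qs-failover
>
> That should leave a /tmp/qs-failover.tar.bz2 you can attach.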
>
>> After the 60 second timeout, I would expect the node to come back online
>> and DRBD replication to resume; alas, this is what I get:
>>
>> # crm status
>> ============
>> Last updated: Tue Jan 26 11:58:36 2010
>> Current DC: qs1 - partition with quorum
>> 2 Nodes configured, 2 expected votes
>> 4 Resources configured.
>> ============
>>
>> Online: [ qs1 qs2 ]
>>
>>  Master/Slave Set: ms_drbd_qs
>>      Masters: [ qs2 ]
>>      Stopped: [ drbd_qs:0 ]
>>  qs1-stonith   (stonith:external/ipmi):      Started qs2
>>  Resource Group: queryserver
>>      qs_fs        (ocf::heartbeat:Filesystem):  Started qs2
>>      qs_ip        (ocf::heartbeat:IPaddr2):     Started qs2
>>      qs_apache2   (ocf::bering:apache2):        Started qs2
>>      qs_rabbitmq  (ocf::bering:rabbitmq):       Started qs2
>>
>> DRBD fail-over works properly under certain conditions (e.g. if I stop
>> corosync, power-cycle the box, or do a manual standby fail-over), but the
>> case described above (one of the monitored services gets killed) leads to
>> the undesirable state of DRBD no longer replicating.
>>
>> Does anyone have some ideas on what would need to be changed in the
>> pacemaker/corosync configuration for the node to come back online and go
>> from stopped to slave state?
>
> You could be hitting a bug - I'll know more when you attach the hb_report.
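>
> In the meantime, if you just need the slave back while we look into it,
> clearing the failed state by hand and checking the fail counts is worth a
> try (a guess at a workaround, not a fix):
>
>   crm_mon -1 -f                    # -f shows the per-resource fail counts
>   crm resource cleanup qs_rabbitmq
>   crm resource cleanup drbd_qs
>
> If drbd_qs:0 still won't start on qs1 after a cleanup, that points even
> more strongly at a bug, and the hb_report becomes essential.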