Hi Andrew,

I fixed it by removing the on-fail=standby and using
migration-threshold="1". It now behaves as we expect it to.

Unfortunately I've since re-imaged the test boxes, so no dice for
hb_report. If you do want to try to reproduce, I was running on Gentoo
and installed everything from source:

Corosync 1.1.2
openais 1.1.0
Cluster-Resource-Agents-4ac8bf7a64fe
Pacemaker-1-0-6695fd350a64
Reusable-Cluster-Components-2905a7843039
DRBD 8.3.6

Cheers,

Mark

On Fri, Jan 29, 2010 at 3:32 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

> On Wed, Jan 27, 2010 at 12:45 AM, Mark Steele <msteele at beringmedia.com>
> wrote:
> > Hi folks,
> >
> > I've got a pacemaker cluster setup as follows:
> >
> > # crm configure
> > property no-quorum-policy=ignore
> > property stonith-enabled="true"
> >
> > primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" op monitor
> >   interval="15s" op stop timeout=300s op start timeout=300s
> >
> > ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1" clone-max="2"
> >   clone-node-max="1" notify="true"
> >
> > primitive qs_fs ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/r0"
> >   directory="/mnt/drbd1/" fstype="ext4"
> >   options="barrier=0,noatime,nouser_xattr,data=writeback" op monitor
> >   interval=30s OCF_CHECK_LEVEL=20 on-fail=standby meta failure-timeout="60"
> >
> > primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155" nic="eth0:0"
> >   op monitor interval=60s on-fail=standby meta failure-timeout="60"
> >
> > primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s
> >   on-fail=standby meta failure-timeout="60"
> >
> > primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s op stop
> >   timeout=300s op monitor interval=60s on-fail=standby meta
> >   failure-timeout="60"
> >
> > group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq
> >
> > primitive qs1-stonith stonith:external/ipmi params hostname=qs1
> >   ipaddr=172.16.10.134 userid=root passwd=blah interface=lan op start
> >   interval=0s timeout=20s requires=nothing op monitor interval=600s
> >   timeout=20s requires=nothing
> >
> > primitive qs2-stonith stonith:external/ipmi params hostname=qs2
> >   ipaddr=172.16.10.133 userid=root passwd=blah interface=lan op start
> >   interval=0s timeout=20s requires=nothing op monitor interval=600s
> >   timeout=20s requires=nothing
> >
> > location l-st-qs1 qs1-stonith -inf: qs1
> > location l-st-qs2 qs2-stonith -inf: qs2
> >
> > colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
> >
> > order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
> > order ip_after_fs inf: qs_fs qs_ip
> > order apache_after_ip inf: qs_ip qs_apache2
> > order rabbitmq_after_ip inf: qs_ip qs_rabbitmq
> >
> > verify
> > commit
> >
> > Under normal operations, this is what I expect the cluster to look like:
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:55:50 2010
> > Current DC: qs1 - partition with quorum
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Online: [ qs1 qs2 ]
> >
> > Master/Slave Set: ms_drbd_qs
> >     Masters: [ qs1 ]
> >     Slaves: [ qs2 ]
> > qs1-stonith (stonith:external/ipmi): Started qs2
> > qs2-stonith (stonith:external/ipmi): Started qs1
> > Resource Group: queryserver
> >     qs_fs (ocf::heartbeat:Filesystem): Started qs1
> >     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
> >     qs_apache2 (ocf::bering:apache2): Started qs1
> >     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1
> >
> > If however a failure occurs, my configuration instructs pacemaker to put
> > the node on which the failure occurs into standby for 60 seconds:
> >
> > # killall -9 rabbit
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:55:56 2010
> > Current DC: qs1 - partition with quorum
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Node qs1: standby (on-fail)
> > Online: [ qs2 ]
> >
> > Master/Slave Set: ms_drbd_qs
> >     Masters: [ qs1 ]
> >     Slaves: [ qs2 ]
> > qs1-stonith (stonith:external/ipmi): Started qs2
> > qs2-stonith (stonith:external/ipmi): Started qs1
> > Resource Group: queryserver
> >     qs_fs (ocf::heartbeat:Filesystem): Started qs1
> >     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
> >     qs_apache2 (ocf::bering:apache2): Started qs1
> >     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1 FAILED
> >
> > Failed actions:
> >     qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete):
> >     not running
>
> This looks like the first problem.
> There shouldn't be anything running on qs1 at this point.
>
> Can you attach an hb_report archive for the interval covered by this test?
> That will contain everything I need to diagnose the problem.
>
> > After the 60 second timeout, I would expect the node to come back online
> > and DRBD replication to resume; alas, this is what I get:
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:58:36 2010
> > Current DC: qs1 - partition with quorum
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Online: [ qs1 qs2 ]
> >
> > Master/Slave Set: ms_drbd_qs
> >     Masters: [ qs2 ]
> >     Stopped: [ drbd_qs:0 ]
> > qs1-stonith (stonith:external/ipmi): Started qs2
> > Resource Group: queryserver
> >     qs_fs (ocf::heartbeat:Filesystem): Started qs2
> >     qs_ip (ocf::heartbeat:IPaddr2): Started qs2
> >     qs_apache2 (ocf::bering:apache2): Started qs2
> >     qs_rabbitmq (ocf::bering:rabbitmq): Started qs2
> >
> > DRBD fail-over works properly under certain conditions (e.g. if I stop
> > corosync, power-cycle the box, or do a manual standby fail-over), but the
> > case described above (one of the monitored services gets killed) leads to
> > the undesirable state of DRBD no longer replicating.
> >
> > Does anyone have some ideas on what would need to be changed in the
> > pacemaker/corosync configuration for the node to come back online and go
> > from stopped to slave state?
>
> You could be hitting a bug - I'll know more when you attach the hb_report.
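For anyone hitting this thread later: the change Mark describes at the top (dropping on-fail=standby and relying on migration-threshold="1" so a single monitor failure moves the resource instead of putting the whole node into standby) would look roughly like this for the rabbitmq primitive. This is a sketch of the idea, not the exact config he committed; the other ops are carried over unchanged from his original post:

```
# Sketch of the described fix: no on-fail=standby on the monitor op;
# migration-threshold="1" makes the group fail over after one failure,
# and failure-timeout="60" lets the failcount expire so the resource
# can return to the node later.
primitive qs_rabbitmq ocf:bering:rabbitmq \
        op start timeout=120s \
        op stop timeout=300s \
        op monitor interval=60s \
        meta migration-threshold="1" failure-timeout="60"
```

Presumably the same edit would apply to the other primitives (qs_fs, qs_ip, qs_apache2) that carried on-fail=standby.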