Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi folks,

I've got a Pacemaker cluster set up as follows:

# crm configure
property no-quorum-policy=ignore
property stonith-enabled="true"
primitive drbd_qs ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="15s" op stop timeout=300s op start timeout=300s
ms ms_drbd_qs drbd_qs \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
primitive qs_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/r0" directory="/mnt/drbd1/" fstype="ext4" \
        options="barrier=0,noatime,nouser_xattr,data=writeback" \
    op monitor interval=30s OCF_CHECK_LEVEL=20 on-fail=standby \
    meta failure-timeout="60"
primitive qs_ip ocf:heartbeat:IPaddr2 \
    params ip="172.16.10.155" nic="eth0:0" \
    op monitor interval=60s on-fail=standby \
    meta failure-timeout="60"
primitive qs_apache2 ocf:bering:apache2 \
    op monitor interval=30s on-fail=standby \
    meta failure-timeout="60"
primitive qs_rabbitmq ocf:bering:rabbitmq \
    op start timeout=120s op stop timeout=300s \
    op monitor interval=60s on-fail=standby \
    meta failure-timeout="60"
group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq
primitive qs1-stonith stonith:external/ipmi \
    params hostname=qs1 ipaddr=172.16.10.134 userid=root passwd=blah interface=lan \
    op start interval=0s timeout=20s requires=nothing \
    op monitor interval=600s timeout=20s requires=nothing
primitive qs2-stonith stonith:external/ipmi \
    params hostname=qs2 ipaddr=172.16.10.133 userid=root passwd=blah interface=lan \
    op start interval=0s timeout=20s requires=nothing \
    op monitor interval=600s timeout=20s requires=nothing
location l-st-qs1 qs1-stonith -inf: qs1
location l-st-qs2 qs2-stonith -inf: qs2
colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
order ip_after_fs inf: qs_fs qs_ip
order apache_after_ip inf: qs_ip qs_apache2
order rabbitmq_after_ip inf: qs_ip qs_rabbitmq
verify
commit

Under normal operations, this is what I expect the cluster to look like:

# crm status
============
Last updated: Tue Jan 26 11:55:50 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ qs1 qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs1 ]
     Slaves: [ qs2 ]
 qs1-stonith  (stonith:external/ipmi):  Started qs2
 qs2-stonith  (stonith:external/ipmi):  Started qs1
 Resource Group: queryserver
     qs_fs        (ocf::heartbeat:Filesystem):  Started qs1
     qs_ip        (ocf::heartbeat:IPaddr2):     Started qs1
     qs_apache2   (ocf::bering:apache2):        Started qs1
     qs_rabbitmq  (ocf::bering:rabbitmq):       Started qs1
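As an aside, DRBD's own view of the replication can be checked with the usual tools (just a sketch for context; exact output varies by version):

# on either node: the connection state should be Connected, with
# qs1 Primary and qs2 Secondary in the layout above
cat /proc/drbd
drbdadm cstate r0
drbdadm role r0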
If, however, a failure occurs, my configuration instructs Pacemaker to put the node on which the failure occurred into standby for 60 seconds:

# killall -9 rabbit
# crm status
============
Last updated: Tue Jan 26 11:55:56 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Node qs1: standby (on-fail)
Online: [ qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs1 ]
     Slaves: [ qs2 ]
 qs1-stonith  (stonith:external/ipmi):  Started qs2
 qs2-stonith  (stonith:external/ipmi):  Started qs1
 Resource Group: queryserver
     qs_fs        (ocf::heartbeat:Filesystem):  Started qs1
     qs_ip        (ocf::heartbeat:IPaddr2):     Started qs1
     qs_apache2   (ocf::bering:apache2):        Started qs1
     qs_rabbitmq  (ocf::bering:rabbitmq):       Started qs1 FAILED

Failed actions:
    qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete): not running

After the 60-second timeout, I would expect the node to come back online and DRBD replication to resume; alas, this is what I get:

# crm status
============
Last updated: Tue Jan 26 11:58:36 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ qs1 qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs2 ]
     Stopped: [ drbd_qs:0 ]
 qs1-stonith  (stonith:external/ipmi):  Started qs2
 Resource Group: queryserver
     qs_fs        (ocf::heartbeat:Filesystem):  Started qs2
     qs_ip        (ocf::heartbeat:IPaddr2):     Started qs2
     qs_apache2   (ocf::bering:apache2):        Started qs2
     qs_rabbitmq  (ocf::bering:rabbitmq):       Started qs2

DRBD fail-over works properly under certain conditions (e.g. if I stop corosync, power-cycle the box, or fail over manually via standby); however, the case described above (one of the monitored services getting killed) leads to the undesirable state of DRBD no longer replicating.

Does anyone have ideas on what would need to be changed in the Pacemaker/Corosync configuration for the node to come back online and for the DRBD resource to go from Stopped to Slave state?

Thanks,

--
Mark Steele
Director of development
Bering Media Inc.
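P.S. In case it clarifies what I'm after: the steps below (a sketch only, using standard crm shell commands) are roughly what I would expect the cluster to do for me automatically once failure-timeout expires; manual intervention like this is exactly what I'm trying to avoid:

# clear the recorded monitor failure for rabbitmq on qs1
crm resource cleanup qs_rabbitmq qs1

# confirm the failcount has been reset
crm resource failcount qs_rabbitmq show qs1

# my understanding (possibly wrong) is that failure-timeout is only
# re-evaluated on a cluster recheck, so the recheck interval may matter too
crm configure property cluster-recheck-interval="1min"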