Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed, Jan 27, 2010 at 12:45 AM, Mark Steele <msteele at beringmedia.com> wrote:
> Hi folks,
>
> I've got a pacemaker cluster setup as follows:
>
> # crm configure
> property no-quorum-policy=ignore
> property stonith-enabled="true"
>
> primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" op monitor interval="15s" op stop timeout=300s op start timeout=300s
>
> ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>
> primitive qs_fs ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/r0" directory="/mnt/drbd1/" fstype="ext4" options="barrier=0,noatime,nouser_xattr,data=writeback" op monitor interval=30s OCF_CHECK_LEVEL=20 on-fail=standby meta failure-timeout="60"
>
> primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155" nic="eth0:0" op monitor interval=60s on-fail=standby meta failure-timeout="60"
>
> primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s on-fail=standby meta failure-timeout="60"
>
> primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s op stop timeout=300s op monitor interval=60s on-fail=standby meta failure-timeout="60"
>
> group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq
>
> primitive qs1-stonith stonith:external/ipmi params hostname=qs1 ipaddr=172.16.10.134 userid=root passwd=blah interface=lan op start interval=0s timeout=20s requires=nothing op monitor interval=600s timeout=20s requires=nothing
>
> primitive qs2-stonith stonith:external/ipmi params hostname=qs2 ipaddr=172.16.10.133 userid=root passwd=blah interface=lan op start interval=0s timeout=20s requires=nothing op monitor interval=600s timeout=20s requires=nothing
>
> location l-st-qs1 qs1-stonith -inf: qs1
> location l-st-qs2 qs2-stonith -inf: qs2
> colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
> order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
> order ip_after_fs inf: qs_fs qs_ip
> order apache_after_ip inf: qs_ip qs_apache2
> order rabbitmq_after_ip inf: qs_ip qs_rabbitmq
>
> verify
> commit
>
> Under normal operations, this is what I expect the cluster to look like:
>
> # crm status
> ============
> Last updated: Tue Jan 26 11:55:50 2010
> Current DC: qs1 - partition with quorum
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ qs1 qs2 ]
>
> Master/Slave Set: ms_drbd_qs
>     Masters: [ qs1 ]
>     Slaves: [ qs2 ]
> qs1-stonith (stonith:external/ipmi): Started qs2
> qs2-stonith (stonith:external/ipmi): Started qs1
> Resource Group: queryserver
>     qs_fs (ocf::heartbeat:Filesystem): Started qs1
>     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
>     qs_apache2 (ocf::bering:apache2): Started qs1
>     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1
>
> If however a failure occurs, my configuration instructs pacemaker to put the node in which the failure occurs into standby for 60 seconds:
>
> # killall -9 rabbit
>
> # crm status
> ============
> Last updated: Tue Jan 26 11:55:56 2010
> Current DC: qs1 - partition with quorum
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Node qs1: standby (on-fail)
> Online: [ qs2 ]
>
> Master/Slave Set: ms_drbd_qs
>     Masters: [ qs1 ]
>     Slaves: [ qs2 ]
> qs1-stonith (stonith:external/ipmi): Started qs2
> qs2-stonith (stonith:external/ipmi): Started qs1
> Resource Group: queryserver
>     qs_fs (ocf::heartbeat:Filesystem): Started qs1
>     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
>     qs_apache2 (ocf::bering:apache2): Started qs1
>     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1 FAILED
>
> Failed actions:
>     qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete): not running

This looks like the first problem. There shouldn't be anything running on qs1 at this point.

Can you attach a hb_report archive for the interval covered by this test? That will contain everything I need to diagnose the problem.
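Something like the following should cover the window of the test above (the from/to timestamps and the destination name are only placeholders based on your crm status output - adjust them to when you actually ran the test):

    # hb_report -f "2010/01/26 11:50" -t "2010/01/26 12:05" /tmp/qs-failover

That should leave an archive (e.g. /tmp/qs-failover.tar.bz2) with the logs, PE inputs and CIB from both nodes.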
> After the 60 second timeout, I would expect the node to come back online and DRBD replication to resume; alas, this is what I get:
>
> # crm status
> ============
> Last updated: Tue Jan 26 11:58:36 2010
> Current DC: qs1 - partition with quorum
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ qs1 qs2 ]
>
> Master/Slave Set: ms_drbd_qs
>     Masters: [ qs2 ]
>     Stopped: [ drbd_qs:0 ]
> qs1-stonith (stonith:external/ipmi): Started qs2
> Resource Group: queryserver
>     qs_fs (ocf::heartbeat:Filesystem): Started qs2
>     qs_ip (ocf::heartbeat:IPaddr2): Started qs2
>     qs_apache2 (ocf::bering:apache2): Started qs2
>     qs_rabbitmq (ocf::bering:rabbitmq): Started qs2
>
> DRBD fail-over works properly under certain conditions (e.g. if I stop corosync, power-cycle the box, or do a manual standby fail-over), but the case described above (one of the monitored services getting killed) leads to the undesirable state of DRBD no longer replicating.
>
> Does anyone have some ideas on what would need to be changed in the pacemaker/corosync configuration for the node to come back online and go from stopped to slave state?

You could be hitting a bug - I'll know more when you attach the hb_report.
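Until then, you should be able to recover by hand. A rough sketch, assuming the resource and node names from your configuration above - clearing the failed monitor is what normally lets the node leave the on-fail standby:

    # crm_mon -1 -f                          # one-shot status, including fail counts
    # crm resource cleanup qs_rabbitmq qs1   # clear the failed monitor on qs1
    # crm node online qs1                    # only if qs1 is still listed as standby

Once the failure is cleared, the policy engine should restart drbd_qs:0 on qs1 and bring it back up as slave.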