[DRBD-user] DRBD+pacemaker slave not coming back online after certain kinds of fail-over

Mark Steele msteele at beringmedia.com
Wed Jan 27 00:45:10 CET 2010

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi folks,

I've got a pacemaker cluster setup as follows:

# crm configure
property no-quorum-policy=ignore
property stonith-enabled="true"

primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" \
        op monitor interval="15s" op stop timeout=300s op start timeout=300s

ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"

primitive qs_fs ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/r0" directory="/mnt/drbd1/" fstype="ext4" \
        options="barrier=0,noatime,nouser_xattr,data=writeback" \
        op monitor interval=30s OCF_CHECK_LEVEL=20 on-fail=standby \
        meta failure-timeout="60"

primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155" nic="eth0:0" \
        op monitor interval=60s on-fail=standby meta failure-timeout="60"

primitive qs_apache2 ocf:bering:apache2 \
        op monitor interval=30s on-fail=standby meta failure-timeout="60"

primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s \
        op stop timeout=300s op monitor interval=60s on-fail=standby \
        meta failure-timeout="60"

group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq

primitive qs1-stonith stonith:external/ipmi \
        params hostname=qs1 ipaddr=172.16.10.134 userid=root passwd=blah interface=lan \
        op start interval=0s timeout=20s requires=nothing \
        op monitor interval=600s timeout=20s requires=nothing

primitive qs2-stonith stonith:external/ipmi \
        params hostname=qs2 ipaddr=172.16.10.133 userid=root passwd=blah interface=lan \
        op start interval=0s timeout=20s requires=nothing \
        op monitor interval=600s timeout=20s requires=nothing

location l-st-qs1 qs1-stonith -inf: qs1
location l-st-qs2 qs2-stonith -inf: qs2

colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start

order ip_after_fs inf: qs_fs qs_ip
order apache_after_ip inf: qs_ip qs_apache2
order rabbitmq_after_ip inf: qs_ip qs_rabbitmq

verify
commit
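
As a sanity check after the commit, the DRBD side can be verified directly
with the standard drbdadm tooling (on a healthy pair I'd expect
Primary/Secondary roles and a Connected link):

# drbdadm role r0      # expect Primary/Secondary on qs1, Secondary/Primary on qs2
# drbdadm cstate r0    # expect Connected while replication is healthy
# cat /proc/drbd       # full kernel-side view of the resource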

Under normal operations, this is what I expect the cluster to look like:


# crm status
============
Last updated: Tue Jan 26 11:55:50 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ qs1 qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs1 ]
     Slaves: [ qs2 ]
 qs1-stonith    (stonith:external/ipmi):        Started qs2
 qs2-stonith    (stonith:external/ipmi):        Started qs1
 Resource Group: queryserver
     qs_fs      (ocf::heartbeat:Filesystem):    Started qs1
     qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1
     qs_apache2 (ocf::bering:apache2):  Started qs1
     qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1

If, however, a failure occurs, my configuration instructs Pacemaker to
put the node on which the failure occurred into standby for 60 seconds:


# killall -9 rabbit

# crm status
============
Last updated: Tue Jan 26 11:55:56 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Node qs1: standby (on-fail)
Online: [ qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs1 ]
     Slaves: [ qs2 ]
 qs1-stonith    (stonith:external/ipmi):        Started qs2
 qs2-stonith    (stonith:external/ipmi):        Started qs1
 Resource Group: queryserver
     qs_fs      (ocf::heartbeat:Filesystem):    Started qs1
     qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1
     qs_apache2 (ocf::bering:apache2):  Started qs1
     qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1 FAILED

Failed actions:
    qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete): not running
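
For reference, the fail count that drives the failure-timeout can be
inspected directly; if I have the stock crm shell syntax right, it is:

# crm resource failcount qs_rabbitmq show qs1   # per-node fail count for the resource
# crm_mon -1 -f                                 # one-shot status including fail counts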

After the 60-second timeout, I would expect the node to come back
online and DRBD replication to resume; alas, this is what I get:


# crm status
============
Last updated: Tue Jan 26 11:58:36 2010
Current DC: qs1 - partition with quorum
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ qs1 qs2 ]

 Master/Slave Set: ms_drbd_qs
     Masters: [ qs2 ]
     Stopped: [ drbd_qs:0 ]
 qs1-stonith    (stonith:external/ipmi):        Started qs2
 Resource Group: queryserver
     qs_fs      (ocf::heartbeat:Filesystem):    Started qs2
     qs_ip      (ocf::heartbeat:IPaddr2):       Started qs2
     qs_apache2 (ocf::bering:apache2):  Started qs2
     qs_rabbitmq        (ocf::bering:rabbitmq): Started qs2
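
I assume a manual cleanup would clear the remembered failure and let
drbd_qs:0 be restarted as Slave, something like the sketch below, but I'd
obviously prefer the cluster to recover on its own once the
failure-timeout expires:

# crm resource cleanup ms_drbd_qs   # clear the failed state for the master/slave set
# crm node online qs1               # in case qs1 is still being held in standby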


DRBD fail-over works properly under certain conditions (e.g. if I stop
corosync, power-cycle the box, or perform a manual standby fail-over);
however, the case described above (one of the monitored services being
killed) leads to the undesirable state of DRBD no longer replicating.
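
By a manual standby fail-over I mean simply the following, which does
recover cleanly:

# crm node standby qs1   # resources migrate to qs2; qs1 is demoted to Secondary
# crm node online qs1    # qs1 rejoins and comes back as DRBD Slave, as expected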

Does anyone have ideas about what needs to change in the
Pacemaker/Corosync configuration for the node to come back online and for
the DRBD resource to go from Stopped to Slave?

Thanks,

-- 
Mark Steele
Director of development
Bering Media Inc.