[DRBD-user] [Openais] DRBD+pacemaker slave not coming back online after certain kinds of fail-over

Andrew Beekhof andrew at beekhof.net
Fri Jan 29 09:32:05 CET 2010


On Wed, Jan 27, 2010 at 12:45 AM, Mark Steele <msteele at beringmedia.com> wrote:
> Hi folks,
>
> I've got a pacemaker cluster setup as follows:
>
> # crm configure
> property no-quorum-policy=ignore
> property stonith-enabled="true"
>
> primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" op monitor
> interval="15s" op stop timeout=300s op start timeout=300s
>
>
> ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
>
> primitive qs_fs ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/r0"
> directory="/mnt/drbd1/" fstype="ext4"
> options="barrier=0,noatime,nouser_xattr,data=writeback" op monitor
> interval=30s OCF_CHECK_LEVEL=20 on-fail=standby meta failure-timeout="60"
>
>
> primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155" nic="eth0:0"
> op monitor interval=60s on-fail=standby meta failure-timeout="60"
> primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s
> on-fail=standby meta failure-timeout="60"
>
>
> primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s op stop
> timeout=300s op monitor interval=60s on-fail=standby meta
> failure-timeout="60"
> group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq
>
> primitive qs1-stonith stonith:external/ipmi params hostname=qs1
> ipaddr=172.16.10.134 userid=root passwd=blah interface=lan op start
> interval=0s timeout=20s requires=nothing op monitor interval=600s
> timeout=20s requires=nothing
>
>
> primitive qs2-stonith stonith:external/ipmi params hostname=qs2
> ipaddr=172.16.10.133 userid=root passwd=blah interface=lan op start
> interval=0s timeout=20s requires=nothing op monitor interval=600s
> timeout=20s requires=nothing
>
> location l-st-qs1 qs1-stonith -inf: qs1
> location l-st-qs2 qs2-stonith -inf: qs2
> colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
>
> order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
>
>
> order ip_after_fs inf: qs_fs qs_ip
> order apache_after_ip inf: qs_ip qs_apache2
> order rabbitmq_after_ip inf: qs_ip qs_rabbitmq
>
> verify
> commit
>
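One note on the failure-timeout="60" settings above: Pacemaker only re-evaluates expired failures when it computes a new transition, which by default happens every cluster-recheck-interval (15 minutes). A minimal sketch of lowering that interval so a 60-second failure-timeout is noticed promptly (this property is an assumption, not part of the configuration quoted above):

    # hypothetical addition, not in the original configuration:
    # force the policy engine to re-run often enough for
    # failure-timeout="60" to expire roughly on schedule
    crm configure property cluster-recheck-interval="60s"
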
> Under normal operations, this is what I expect the cluster to look like:
>
> # crm status
> ============
> Last updated: Tue Jan 26 11:55:50 2010
> Current DC: qs1 - partition with quorum
>
>
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ qs1 qs2 ]
>
>
>  Master/Slave Set: ms_drbd_qs
>
>      Masters: [ qs1 ]
>      Slaves: [ qs2 ]
>  qs1-stonith    (stonith:external/ipmi):        Started qs2
>  qs2-stonith    (stonith:external/ipmi):        Started qs1
>  Resource Group: queryserver
>
>
>      qs_fs      (ocf::heartbeat:Filesystem):    Started qs1
>      qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1
>      qs_apache2 (ocf::bering:apache2):  Started qs1
>      qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1
>
> If, however, a failure occurs, my configuration instructs pacemaker to put the
> node on which the failure occurred into standby for 60 seconds:
>
> # killall -9 rabbit
>
> # crm status
> ============
> Last updated: Tue Jan 26 11:55:56 2010
> Current DC: qs1 - partition with quorum
>
>
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Node qs1: standby (on-fail)
>
>
> Online: [ qs2 ]
>
>  Master/Slave Set: ms_drbd_qs
>      Masters: [ qs1 ]
>      Slaves: [ qs2 ]
>  qs1-stonith    (stonith:external/ipmi):        Started qs2
>  qs2-stonith    (stonith:external/ipmi):        Started qs1
>
>
>  Resource Group: queryserver
>      qs_fs      (ocf::heartbeat:Filesystem):    Started qs1
>      qs_ip      (ocf::heartbeat:IPaddr2):       Started qs1
>      qs_apache2 (ocf::bering:apache2):  Started qs1
>
>
>      qs_rabbitmq        (ocf::bering:rabbitmq): Started qs1 FAILED
>
> Failed actions:
>     qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete):
> not running

This looks like the first problem.
There shouldn't be anything running on qs1 at this point.

Can you attach a hb_report archive for the interval covered by this test?
That will contain everything I need to diagnose the problem.
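For example, a possible invocation on one of the cluster nodes, assuming the time window shown in the crm status output above (exact options may vary with the hb_report version):

    # times should bracket the failed monitor and the attempted recovery;
    # adjust to the actual test window
    hb_report -f "2010-01-26 11:50" -t "2010-01-26 12:10" /tmp/qs-failover
    # then attach the resulting /tmp/qs-failover.tar.bz2
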

> After the 60-second timeout, I would expect the node to come back online
> and DRBD replication to resume; alas, this is what I get:
>
> # crm status
> ============
> Last updated: Tue Jan 26 11:58:36 2010
> Current DC: qs1 - partition with quorum
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ qs1 qs2 ]
>
>  Master/Slave Set: ms_drbd_qs
>      Masters: [ qs2 ]
>      Stopped: [ drbd_qs:0 ]
>  qs1-stonith    (stonith:external/ipmi):        Started qs2
>  Resource Group: queryserver
>      qs_fs      (ocf::heartbeat:Filesystem):    Started qs2
>
>
>      qs_ip      (ocf::heartbeat:IPaddr2):       Started qs2
>      qs_apache2 (ocf::bering:apache2):  Started qs2
>      qs_rabbitmq        (ocf::bering:rabbitmq): Started qs2
>
> DRBD fail-over works properly under certain conditions (e.g. if I stop
> corosync, power-cycle the box, or do a manual standby fail-over), but the case
> described above (one of the monitored services being killed) leads to the
> undesirable state of DRBD no longer replicating.
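
Whether replication has actually resumed on qs1 can be confirmed directly on that node (assuming the resource name r0 from the configuration above):

    # on qs1, once it is back online
    cat /proc/drbd        # overall connection and disk state
    drbdadm cstate r0     # expect "Connected" (or "SyncTarget" while resyncing)
    drbdadm role r0       # expect "Secondary/Primary" once it rejoins as slave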
>
> Does anyone have some ideas on what would need to be changed in the
> pacemaker/corosync configuration for the node to come back online and go
> from stopped to slave state?

You could be hitting a bug - I'll know more when you attach the hb_report.
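
In the meantime, a manual recovery sketch (an assumption about a workable procedure, not a confirmed fix) is to clear the failure record and bring the node back by hand, then let Pacemaker restart the DRBD slave:

    # on either node, once qs1 itself is healthy again
    crm resource cleanup qs_rabbitmq qs1   # clear the failed monitor record
    crm node online qs1                    # end the on-fail standby, if still active
    crm_mon -1                             # watch for drbd_qs:0 to return as Slave on qs1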


