Hi Andrew,

I fixed it by removing the on-fail=standby and using
migration-threshold="1". It now behaves as we expect it to.

Unfortunately I've since re-imaged the test boxes, so no dice for
hb_report. If you do want to try to reproduce, I was running on Gentoo
and installed everything from source:

Corosync 1.1.2
openais 1.1.0
Cluster-Resource-Agents-4ac8bf7a64fe
Pacemaker-1-0-6695fd350a64
Reusable-Cluster-Components-2905a7843039
DRBD 8.3.6

Cheers,

Mark

On Fri, Jan 29, 2010 at 3:32 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

> On Wed, Jan 27, 2010 at 12:45 AM, Mark Steele <msteele at beringmedia.com>
> wrote:
> > Hi folks,
> >
> > I've got a pacemaker cluster setup as follows:
> >
> > # crm configure
> > property no-quorum-policy=ignore
> > property stonith-enabled="true"
> >
> > primitive drbd_qs ocf:linbit:drbd params drbd_resource="r0" op monitor
> >   interval="15s" op stop timeout=300s op start timeout=300s
> >
> > ms ms_drbd_qs drbd_qs meta master-max="1" master-node-max="1" clone-max="2"
> >   clone-node-max="1" notify="true"
> >
> > primitive qs_fs ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/r0"
> >   directory="/mnt/drbd1/" fstype="ext4"
> >   options="barrier=0,noatime,nouser_xattr,data=writeback" op monitor
> >   interval=30s OCF_CHECK_LEVEL=20 on-fail=standby meta failure-timeout="60"
> >
> > primitive qs_ip ocf:heartbeat:IPaddr2 params ip="172.16.10.155" nic="eth0:0"
> >   op monitor interval=60s on-fail=standby meta failure-timeout="60"
> >
> > primitive qs_apache2 ocf:bering:apache2 op monitor interval=30s
> >   on-fail=standby meta failure-timeout="60"
> >
> > primitive qs_rabbitmq ocf:bering:rabbitmq op start timeout=120s op stop
> >   timeout=300s op monitor interval=60s on-fail=standby meta
> >   failure-timeout="60"
> >
> > group queryserver qs_fs qs_ip qs_apache2 qs_rabbitmq
> >
> > primitive qs1-stonith stonith:external/ipmi params hostname=qs1
> >   ipaddr=172.16.10.134 userid=root passwd=blah interface=lan op start
> >   interval=0s timeout=20s requires=nothing op monitor interval=600s
> >   timeout=20s requires=nothing
> >
> > primitive qs2-stonith stonith:external/ipmi params hostname=qs2
> >   ipaddr=172.16.10.133 userid=root passwd=blah interface=lan op start
> >   interval=0s timeout=20s requires=nothing op monitor interval=600s
> >   timeout=20s requires=nothing
> >
> > location l-st-qs1 qs1-stonith -inf: qs1
> > location l-st-qs2 qs2-stonith -inf: qs2
> >
> > colocation queryserver_on_drbd inf: queryserver ms_drbd_qs:Master
> >
> > order queryserver_after_drbd inf: ms_drbd_qs:promote queryserver:start
> > order ip_after_fs inf: qs_fs qs_ip
> > order apache_after_ip inf: qs_ip qs_apache2
> > order rabbitmq_after_ip inf: qs_ip qs_rabbitmq
> >
> > verify
> > commit
> >
> > Under normal operations, this is what I expect the cluster to look like:
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:55:50 2010
> > Current DC: qs1 - partition with quorum
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Online: [ qs1 qs2 ]
> >
> > Master/Slave Set: ms_drbd_qs
> >     Masters: [ qs1 ]
> >     Slaves: [ qs2 ]
> > qs1-stonith (stonith:external/ipmi): Started qs2
> > qs2-stonith (stonith:external/ipmi): Started qs1
> > Resource Group: queryserver
> >     qs_fs (ocf::heartbeat:Filesystem): Started qs1
> >     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
> >     qs_apache2 (ocf::bering:apache2): Started qs1
> >     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1
> >
> > If however a failure occurs, my configuration instructs pacemaker to put
> > the node on which the failure occurs into standby for 60 seconds:
> >
> > # killall -9 rabbit
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:55:56 2010
> > Current DC: qs1 - partition with quorum
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Node qs1: standby (on-fail)
> > Online: [ qs2 ]
> >
> > Master/Slave Set: ms_drbd_qs
> >     Masters: [ qs1 ]
> >     Slaves: [ qs2 ]
> > qs1-stonith (stonith:external/ipmi): Started qs2
> > qs2-stonith (stonith:external/ipmi): Started qs1
> > Resource Group: queryserver
> >     qs_fs (ocf::heartbeat:Filesystem): Started qs1
> >     qs_ip (ocf::heartbeat:IPaddr2): Started qs1
> >     qs_apache2 (ocf::bering:apache2): Started qs1
> >     qs_rabbitmq (ocf::bering:rabbitmq): Started qs1 FAILED
> >
> > Failed actions:
> >     qs_rabbitmq_monitor_60000 (node=qs1, call=32, rc=7, status=complete):
> >     not running
>
> This looks like the first problem.
> There shouldn't be anything running on qs1 at this point.
>
> Can you attach an hb_report archive for the interval covered by this test?
> That will contain everything I need to diagnose the problem.
>
> > After the 60 second timeout, I would expect the node to come back online
> > and DRBD replication to resume; alas, this is what I get:
> >
> > # crm status
> > ============
> > Last updated: Tue Jan 26 11:58:36 2010
> > Current DC: qs1 - partition with quorum
> > 2 Nodes configured, 2 expected votes
> > 4 Resources configured.
> > ============
> >
> > Online: [ qs1 qs2 ]
> >
> > Master/Slave Set: ms_drbd_qs
> >     Masters: [ qs2 ]
> >     Stopped: [ drbd_qs:0 ]
> > qs1-stonith (stonith:external/ipmi): Started qs2
> > Resource Group: queryserver
> >     qs_fs (ocf::heartbeat:Filesystem): Started qs2
> >     qs_ip (ocf::heartbeat:IPaddr2): Started qs2
> >     qs_apache2 (ocf::bering:apache2): Started qs2
> >     qs_rabbitmq (ocf::bering:rabbitmq): Started qs2
> >
> > DRBD fail-over works properly under certain conditions (e.g. if I stop
> > corosync, power-cycle the box, or do a manual standby fail-over), but the
> > case described above (one of the monitored services gets killed) leads to
> > the undesirable state of DRBD no longer replicating.
> >
> > Does anyone have some ideas on what would need to be changed in the
> > pacemaker/corosync configuration for the node to come back online and go
> > from stopped to slave state?
>
> You could be hitting a bug - I'll know more when you attach the hb_report.
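For anyone hitting this thread later: the change Mark describes at the top (dropping on-fail=standby and relying on migration-threshold="1" so a single monitor failure moves the resource instead of putting the whole node into standby) would look roughly like this for the rabbitmq primitive. This is a sketch of the idea, not the exact config he committed; the other ops are carried over unchanged from his original post:

```
# Sketch of the described fix: no on-fail=standby on the monitor op;
# migration-threshold="1" makes the group fail over after one failure,
# and failure-timeout="60" lets the failcount expire so the resource
# can return to the node later.
primitive qs_rabbitmq ocf:bering:rabbitmq \
        op start timeout=120s \
        op stop timeout=300s \
        op monitor interval=60s \
        meta migration-threshold="1" failure-timeout="60"
```

Presumably the same edit would apply to the other primitives (qs_fs, qs_ip, qs_apache2) that carried on-fail=standby.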