Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Mon, Jun 16, 2014 at 05:18:07PM -0500, Andrew Martin wrote:
> Hi Lars,
>
> ----- Original Message -----
> > From: "Lars Ellenberg" <lars.ellenberg at linbit.com>
> > To: drbd-user at lists.linbit.com
> > Sent: Friday, June 13, 2014 9:46:12 AM
> > Subject: Re: [DRBD-user] DRBD resource fenced by crm-fence-peer.sh with exit code 5
> >
> > > During testing, I've tried shutting down the currently-active node. When
> > > doing so, the fence peer handler inserts the constraint correctly, but it
> > > exits with exit code 5:
> > > INFO peer is not reachable, my disk is UpToDate: placed constraint
> > > 'drbd-fence-by-handler-ms_drbd_drives'
> >
> > "Shutting down", as in how?
> > Do you first cut the replication link, while still being primary?
> > Well, that *of course* will prevent the other node from being promoted.
> > That's exactly what this is supposed to do if a Primary loses the
> > replication link.
>
> I'm issuing the "reboot" command on the currently-primary node. I would expect
> this to gracefully stop DRBD and transfer control over to the other node. I'd
> like to simulate an accidental reboot of the server, a hardware failure, and
> a kernel panic to verify that the other node can take over if the current
> primary fails under any of these conditions.

People expect a lot of things... sometimes they are right.
Don't assume too much, but double check what actually happens.
And then see where and why that diverges from your expectations.

> > > crm-fence-peer.sh exit codes:
> > > http://www.drbd.org/users-guide-8.3/s-fence-peer.html
> > >
> > > I can see this constraint in the CIB, however, the remaining (still
> > > secondary) node fails to promote.
> >
> > Yes. Because that constraint tells it to not become Master.
> >
> > > Moreover, when the original node is powered back on, it
> > > repeatedly attempts to remove the constraint by calling
> > > crm-unfence-peer.sh,
> >
> > Is that so.
> > I don't see why it would do that.
> > The crm unfence should be called only by the after-resync-target handler,
> > so you would need to have a resync, be sync target, and finish that
> > resync successfully.
>
> I actually have a wrapper script configured in /etc/drbd.conf which does some
> additional logging:
> fence-peer "/usr/local/bin/fence-peer crm-fence-peer";
> after-resync-target "/usr/local/bin/fence-peer crm-unfence-peer";
>
> In this wrapper script, I record the arguments and then call either
> crm-fence-peer.sh or crm-unfence-peer.sh:
> echo "Calling with $*" >> $LOG
>
> # fence the peer
> /usr/lib/drbd/$1.sh $@ >> $LOG 2>> $LOG
>
> I can tail the log and see these being printed frequently after restarting
> the primary node:
> Calling with crm-unfence-peer
> Calling with crm-unfence-peer
> Calling with crm-unfence-peer
> Calling with crm-unfence-peer
> Calling with crm-unfence-peer
> Calling with crm-unfence-peer
>
> I can update the script to include timestamps as well if that would be
> helpful.

Every time a DRBD SyncTarget becomes Connected UpToDate, this
after-resync-target handler will be called, once for each resource, in fact
once for each volume in each resource, even after "empty" resyncs, and even
if there is nothing to do: the module cannot know that your arbitrary handler
has nothing to do...

You should rather look into the other logs, both pacemaker and kernel, and
figure out why you get "so many" resyncs, e.g. if pacemaker is constantly
"recovering" from some "failure" by cycling through demote/stop/start...
and if so, why ...
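For instance, something along these lines can help spot whether pacemaker keeps
cycling the resource through demote/stop/start (log locations and message
patterns differ between distributions, so take this as a rough sketch to adapt,
not a recipe):

  # DRBD connection/resync state changes in the kernel log
  grep -i drbd /var/log/kern.log | grep -E 'conn\(|Sync(Source|Target)|UpToDate' | tail -n 100

  # pacemaker actions on the DRBD resource in the system/cluster log
  grep -Ei 'drbd.*(demote|stop|start|promote)' /var/log/syslog | tail -n 100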
Or ask LINBIT to figure it out for you ;-)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
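A side note on the wrapper discussed above: adding timestamps would make it
easier to correlate the handler calls with the pacemaker and kernel logs. A
minimal sketch of what such a wrapper could look like (the log path and the
argument handling are assumptions based on the snippet quoted above, not a
drop-in replacement):

  #!/bin/bash
  # assumed /usr/local/bin/fence-peer: log each invocation with a timestamp,
  # then delegate to the stock DRBD handler named by the first argument
  LOG=/var/log/fence-peer.log            # assumed log location

  handler="$1"; shift
  # drbdadm exports the resource name to its handlers as $DRBD_RESOURCE
  echo "$(date '+%F %T') calling $handler for ${DRBD_RESOURCE:-unknown}: $*" >> "$LOG"

  /usr/lib/drbd/"$handler".sh "$@" >> "$LOG" 2>&1
  rc=$?
  echo "$(date '+%F %T') $handler exited with $rc" >> "$LOG"

  # propagate the exit code: DRBD acts on the fence-peer handler's return value
  exit $rc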