[DRBD-user] New fence handler for RHCS; rhcs_fence

Lars Ellenberg lars.ellenberg at linbit.com
Sun Jan 15 14:18:21 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Sun, Jan 15, 2012 at 12:48:05AM -0500, Digimer wrote:
> Hi all,
> 
>   I spoke to Lon, the author of obliterate-peer.sh, about
> updating/rewriting his script to add a few features. From that, I
> decided to use perl as it's the language I am most comfortable with, so
> I did a full rewrite.
> 
>   I changed the name to 'rhcs_fence', and the source is available here:
> 
> https://github.com/digimer/rhcs_fence

Some comments on where I think the script's logic
is still incomplete:

First, if you manage to get a simultaneous cluster crash,
and then only one node comes back, you'll be offline,
and need admin intervention to get online again.
There is no easy way around that, though;
that problem is common to all such setups.


# Features
# - Clusters > 2 nodes supported, provided 

drbd.conf can have more than two "on $uname {}" or "floating $ip {}"
sections per resource, to accommodate a "floating" setup,
i.e.  several nodes able to access the same data set,
which may be an FC/iSCSI SAN, or a lower level DRBD in a stacked setup.
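
For illustration only (resource name, host names and addresses are
made up), such a resource section with more than two candidate peers
might look roughly like this:

	resource r0 {
		device    /dev/drbd0;
		disk      /dev/sdb1;
		meta-disk internal;

		# the classic case: exactly two "on $uname {}" sections
		on node-a { address 10.0.0.1:7788; }
		on node-b { address 10.0.0.2:7788; }

		# a "floating" setup may list further candidate peers that
		# can attach to the same (shared or lower-level) disk;
		# only two of them are ever connected at any one time:
		on node-c { address 10.0.0.3:7788; }
	}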

If you have exactly two such nodes, your assumptions should work.

If you have more than two such "on $uname {}" sections in drbd.conf,
you need to be aware that:

	# These are the environment variables set by DRBD. See 'man drbd.conf'
	# -> 'handlers'.
	env	=>	{
		# The resource triggering the fence.
		'DRBD_RESOURCE'		=>	$ENV{DRBD_RESOURCE},
		# The resource minor number.
		'DRBD_MINOR'		=>	$ENV{DRBD_MINOR},
		# This is 'ipv4' or 'ipv6'
		'DRBD_PEER_AF'		=>	$ENV{DRBD_PEER_AF},
		# The address of the peer(s).
		'DRBD_PEER_ADDRESS'	=>	$ENV{DRBD_PEER_ADDRESS},

DRBD_PEER_ADDRESS and _AF are both singular, and are set to the currently
configured peer, if any.  They may also be empty, if there is more than
one potential peer, and none of them is currently configured.

DRBD_PEERS, however, is plural,
and will contain a space-separated list of possible peer unames,
or may be empty if that list could not be determined (maybe because
DRBD_PEER_ADDRESS was not set).

		# The peer(s) hostname(s)
		'DRBD_PEERS'		=>	$ENV{DRBD_PEERS},
	},
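
A minimal sketch (not taken from rhcs_fence; the exit code and the
message are placeholders) of treating DRBD_PEERS as a possibly empty,
possibly multi-entry list:

	# Treat DRBD_PEERS as a list of unames, not a single name,
	# and handle the case where it is empty.
	my @peers = grep { length } split /\s+/, ($ENV{DRBD_PEERS} // "");
	if (not @peers) {
		# No peer name could be determined; refuse rather than guess.
		exit 1;
	}
	if (@peers > 1) {
		# More than one potential peer: a handler written for the
		# classic two-node case cannot know which one to act on.
		warn "multiple potential peers: @peers\n";
	}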

So you may want to document that your expectation is a classic two-node
DRBD configuration, even if those nodes may be part of a > 2 node cluster.

	# Example output showing what the bits mean.
	#        +--<  Current data generation UUID  >-
	#        |               +--<  Bitmap's base data generation UUID  >-
	#        |               |                 +--<  younger history UUID  >-
	#        |               |                 |         +-<  older history  >-
	#        V               V                 V         V
	# C3864FB60759430F:0000000000000000:A8C791FB53E8ED2B:A8C691FB53E8ED2B:1:1:1:1:0:0:0
	#                                                                     ^ ^ ^ ^ ^ ^ ^
	#                                       -<  Data consistency flag  >--+ | | | | | |
	#                              -<  Data was/is currently up-to-date  >--+ | | | | |
	#                                   -<  Node was/is currently primary  >--+ | | | |
	#                                   -<  Node was/is currently connected  >--+ | | |
	#          -<  Node was in the progress of setting all bits in the bitmap  >--+ | |
	#                         -<  The peer's disk was out-dated or inconsistent  >--+ |
	#       -<  This node was a crashed primary, and has not seen its peer since   >--+
	# 
	# flags: Primary, Connected, UpToDate
	
	# The sixth value will be 1 (UpToDate) or 0 (other).
	($conf->{sys}{local_res_uptodate}, $conf->{sys}{local_res_was_current_primary})=($status_line=~/.*?:.*?:.*?:.*?:\d:(\d):(\d):\d:\d:\d:\d/);
	
	to_log($conf, 0, __LINE__, "DEBUG: UpToDate: [$conf->{sys}{local_res_uptodate}]") if $conf->{sys}{debug};
	to_log($conf, 0, __LINE__, "DEBUG: Was Current Primary: [$conf->{sys}{local_res_was_current_primary}]") if $conf->{sys}{debug};

You want to check the current disk state of this resource,
and refuse unless it reports UpToDate.

This test is not sufficient:
	to_log($conf, 1, __LINE__,
		"Local resource: [$conf->{env}{DRBD_RESOURCE}] is NOT 'UpToDate',
		will not fence peer.")
		if not $conf->{sys}{local_res_uptodate};

It does not reflect the current state, but the state as stored in our "meta data flags".
It will always say it "was" UpToDate,
if it is Consistent and does not *know* that it is Outdated.
Maybe it would be clearer if we had the inverse logic,
and named that flag "is certain to contain outdated data".

You are interested in a state not expressed here (and not easy to
express in persistent meta data flags):
    it is Consistent, it knows it *was* UpToDate,
    neither self nor peer is marked Outdated:
    it does not yet know whether the peer has better data or not.

Also, in general I'd recommend avoiding calls to drbdsetup,
either explicit, or implicit via drbdadm, from a fence-peer handler.
In earlier DRBD versions that would reliably time out without a response;
it may work now, but....

If you look at crm-fence-peer.sh,
you'll notice that I grep the current state from /proc/drbd.
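
A rough sketch of that approach in perl (assuming the DRBD 8.3
/proc/drbd layout; the exit code is just a placeholder):

	# Read the *current* local disk state for this minor from
	# /proc/drbd instead of the persistent meta-data flags.
	my $minor = $ENV{DRBD_MINOR};
	my $disk_state = "";
	open my $proc, '<', '/proc/drbd'
		or die "Can't read /proc/drbd: $!\n";
	while (my $line = <$proc>) {
		# Status lines look like:
		#  0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
		if ($line =~ m{^\s*\Q$minor\E:.*\bds:([^/\s]+)}) {
			$disk_state = $1;   # local part of ds:Local/Peer
			last;
		}
	}
	close $proc;
	# Refuse to fence unless the local data is currently UpToDate.
	exit 1 if $disk_state ne 'UpToDate';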



And this part of the logic is not good, as explained below.
	to_log($conf, 1, __LINE__,
		"Local resource: [$conf->{env}{DRBD_RESOURCE}] was NOT 'Current Primary' and
		likely recovered from being fenced, will not fence peer.")
		if not $conf->{sys}{local_res_was_current_primary};

Scenario:
  All good, replicating, but only *one* Primary now.
  That Primary node suffers a crash/power outage/whatever.

  The cluster now wants to promote the remaining, still Secondary node.

  The Secondary, when promoted without an established replication link,
  will call the fence-peer handler during promotion.

  Your handler will "fail", because this node is not marked "current
  primary", causing the promotion to be aborted.
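
To spell out what that scenario needs (my reading of the two points
above, not a drop-in fix for rhcs_fence; fence_peer() is only a
placeholder, and $disk_state comes from the /proc/drbd sketch earlier):

	# Sketch only: gate fencing on the current disk state, not on
	# whether this node happened to be Primary before the link was lost.
	sub fence_peer { warn "would fence the peer here\n"; }

	if ($disk_state eq 'UpToDate') {
		# Good data: safe to fence the unreachable peer, even if this
		# node is a Secondary being promoted after the Primary crashed.
		fence_peer();
	} else {
		# Not known to have good data: refuse, and let the promotion
		# fail instead.
		exit 1;
	}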


The rest of the logic looked OK at first glance.

Thank you.

Would you rather keep this separate,
or should we start distributing this
(or some later revision of it) with DRBD?

>   The main differences are:
> 
> - No longer restricted to 2-node clusters. So long as DRBD_PEERS is set
> to a name found in 'cman_tool', the proper node will be fenced.
> 
> - More sanity checks are made to help minimize the risk of dual-fencing.
> First, dynamic or configurable delays help ensure both nodes won't try
> to simultaneously fence one another. Also, when a fenced node recovers
> but still can't reach its peer, it will not try to fence the surviving
> peer, which helps avoid fence loops.
> 
> - Improved fence call; Rather than a simple sequence of calls to
> fence_node, this script now waits for the output and verifies that the
> fence call succeeded. This prevents spurious error messages from being
> printed to syslog.
> 
>   Being on github, I welcome any
> feedback/improvements/suggestions/critique. Just ask and I'll be happy
> to give commit access. Testing will be greatly appreciated!
> 
> -- 
> Digimer
> E-Mail:              digimer at alteeve.com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


