Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 01/15/2012 08:18 AM, Lars Ellenberg wrote:
> Some comments on where I think that script's logic
> is incomplete, still:
>
> First, if you manage to get a simultaneous cluster crash,
> and then only one node comes back, you'll be offline,
> and need admin intervention to get online again.
> There is no easy way around that, though,
> so that is a problem common to all such setups.
This is a valid concern, but if I understand right, it's more a problem
of "both crash, only one recovers". If both recover, DRBD should reconnect
and do its magic, never calling this script. Assuming that is correct,
and barring a suggestion on reliably determining that it's OK to go
UpToDate/Primary, I'd rather leave things hung for an admin to deal
with, given the alternative risk of data loss.
> # Features
> # - Clusters > 2 nodes supported, provided
>
> drbd.conf can have more than two "on $uname {}" or "floating $ip {}"
> sections per resource, to accommodate a "floating" setup,
> i.e. several nodes able to access the same data set,
> which may be a FC/iSCSI SAN, or a lower level DRBD in a stacked setup.
>
> If you have exactly two such nodes, your assumptions should work.
>
> If you have more than two such "on $uname {}" sections in drbd.conf,
> you need to be aware that:
>
> # These are the environment variables set by DRBD. See 'man drbd.conf'
> # -> 'handlers'.
> env => {
> # The resource triggering the fence.
> 'DRBD_RESOURCE' => $ENV{DRBD_RESOURCE},
> # The resource minor number.
> 'DRBD_MINOR' => $ENV{DRBD_MINOR},
> # This is 'ipv4' or 'ipv6'
> 'DRBD_PEER_AF' => $ENV{DRBD_PEER_AF},
> # The address of the peer(s).
> 'DRBD_PEER_ADDRESS' => $ENV{DRBD_PEER_ADDRESS},
>
> DRBD_PEER_ADDRESS and _AF are both singular, and set to the currently
> configured peer, if any. They may also be empty, if there is more than
> one potential peer, and none of them is currently configured.
I've deleted all but DRBD_RESOURCE and DRBD_PEERS now, as I wasn't using
the others anyway.
> DRBD_PEERS, however, is plural,
> and will contain a space separated list of possible peer unames,
> or may be empty if that list could not be determined (maybe because
> DRBD_PEER_ADDRESS was not set).
>
> # The peer(s) hostname(s)
> 'DRBD_PEERS' => $ENV{DRBD_PEERS},
> },
Ah, I was expecting 'DRBD_PEERS' to be only the one that went silent and
needed to be fenced, even in stacked setups. OK, support for 2-node DRBD
(though possibly in a multi-node cluster) it shall be. If someone wants to
see stacked setups supported, they can contribute a patch. :)
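Concretely, the guard I have in mind is something along these lines (a
simplified sketch, not the exact code in the script):

  # Refuse to run in floating/stacked setups where more than one peer
  # is possible; this handler only understands a single, named peer.
  my @peers = split ' ', ($ENV{DRBD_PEERS} || "");
  if (@peers != 1)
  {
      die "Expected exactly one peer name in DRBD_PEERS, got: [@peers]\n";
  }
  my $peer = $peers[0];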
> So you may want to document that your expectation is a classic two node
> DRBD configuration, even if those nodes may be part of a > 2 node cluster.
Done.
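For the documentation, the expected setup is roughly the following
(illustrative only; hostnames, backing devices and the handler's install
path are placeholders):

  resource r0 {
      protocol C;
      disk {
          fencing resource-and-stonith;
      }
      handlers {
          fence-peer "/usr/sbin/rhcs_fence";
      }
      on node1.example.com {
          device    /dev/drbd0;
          disk      /dev/sda5;
          meta-disk internal;
          address   192.168.1.1:7788;
      }
      on node2.example.com {
          device    /dev/drbd0;
          disk      /dev/sda5;
          meta-disk internal;
          address   192.168.1.2:7788;
      }
  }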
> # Example output showing what the bits mean.
> # +--< Current data generation UUID >-
> # | +--< Bitmap's base data generation UUID >-
> # | | +--< younger history UUID >-
> # | | | +-< older history >-
> # V V V V
> # C3864FB60759430F:0000000000000000:A8C791FB53E8ED2B:A8C691FB53E8ED2B:1:1:1:1:0:0:0
> # ^ ^ ^ ^ ^ ^ ^
> # -< Data consistency flag >--+ | | | | | |
> # -< Data was/is currently up-to-date >--+ | | | | |
> # -< Node was/is currently primary >--+ | | | |
> # -< Node was/is currently connected >--+ | | |
> # -< Node was in the progress of setting all bits in the bitmap >--+ | |
> # -< The peer's disk was out-dated or inconsistent >--+ |
> # -< This node was a crashed primary, and has not seen its peer since >--+
> #
> # flags: Primary, Connected, UpToDate
>
> # The sixth value will be 1 (UpToDate) or 0 (other).
> ($conf->{sys}{local_res_uptodate}, $conf->{sys}{local_res_was_current_primary})=($status_line=~/.*?:.*?:.*?:.*?:\d:(\d):(\d):\d:\d:\d:\d/);
>
> to_log($conf, 0, __LINE__, "DEBUG: UpToDate: [$conf->{sys}{local_res_uptodate}]") if $conf->{sys}{debug};
> to_log($conf, 0, __LINE__, "DEBUG: Was Current Primary: [$conf->{sys}{local_res_was_current_primary}]") if $conf->{sys}{debug};
>
> You want the current disk state of this resource,
> and refuse unless that reports UpToDate.
Done.
> This test is not sufficient:
> to_log($conf, 1, __LINE__,
> "Local resource: [$conf->{env}{DRBD_RESOURCE}] is NOT 'UpToDate',
> will not fence peer.")
> if not $conf->{sys}{local_res_uptodate};
>
> It does not reflect the current state, but the state as stored in our "meta data flags".
> It will always say it "was" UpToDate,
> if it is Consistent and does not *know* that it is Outdated.
> Maybe it would be clearer if we had the inverse logic,
> and named that flag "is certain to contain outdated data".
>
> You are interested in a state not expressed here (and not easy
> to express in persistent meta data flags):
> It is Consistent, it knows it *was* UpToDate,
> neither self nor peer is marked Outdated:
> it does not know yet if the peer has better data or not.
>
> Also, in general I'd recommend avoiding calls to drbdsetup,
> either explicitly, or implicitly via drbdadm, from a fence-peer handler.
> In earlier DRBD versions that would reliably time out without a response;
> it may work now, but....
>
> If you look at crm-fence-peer.sh,
> you'll notice that I grep the current state from /proc/drbd.
Changed to parse /proc/drbd, so the above concerns should not be in play
anymore.
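The parsing boils down to roughly this (paraphrased sketch; the helper
name is just for illustration, how the minor number is obtained is
glossed over here, and the actual code is in the commit linked below):

  # Pull the local disk state for the given minor out of /proc/drbd.
  # Status lines look like:
  #  0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
  sub get_local_disk_state
  {
      my ($minor) = @_;
      open my $proc, '<', '/proc/drbd' or die "Unable to read /proc/drbd: $!\n";
      while (my $line = <$proc>)
      {
          if ($line =~ /^\s*\Q$minor\E:\s+cs:\S+\s+ro:\S+\s+ds:(\w+)\//)
          {
              close $proc;
              return $1;   # e.g. "UpToDate", "Consistent", "Outdated"
          }
      }
      close $proc;
      return "";
  }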
> And this part of the logic is not good, explained below.
> to_log($conf, 1, __LINE__,
> "Local resource: [$conf->{env}{DRBD_RESOURCE}] was NOT 'Current Primary' and
> likely recovered from being fenced, will not fence peer.")
> if not $conf->{sys}{local_res_was_current_primary};
>
> Scenario:
> All good, replicating, but only *one* Primary now.
> Primary node crash/power outage/whatever.
>
> Cluster wants to now promote the remaining, still Secondary node.
>
> Secondary, when promoted without established replication link,
> will call the fence-peer handler during promotion.
>
> Your handler will "fail", because it is not marked "current primary",
> causing the promotion to be aborted.
Relying now on the local disk state only, this should also be a
non-issue, I believe.
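So the decision comes down to something like this (sketch only;
fence_peer() stands in for the actual call out to the cluster's fencing,
and exit code 7 is what I understand DRBD expects from a fence-peer
handler when the peer really was fenced):

  # Decide based solely on our own, current disk state, as parsed from
  # /proc/drbd in the sketch above.
  my $disk_state = get_local_disk_state($minor);
  if ($disk_state eq "UpToDate")
  {
      # Our copy is good; fence the peer so DRBD (and the promotion that
      # triggered us, if any) can proceed.
      exit 7 if fence_peer($conf);   # hypothetical wrapper around fence_node
      exit 1;                        # fencing itself failed
  }
  else
  {
      # Consistent/Outdated/Inconsistent: we cannot prove our data is
      # current, so refuse and leave it for an admin, as said above.
      exit 1;
  }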
> Rest of the logic looked OK at first glance.
>
> Thank you.
>
> Would you rather keep this separately,
> or should we start distributing this
> (or some later revision of it) with DRBD?
As soon as you think it's well enough fleshed out, I would be happy to
see it included directly with DRBD itself.
Changes pushed to GitHub. I plan to make some changes as per fabionne's
suggestions shortly, but commit 06a126f
(https://github.com/digimer/rhcs_fence/commit/06a126f315b2f1cbcf2bc7485507815266d34926)
reflects your feedback.
Cheers!
--
Digimer
E-Mail: digimer at alteeve.com
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron