[DRBD-user] Fencing & split brain related questions

Sun Mar 16 16:14:15 CET 2014

On Fri, Mar 14, 2014 at 10:44:54AM -0400, Digimer wrote:
> On 14/03/14 05:34 AM, khaled atteya wrote:
> >A- In the DRBD User's Guide, in the explanation of "resource-only", which is one
> >of the fencing policies, it says:
> >
> >"If a node becomes a disconnected primary, it tries to fence the
> >peer's disk. This is done by calling the fence-peer handler. The handler
> >is supposed to reach the
> >other node over alternative communication paths and call 'drbdadm
> >outdate minor' there."
> >
> >My question is: if the handler can't reach the other node for any
> >reason, what will happen?
> 
> I always use 'resource-and-stonith', which blocks until the fence
> action was a success. As for the fence handler, I always pass the
> requests up to the cluster manager. To do this, I use 'rhcs_fence'
> on Red Hat clusters (cman + rgmanager) or crm-fence-peer.sh on
> corosync + pacemaker clusters.
> 
> In either case, the fence action does not try to log into the other
> node. Instead, it uses an external device, like IPMI or PDUs, and
> forces the node off.

crm-fence-peer.sh *DOES NOT* fence (as in node-level fence aka stonith) the other node.
It "fences" the other node by setting a pacemaker constraint
to prevent the other node from being, or becoming, Primary.

It tries hard to detect whether the replication link loss
was a node failure (in which case pacemaker will notice and stonith
anyways) or only some problem with the replication tcp connection,
in which case the other node is still reachable via cluster
communications and will notice and respect the new constraint.

If it seems to be a node level failure, it tries to wait until the cib
reflects it as "confirmed and expected down" (successful stonith).
There are various timeouts to modify the exact behaviour.

That script contains massive shell comments,
documenting its intended usage and functionality.

In a dual-primary setup, if it was a replication link failure only,
and cluster communication is still up, both nodes will call that handler,
but only one will succeed in setting the constraint.  The other will remain
IO-blocked, and can optionally "commit suicide" from inside the handler.

Just because you were able to shoot the other node does not make your
data any better.

In a situation where you only use node level fencing from inside
this handler, the other node would boot back up, and if configured
to start cluster services by default, could start pacemaker, not see the
other node, do startup-fencing (shoot the still live Primary), and
conclude from being able to shoot the other node that its own, outdated,
version of the data would be good enough to go online.
Unlikely, but possible.

Which is why in this scenario, you should not start up cluster services
if you cannot see the peer, or at least refuse to shoot from the
DRBD fence-peer handler if your local disk state is only "Consistent"
(which it is after bringing up DRBD, if configured for such fencing,
if it cannot communicate with its peer).
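
The second check could look roughly like this. It is a sketch under
assumptions: the function name is hypothetical, and treating only
"UpToDate" as good enough to shoot is my reading of the paragraph above,
not something taken from the shipped handler scripts.

```python
# Hypothetical guard, NOT part of any shipped DRBD fence-peer handler.
# The disk state names are the ones DRBD 8.x reports.
DRBD_DISK_STATES = (
    "Diskless", "Attaching", "Failed", "Negotiating", "Inconsistent",
    "Outdated", "DUnknown", "Consistent", "UpToDate",
)


def may_shoot_peer(local_disk_state: str) -> bool:
    """Return True only if the local data is known to be current.

    Assumption: anything less than "UpToDate" (in particular "Consistent",
    the state after bringing up DRBD without a reachable peer) means we
    cannot know whether our data is the latest, so we must not shoot.
    """
    if local_disk_state not in DRBD_DISK_STATES:
        raise ValueError(f"unknown disk state: {local_disk_state!r}")
    return local_disk_state == "UpToDate"
```

A freshly booted node that cannot see its peer would report "Consistent"
here, and the guard would refuse to shoot the still live Primary.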

So to be "good", you need both: the node level fencing,
and the drbd level fencing.

> >B- Do the directives "after-sb-0pri", "after-sb-1pri" and "after-sb-2pri"
> >have an effect in active/passive mode, or only in active/active mode?
> >If they do have an effect and I don't set them, is there a default value
> >for each?
> 
> It doesn't matter what mode you are in, it matters what happened
> during the time that the nodes were split-brained. If both nodes
> were secondary during the split-brain, 0pri policy is used. If one
> node was Primary and the other remained secondary, 1pri policy is
> used. If both nodes were primary, even for a short time, 2pri is
> used.

The 0, 1 and 2 in the policy names count the Primaries
at the moment of the DRBD handshake.

(If one had been secondary all along,
we would not have data divergence,
only "fast-forwardable" outdated-ness.)
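
As a rough illustration of that counting rule (a sketch only; the
function name is made up and this is not how DRBD is implemented
internally), the policy that gets consulted depends only on the roles
at handshake time:

```python
# Illustration only: hypothetical helper, not DRBD code.
def after_sb_policy_key(local_role: str, peer_role: str) -> str:
    """Name the after-sb-* setting that applies, given both roles
    as they are at the moment of the DRBD handshake."""
    roles = (local_role, peer_role)
    for role in roles:
        if role not in ("Primary", "Secondary"):
            raise ValueError(f"unexpected role: {role!r}")
    # Count Primaries now, not during the time the nodes were split.
    primaries = roles.count("Primary")
    return f"after-sb-{primaries}pri"
```

So a node that was Primary during the split, but has been demoted before
the handshake, is counted as Secondary.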

Which means that if you happen to have a funky multi-failure scenario,
and end up doing the drbd handshake where the one with the better data
is Secondary, the one with the not-so-good data is Primary, and you have
configured "discard Secondary", you will be very disappointed.

All such auto-recovery strategies (except zero-changes) automate data loss.
So you had better be sure you mean that.

> The reason the policy doesn't matter so much is because the roles
> matter, not how they got there. For example, if you or someone else
> assumed the old primary was dead and manually promoted the
> secondary, you have a two-primary split-brain, despite the normal
> mode of operation.
> 
> >C- can I use SBD fencing with drbd+pacemaker rather than IPMI or PDU?
> 
> No, I do not believe so. The reason being that if the nodes
> split-brain, both will think they have access to the "SAN" storage.
> Whereas with a real (external) SAN, it's possible to say "only one
> node is allowed to talk and the other is blocked", there is no way
> for one node to block access to the other node's local DRBD data.
>
> IPMI/PDU fencing is certainly the way to go.

You cannot use the replication device as its own fencing mechanism.
That's a dependency loop.
You can still use DRBD and SBD, but you would need a different,
actually shared IO medium for the SBD, independent from your DRBD setup.

Of course you can also put an SBD on an iSCSI export from a different
DRBD cluster, as long as that other iSCSI + DRBD cluster is properly
set up to avoid data divergence there under any circumstances...

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed