[DRBD-user] Fencing & split brain related questions

Digimer lists at alteeve.ca
Sun Mar 16 18:03:03 CET 2014


On 16/03/14 11:14 AM, Lars Ellenberg wrote:
> On Fri, Mar 14, 2014 at 10:44:54AM -0400, Digimer wrote:
>> On 14/03/14 05:34 AM, khaled atteya wrote:
>>> A- In the DRBD User's Guide, in the explanation of "resource-only" (one
>>> of the fencing policies), it says:
>>> "If a node becomes a disconnected primary, it tries to fence the
>>> peer's disk. This is done by calling the fence-peer handler. The handler
>>> is supposed to reach the
>>> other node over alternative communication paths and call 'drbdadm
>>> outdate minor' there."
>>> My question is: if the handler can't reach the other node for any
>>> reason, what will happen?
>> I always use 'resource-and-stonith', which blocks I/O until the fence
>> action succeeds. As for the fence handler, I always pass the
>> requests up to the cluster manager. To do this, I use 'rhcs_fence'
>> on Red Hat clusters (cman + rgmanager) or crm-fence-peer.sh on
>> corosync + pacemaker clusters.
>> In either case, the fence action does not try to log into the other
>> node. Instead, it uses an external device, like IPMI or PDUs, and
>> forces the node off.
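
For reference, the setup described above might look roughly like the
following DRBD resource configuration (the resource name "r0" and the
handler paths are illustrative, not taken from the thread):

```
resource r0 {
  disk {
    # Block all I/O on a disconnected Primary until the peer is fenced.
    fencing resource-and-stonith;
  }
  handlers {
    # On a Pacemaker cluster; rhcs_fence would go here on cman/rgmanager.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # ... device, disk, address, meta-disk definitions ...
}
```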
> crm-fence-peer.sh *DOES NOT* fence (as in node-level fence aka stonith) the other node.
> It "fences" the other node by setting a pacemaker constraint
> to prevent that node from being, or becoming, Primary.
> It tries hard to detect whether the replication link loss
> was a node failure (in which case pacemaker will notice and stonith
> anyways) or only some problem with the replication tcp connection,
> in which case the other node is still reachable via cluster
> communications and will notice and respect the new constraint.
> If it seems to be a node level failure, it tries to wait until the cib
> reflects it as "confirmed and expected down" (successful stonith).
> There are various timeouts to modify the exact behaviour.
> That script contains massive shell comments,
> documenting its intended usage and functionality.
> In a dual-primary setup, if it was a replication link failure only,
> and cluster communication is still up, both will call that handler,
> but only one will succeed to set the constraint.  The other will remain
> IO-blocked, and can optionally "commit suicide" from inside the handler.
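
For illustration, the constraint that crm-fence-peer.sh places in the CIB
looks roughly like this (resource, master-set, and node names here are
made up; the real ids are derived from your configuration):

```
<rsc_location rsc="ms_drbd_r0" id="drbd-fence-by-handler-r0-ms_drbd_r0">
  <rule role="Master" score="-INFINITY"
        id="drbd-fence-by-handler-r0-rule-ms_drbd_r0">
    <!-- only the node that set the constraint may be Master -->
    <expression attribute="#uname" operation="ne" value="surviving-node"/>
  </rule>
</rsc_location>
```

The after-resync-target handler (crm-unfence-peer.sh) removes this
constraint again once the peer has caught up.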

Ooooooh, I misunderstood this. Thanks for clarifying!

> Just because you were able to shoot the other node does not make your
> data any better.
> In a situation, where you only use node level fencing from inside
> this handler, the other node would boot back up, and if configured
> to start cluster services by default, could start pacemaker, not see the
> other node, do startup-fencing (shoot the still live Primary), and
> conclude from being able to shoot the other node that its own, outdated,
> version of the data would be good enough to go online.
> Unlikely, but possible.
> Which is why in this scenario, you should not start up cluster services
> if you cannot see the peer, or at least refuse to shoot from the
> DRBD fence-peer handler if your local disk state is only "Consistent"
> (which it is after bringing up DRBD, if configured for such fencing,
> if it cannot communicate with its peer).
> So to be "good", you need both: the node level fencing,
> and the drbd level fencing.
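
The "refuse to shoot if your local disk state is only Consistent" rule
could be sketched as a small wrapper around the real fence-peer handler.
This is a hypothetical sketch, not an existing script; it assumes the
local disk state is taken from the first half of `drbdadm dstate` output
(e.g. "UpToDate/DUnknown"):

```shell
# Decide whether node-level fencing from the DRBD fence-peer handler is
# safe, given the local/peer disk state string from `drbdadm dstate`.
# "Consistent" means we could not tell at attach time whether our data
# is current, so shooting the peer on that basis would be wrong.
may_shoot() {
    local_state=${1%%/*}        # strip the peer part of "local/peer"
    case "$local_state" in
        UpToDate) return 0 ;;   # our data is known good: fencing is safe
        *)        return 1 ;;   # Consistent, Outdated, ...: refuse
    esac
}
```

A real handler would call this with `$(drbdadm dstate "$DRBD_RESOURCE")`
and exec the actual fence agent only when it returns success.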
>>> B- Do the directives "after-sb-0pri, after-sb-1pri, after-sb-2pri" have
>>> any effect in Active/Passive mode, or only in Active/Active mode?
>>> If they do, what happens if I don't set them? Is there a default value
>>> for each?
>> It doesn't matter what mode you are in, it matters what happened
>> during the time that the nodes were split-brained. If both nodes
>> were secondary during the split-brain, 0pri policy is used. If one
>> node was Primary and the other remained secondary, 1pri policy is
>> used. If both nodes were primary, even for a short time, 2pri is
>> used.
> The 0, 1 and 2 in those policy names count the Primaries
> at the moment of the DRBD handshake.
> (If one had been secondary all along,
> we would not have data divergence,
> only "fast-forwardable" outdated-ness.)
> Which means that if you happen to have a funky multi-failure scenario,
> and end up doing the drbd handshake where the one with the better data
> is Secondary, the one with the not-so-good data is Primary, and you have
> configured "discard Secondary", you will be very disappointed.
> All such auto-recovery strategies (except the zero-changes ones) automate data loss.
> So you had better be sure that is what you mean.
>> The reason the policy doesn't matter so much is because the roles
>> matter, not how they got there. For example, if you or someone else
>> assumed the old primary was dead and manually promoted the
>> secondary, you have a two-primary split-brain, despite the normal
>> mode of operation.
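
For completeness, the defaults for all three policies are "disconnect",
i.e. do nothing automatically and wait for the admin. Written out
explicitly in a net section (resource name assumed):

```
resource r0 {
  net {
    # These are the defaults: on split brain, just disconnect and
    # leave recovery to the admin; never discard data automatically.
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
  }
  # ...
}
```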
>>> C- can I use SBD fencing with drbd+pacemaker rather than IPMI or PDU?
>> No, I do not believe so. The reason is that if the nodes
>> split-brain, both will think they have access to the "SAN" storage.
>> Whereas with a real (external) SAN, it's possible to say "only one
>> node is allowed to talk and the other is blocked." There is no way
>> for one node to block access to the other node's local DRBD data.
>> IPMI/PDU fencing is certainly the way to go.
> You cannot use the replication device as its own fencing mechanism.
> That's a dependency loop.
> You can still use DRBD, and SBD, but you would have a different,
> actually shared IO medium for the SBD, independent of your DRBD setup.
> Of course you can also put an SBD on an iSCSI export from a different
> DRBD cluster, as long as that other iSCSI + DRBD cluster is properly
> set up to avoid data divergence there under any circumstances...
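
On a distribution that ships SBD, the "separate, genuinely shared device"
variant Lars describes might be configured roughly as follows (the device
path is a placeholder; it must point at storage that is NOT backed by the
DRBD resource it is supposed to fence):

```
# /etc/sysconfig/sbd -- SBD on a truly shared, non-DRBD device
SBD_DEVICE="/dev/disk/by-id/shared-iscsi-lun"   # placeholder path
SBD_WATCHDOG_DEV="/dev/watchdog"

# Pacemaker side (crm shell), so the cluster can use SBD for fencing:
#   primitive stonith-sbd stonith:external/sbd
```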

Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?
