Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 16/03/14 11:14 AM, Lars Ellenberg wrote:
> On Fri, Mar 14, 2014 at 10:44:54AM -0400, Digimer wrote:
>> On 14/03/14 05:34 AM, khaled atteya wrote:
>>> A- In the DRBD User's Guide, in the explanation of "resource-only",
>>> which is one of the fencing policies, they say:
>>>
>>> "If a node becomes a disconnected primary, it tries to fence the
>>> peer's disk. This is done by calling the fence-peer handler. The
>>> handler is supposed to reach the other node over alternative
>>> communication paths and call 'drbdadm outdate minor' there."
>>>
>>> My question is: if the handler can't reach the other node for any
>>> reason, what will happen?
>>
>> I always use 'resource-and-stonith', which blocks until the fence
>> action was a success. As for the fence handler, I always pass the
>> requests up to the cluster manager. To do this, I use 'rhcs_fence'
>> on Red Hat clusters (cman + rgmanager) or crm-fence-peer.sh on
>> corosync + pacemaker clusters.
>>
>> In either case, the fence action does not try to log into the other
>> node. Instead, it uses an external device, like IPMI or PDUs, and
>> forces the node off.
>
> crm-fence-peer.sh *DOES NOT* fence (as in node-level fence, aka
> stonith) the other node. It "fences" the other node by setting a
> pacemaker constraint to prevent the other node from being,
> respectively becoming, Primary.
>
> It tries hard to detect whether the replication link loss was a node
> failure (in which case pacemaker will notice and stonith anyway) or
> only some problem with the replication TCP connection, in which case
> the other node is still reachable via cluster communications and will
> notice and respect the new constraint.
>
> If it seems to be a node-level failure, it tries to wait until the CIB
> reflects it as "confirmed and expected down" (successful stonith).
> There are various timeouts to modify the exact behaviour.
>
> That script contains massive shell comments, documenting its intended
> usage and functionality.
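[For context: the handler wiring discussed above is a few lines of drbd.conf. A minimal sketch, assuming a DRBD 8.4-style resource named "r0" and the stock script paths shipped with drbd-utils on a corosync + pacemaker cluster:]

```
resource r0 {
  disk {
    # Freeze I/O and call the fence-peer handler when a Primary
    # loses its replication link.
    fencing resource-and-stonith;
  }
  handlers {
    # Escalate to pacemaker: set a constraint that keeps the peer
    # from being/becoming Primary.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    # Remove that constraint again once resync has completed.
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```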
>
> In a dual-primary setup, if it was a replication link failure only,
> and cluster communication is still up, both will call that handler,
> but only one will succeed in setting the constraint. The other will
> remain IO-blocked, and can optionally "commit suicide" from inside
> the handler.

Ooooooh, I misunderstood this. Thanks for clarifying!

> Just because you were able to shoot the other node does not make your
> data any better.
>
> In a situation where you only use node-level fencing from inside this
> handler, the other node would boot back up, and if configured to
> start cluster services by default, could start pacemaker, not see the
> other node, do startup-fencing (shoot the still-live Primary), and
> conclude from being able to shoot the other node that its own,
> outdated, version of the data would be good enough to go online.
> Unlikely, but possible.
>
> Which is why in this scenario, you should not start up cluster
> services if you cannot see the peer, or at least refuse to shoot from
> the DRBD fence-peer handler if your local disk state is only
> "Consistent" (which it is after bringing up DRBD, if configured for
> such fencing, if it cannot communicate with its peer).
>
> So to be "good", you need both: the node-level fencing,
> and the DRBD-level fencing.
>
>>> B- In Active/Passive mode, do these directives have any effect:
>>> do "after-sb-0pri, after-sb-1pri, after-sb-2pri" have effect in
>>> Active/Passive mode, or only in Active/Active mode?
>>> If they have effect, and I don't set them, is there a default
>>> value for each?
>>
>> It doesn't matter what mode you are in; it matters what happened
>> during the time that the nodes were split-brained. If both nodes
>> were Secondary during the split-brain, the 0pri policy is used. If
>> one node was Primary and the other remained Secondary, the 1pri
>> policy is used. If both nodes were Primary, even for a short time,
>> 2pri is used.
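[For context: the after-sb-* policies live in the net section of drbd.conf, and the default for all three is "disconnect", i.e. no automatic recovery. A hedged sketch, again assuming a resource named "r0" — remember the warning below that any auto-recovery other than zero-changes is automated data loss:]

```
resource r0 {
  net {
    # Zero Primaries at handshake time: auto-resolve only if exactly
    # one side wrote during the split-brain; otherwise disconnect.
    after-sb-0pri discard-zero-changes;
    # One Primary at handshake time: keep the Primary's data,
    # discard the Secondary's divergent writes.
    after-sb-1pri discard-secondary;
    # Two Primaries: refuse to auto-resolve, stay disconnected.
    after-sb-2pri disconnect;
  }
}
```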
>
> The 0, 1 and 2 Primary is counting Primaries at the moment of the
> DRBD handshake.
>
> (If one had been Secondary all along, we would not have data
> divergence, only "fast-forwardable" outdated-ness.)
>
> Which means that if you happen to have a funky multi-failure
> scenario, and end up doing the DRBD handshake where the one with the
> better data is Secondary, the one with the not-so-good data is
> Primary, and you have configured "discard Secondary", you will be
> very disappointed.
>
> All such auto-recovery strategies (but zero-changes) are automating
> data loss. So you had better be sure you mean that.
>
>> The reason the policy doesn't matter so much is because the roles
>> matter, not how they got there. For example, if you or someone else
>> assumed the old Primary was dead and manually promoted the
>> Secondary, you have a two-primary split-brain, despite the normal
>> mode of operation.
>>
>>> C- Can I use SBD fencing with drbd+pacemaker rather than IPMI or
>>> PDU?
>>
>> No, I do not believe so. The reason being that if the nodes
>> split-brain, both will think they have access to the "SAN" storage.
>> Whereas with a real (external) SAN, it's possible to say "only one
>> node is allowed to talk and the other is blocked". There is no way
>> for one node to block access to the other node's local DRBD data.
>>
>> IPMI/PDU fencing is certainly the way to go.
>
> You cannot use the replication device as its own fencing mechanism.
> That's a dependency loop.
> You can still use DRBD, and SBD, but you would have a different,
> actually shared IO medium for the SBD, independent from your DRBD
> setup.
>
> Of course you can also put an SBD on an iSCSI export from a different
> DRBD cluster, as long as that other iSCSI + DRBD cluster is properly
> set up to avoid data divergence there under any circumstances...

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
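[For context: when the after-sb policies are left at "disconnect" (or don't apply), split-brain recovery is manual — you decide which node is the victim whose divergent writes get discarded. A sketch of the usual DRBD 8.4 sequence, assuming a resource named "r0"; the victim's changes since the split-brain are irrevocably lost:]

```
# On the split-brain victim (the node whose data you discard):
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# On the survivor, only if its connection state also shows StandAlone:
drbdadm connect r0
```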