[DRBD-user] Consistent device to primary fences remote node

Lars Ellenberg lars.ellenberg at linbit.com
Thu Nov 27 17:38:39 CET 2008



On Thu, Nov 27, 2008 at 03:24:51PM +0100, Federico Simoncelli wrote:
> On Thu, Nov 27, 2008 at 2:47 PM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> >> So basically the problem is that my outdate-peer doesn't try to
> >> outdate the remote peer but it just fences it.
> >> To fix this behaviour I could modify the outdate-peer handler to check
> >> the DRBD_RESOURCE dstate and return the exit code 6 if local resource
> >> is not "UpToDate".
> >>
> >> What do you think? Comments are welcome.
> >
> > if you return 6,
> > drbd will (try to) outdate itself as a side effect.
> 
> This was what I was trying to accomplish. Basically my idea to avoid
> split-brain is:
> 
> Scenario 1:
> 
> 1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
> 2) server 2 is correctly shut down
> 3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/Outdated
> 4) booting server 2 in StandAlone mode is impossible since it has Outdated data

careful. you use your own outdate-peer handler.
so, server-1 "knows".
but does server-2 know that it is outdated?
who outdated it?
when?

> Scenario 2:
> 
> 1) both servers: cs:Connected st:Primary/Primary ds:UpToDate/UpToDate
> 2) server 2 is incorrectly shut down (fence/power loss), resource
> remains in "Consistent" status (not "UpToDate")
> 3) server 1: cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown
> 4) booting server 2 in StandAlone mode is impossible since the
> outdate-peer handler returns with exit code 6 when local resource is
> not "UpToDate".
> 
> Basically the Consistent status will always be turned into Outdated
> because I have no way to check if the remote peer is primary. I assume
> that a peer with "Consistent" status was incorrectly shut down and
> can't become primary without manual intervention.

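with that handler you can no longer reboot a single primary,
whether cleanly or by power reset: after the reboot its disk is
"Consistent", the handler returns 6, and the node outdates itself.

that assumption is "wrong":
both Outdated and UpToDate are sub-aspects of Consistent.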

if no drbd fencing policy is configured (the default, "dont-care"),
drbd assumes Consistent == UpToDate.

but if some drbd fencing policy is configured, then only the drbd
handshake, or the outdate-peer handler via its exit code, can
disambiguate.
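
to make the objection concrete, here is a minimal sketch of the handler
logic proposed in the quoted mail. it is not the poster's actual script.
the function names are made up for illustration, and exit code 7
("peer was stonithed") for the UpToDate branch is an assumption, since
the thread only says that this branch fences the peer. DRBD_RESOURCE and
"drbdadm dstate" (which prints "local/peer") are the only DRBD
interfaces used.

```shell
#!/bin/sh
# Hypothetical outdate-peer handler sketch, for illustration only.

# Take the local half of the "local/peer" string from `drbdadm dstate`.
local_half() {
    printf '%s\n' "${1%%/*}"
}

# Proposed rule from the thread: return 6 unless the local disk is UpToDate.
# Exit code 6 makes drbd (try to) outdate the local disk as a side effect.
decide_exit_code() {
    dstate=$1
    case "$dstate" in
        UpToDate) echo 7 ;;  # assumption: peer was fenced by other means
        *)        echo 6 ;;  # Consistent, Outdated, empty, anything else
    esac
}

# Only talk to drbd when called as a handler (drbd sets DRBD_RESOURCE).
if [ -n "${DRBD_RESOURCE:-}" ]; then
    state=$(drbdadm dstate "$DRBD_RESOURCE")
    exit "$(decide_exit_code "$(local_half "$state")")"
fi
```

note that "Consistent" lands in the catch-all branch, so a lone node
that was simply rebooted gets 6 and outdates itself, which is the
problem described above.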

> If both nodes are incorrectly shut down they both end in "Consistent"
> status. At the next boot they'll both outdate their local resource and
> manual intervention is required to choose the most updated resource.
> 
> What do you think? Comments are welcome.
> 
> > why don't you just set a high initial wait for connection timeout?
> >   wfc-timeout 172800;
> > if within two days no-one came and told me that I'm outdated,
> > and I still cannot reach the other node, I have all right to assume I'm
> > the only survivor and allowed to become primary.
> 
> I don't like the idea of a server waiting for a couple of days in the
> boot sequence as a general rule and in this particular situation even
> more since I moved the drbd script early at the beginning before
> clvmd.
> Stopping the booting sequence for 2 days means I wouldn't be able to
> remotely log in.

of course network and sshd have to be up first.
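
for reference, the 172800 suggested earlier is just two days expressed
in seconds, the unit wfc-timeout takes. a small sketch that derives the
value and prints a matching startup section for drbd.conf. the helper
names are hypothetical, and the printed fragment is meant to be merged
into an existing configuration by hand:

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Convert a number of days into seconds.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# Print a drbd.conf startup section with the computed wfc-timeout.
print_startup_section() {
    printf 'startup {\n    wfc-timeout %s;\n}\n' "$(days_to_seconds "$1")"
}

print_startup_section 2
```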

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


