[Drbd-dev] Another drbd race
Lars Ellenberg
lars.ellenberg at linbit.com
Tue Sep 7 14:05:02 CEST 2004
On Tue, Sep 07, 2004 at 01:32:02PM +0200, Philipp Reisner wrote:
> > I would like to introduce an additional Node state for the o_state:
> > Dead. it is never "recognized" internally, but can be set by the
> > operator or cluster manager. basically, if we go to WhatEver/Unknown,
> > we don't accept anything (since we don't want to risk split brain).
> > some higher authority can and needs to resolve this, telling us the peer
> > is dead (after a successful stonith, when we are Secondary and shall be
> > promoted).
> >
> >
> > now we have this:
> > P/S --- S/P
> > P/? -:- S/?
> >
> > A)
> > if this is in fact (from the pov of heartbeat)
> > P/? -.. XXX
> > we stonith it (just to be sure) and tell it "peer dead"
> > P/D -..
> > (and there it resumes).
> >
> > B)
> > if this is in fact (from the pov of heartbeat)
> > P/? XXX S/?
> > - we do nothing
> > (blocks until network is fixed again)
> > - we tell S that it is outdated,
> > then tell P to resume
> > - or we make it (by STONITH) into either A or C
> >
> > C)
> > if this is in fact (from the pov of heartbeat)
> > XXX ..- S/?
> > we stonith it (just to be sure) and tell it "peer dead"
> > XXX ..- S/D
> > (and there it accepts to be promoted again).
> >
> >
> > similar after bootup:
> > we refuse to be promoted to Primary from Secondary/Unknown,
> > unless we got an explicit "peer dead" confirmation by someone.
> >
> > does that make any sense?
> >
>
> I like it a lot!
>
> Thus we will not call it "drbdadm resume-io r0" but
> "drbdadm peer-dead r0"
>
> I think the assertion that the peer is dead
> (short "peer-dead") is a lot easier to understand than
> a "resume-io" command.
>
>
> Also the question at the startup-user-dialog:
>
> Is the peer dead ?
>
> Is easier to get right....
maybe we still need to make this a two-stage process:
after a reboot, while we remain in Secondary/Unknown,
we need to be told "peer dead", but we also need the confirmation
"up-to-date" (just to cover our ass).

when it was just a connection loss, we *are* up-to-date and only need the
confirmation "peer dead"; or we get the confirmation "link dead, peer
alive", which basically means "you are outdated!".

this way we cannot be blamed for "automatically losing transactions",
even in a multiple-failure scenario.
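
a rough sketch of the two-stage rule, seen from a Secondary that has lost
its peer. this is illustrative Python only, not DRBD code: the class
SecondaryNode, the PeerState values and all method names are made up for
this example and do not correspond to existing drbdadm commands or kernel
states.

```python
from dataclasses import dataclass
from enum import Enum


class PeerState(Enum):
    UNKNOWN = "Unknown"  # connection lost, nothing confirmed yet
    DEAD = "Dead"        # never detected internally, only set from outside
    ALIVE = "Alive"      # "link dead, peer alive" was confirmed


@dataclass
class SecondaryNode:
    """Hypothetical model of a Secondary deciding whether it may be promoted."""
    rebooted: bool = False              # True: came up in Secondary/Unknown
    peer: PeerState = PeerState.UNKNOWN
    confirmed_up_to_date: bool = False
    outdated: bool = False

    def confirm_peer_dead(self) -> None:
        # Stage one: operator or cluster manager asserts the peer is dead.
        self.peer = PeerState.DEAD

    def confirm_up_to_date(self) -> None:
        # Stage two: only needed after a reboot.
        self.confirmed_up_to_date = True

    def confirm_link_dead_peer_alive(self) -> None:
        # "link dead, peer alive" basically means "you are outdated!"
        self.peer = PeerState.ALIVE
        self.outdated = True

    def may_promote(self) -> bool:
        if self.outdated or self.peer is not PeerState.DEAD:
            return False
        # After a plain connection loss we know our data is current;
        # after a reboot we require the explicit "up-to-date" confirmation.
        return self.confirmed_up_to_date or not self.rebooted
```

with this model, a node that only lost the connection may be promoted as
soon as it gets "peer dead", while a rebooted node stays refused until it
has received both confirmations.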
lge