[Drbd-dev] Another drbd race

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 7 14:05:02 CEST 2004


On Tue, Sep 07, 2004 at 01:32:02PM +0200, Philipp Reisner wrote:
> > I would like to introduce an additional Node state for the o_state:
> > Dead. it is never "recognized" internally, but can be set by the
> > operator or cluster manager. basically, if we go to WhatEver/Unknown,
> > we don't accept anything (since we don't want to risk split brain).
> > some higher authority can and needs to resolve this, telling us the peer
> > is dead (after a successful stonith, when we are Secondary and shall be
> > promoted).
> >
> >
> > now we have this:
> >   P/S --- S/P
> >   P/? -:- S/?
> >
> >  A)
> >   if this is in fact (from the pov of heartbeat)
> >   P/? -.. XXX
> >     we stonith it (just to be sure) and tell it "peer dead"
> >   P/D -..
> >     (and there it resumes).
> >
> >  B)
> >   if this is in fact (from the pov of heartbeat)
> >   P/? XXX S/?
> >     - we do nothing
> >       (blocks until network is fixed again)
> >     - we tell S that it is outdated,
> >       then tell P to resume
> >     - or we make it (by STONITH) into either A or C
> >
> >  C)
> >   if this is in fact (from the pov of heartbeat)
> >   XXX ..- S/?
> >     we stonith it (just to be sure) and tell it "peer dead"
> >   XXX ..- S/D
> >     (and there it accepts to be promoted again).
> >
> >
> > similar after bootup:
> >   we refuse to be promoted to Primary from Secondary/Unknown,
> >   unless we got an explicit "peer dead" confirmation by someone.
> >
> > does that make any sense?
> >
> 
> I like it a lot!
> 
> Thus we will not call it "drbdadm resume-io r0" but 
> "drbdadm peer-dead r0"
> 
> I think the assertion that the peer is dead 
> (short "peer-dead")  is a lot easier to understand than
> a "resume-io" command.
> 
> 
> Also the question at the startup-user-dialog: 
> 
> Is the peer dead ? 
> 
> Is easier to get right....


maybe we still need to make this a two-stage process:
after a reboot, if we remain in Secondary/Unknown,
we need to be told "peer dead", but we also need to get the confirmation
"up-to-date" (just to cover our ass).

when it was just a connection loss, we *are* up-to-date, and just need the
confirmation "peer dead"; or we get the confirmation "link dead, peer
alive", which basically is "you are outdated!".

just so we cannot be blamed for "automatically losing transactions",
even in a multiple failure scenario.
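
a rough sketch of the two-stage check described above, in Python rather than
drbd's own C, just to make the rule concrete. every name here (Node, Peer,
may_promote, the confirm_* methods) is hypothetical and made up for
illustration; none of it is existing drbd code or a drbdadm interface.

```python
from dataclasses import dataclass
from enum import Enum


class Peer(Enum):
    UNKNOWN = "Unknown"
    DEAD = "Dead"  # never detected internally; only set by operator / cluster manager


@dataclass
class Node:
    """Hypothetical model of a Secondary node deciding whether it may be promoted."""
    peer: Peer = Peer.UNKNOWN
    rebooted: bool = False              # True if we came up in Secondary/Unknown after a reboot
    up_to_date_confirmed: bool = False  # explicit "up-to-date" confirmation received
    outdated: bool = False              # set once we are told "link dead, peer alive"

    def confirm_peer_dead(self) -> None:
        # stage one: someone with authority asserts the peer is dead
        self.peer = Peer.DEAD

    def confirm_up_to_date(self) -> None:
        # stage two: only required after a reboot
        self.up_to_date_confirmed = True

    def confirm_link_dead_peer_alive(self) -> None:
        # "link dead, peer alive" basically is "you are outdated!"
        self.outdated = True

    def may_promote(self) -> bool:
        # never promote on outdated data or without a "peer dead" confirmation
        if self.outdated or self.peer is not Peer.DEAD:
            return False
        # after a reboot, additionally insist on the "up-to-date" confirmation
        if self.rebooted and not self.up_to_date_confirmed:
            return False
        return True
```

with this model, a node that only lost its connection may be promoted after
confirm_peer_dead() alone, a rebooted node needs both confirmations, and a
node that was told "link dead, peer alive" refuses promotion either way.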

	lge

