[Drbd-dev] Another drbd race
Philipp Reisner
philipp.reisner at linbit.com
Tue Sep 7 13:32:02 CEST 2004
On Tuesday 07 September 2004 12:13, Lars Ellenberg wrote:
> On Tue, Sep 07, 2004 at 11:39:29AM +0200, Philipp Reisner wrote:
> > On Saturday 04 September 2004 12:00, Lars Ellenberg wrote:
> > > On Sat, Sep 04, 2004 at 11:48:14AM +0200, Lars Marowsky-Bree wrote:
> > > > Hi,
> > > >
> > > > lge and I discussed a 'new' drbd race condition yesterday and
> > > > also touched on its resolution.
> > > >
> > > > Scope: in a split-brain, drbd might confirm writes to the clients
> > > > and might, on a subsequent failover, lose transactions which _have
> > > > been confirmed_. This is not acceptable.
> > > >
> > > > Sequence:
> > > >
> > > > Step  N1  Link    N2
> > > >  1    P   ok      S
> > > >  2    P   breaks  S     N1 notices, goes stand-alone, stops
> > > >                         waiting for N2 to confirm.
> > > >  3    P   broken  S     S notices, initiates fencing.
> > > >  4    x   broken  P     N2 becomes primary.
> > > >
> > > > Writes done between steps 2 and 4 will have been confirmed to the
> > > > higher layers, but are not actually available on N2. This is data
> > > > loss; N2 is still consistent, but has lost confirmed transactions.
> > > >
> > > > This is partially solved by the Oracle-requested "only ever confirm
> > > > if committed to both nodes", but of course if it's not a broken
> > > > link but N2 really went down, we'd be blocking on N1 forever, which
> > > > we don't want for HA.
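> > > >
> > > > (That mode is essentially what protocol C already gives us, e.g.
> > > > in drbd.conf:
> > > >
> > > >     resource r0 {
> > > >         protocol C;   # report completion to upper layers only
> > > >                       # after the write has reached both nodes
> > > >         ...
> > > >     }
> > > >
> > > > so the remaining problem is only the blocking-forever part.)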
> > > >
> > > > So, here's the new sequence to solve this:
> > > >
> > > > Step  N1      Link  N2
> > > >  1    P       ok    S
> > > >  2    P(blk)  ok    X       P blocks waiting for acks; heartbeat
> > > >                             notices that it has lost N2 and
> > > >                             initiates fencing.
> > > >  3    P(blk)  ok    fenced  heartbeat tells drbd on N1 that yes,
> > > >                             we know it's dead, we fenced it, no
> > > >                             point waiting.
> > > >  4    P       ok    fenced  Cluster proceeds to run.
> > > >
> > > > Now, in this super-safe mode, if N1 also fails after step 3 but
> > > > before N2 comes back up and is resynced, we need to make sure that
> > > > N2 refuses to become primary itself. This will probably require
> > > > additional magic in the cluster manager to handle correctly, but N2
> > > > needs an additional flag to prevent this from happening by accident.
> > > >
> > > > Lars?
> > >
> > > I think we can already do this detection with the combination of the
> > > Consistent and Connected as well as the HaveBeenPrimary flags. Only
> > > the logic needs to be built in.
> >
> > I do not want to "misuse" the Consistent Bit for this.
> >
> > !Consistent ... means that we are in the middle of a sync;
> >                 the data is not usable at all.
> > Fenced      ... our data is 100% okay, but not the latest copy.
>
> let's call it "Outdated"
>
> my idea is that a crashed Secondary will come up as !Primary|Connected, so
> it can assume it is outdated. (similar to the choice about wfc-degr...)
>
> we can only possibly lose write transactions at the very moment we
> promote a Secondary to Primary. until we do that, and as long as the
> hard disk the transactions have been written to is still physically
> intact, the data is still there, though maybe not available.
>
> we can try to make sure that we never promote a Secondary that is
> possibly (or knowingly) outdated.
>
> see below.
>
> > > Most likely, right after connection loss the Primary should block
> > > for a configurable (default: infinity?) amount of time before giving
> > > end_io events back to the upper layer.
> > > We then need to be able to tell it to resume operation (we can do
> > > this as soon as we have taken precautions to prevent the Secondary
> > > from becoming Primary without being forced or resynced first).
> > >
> > > Or, if the cluster decides to do so, the Secondary has time to STONITH
> > > the Primary (while that is still blocking) and take over.
> > >
> > > I want to include a timeout, so the cluster manager doesn't need to
> > > know about a "peer is dead" notification; it only needs to know
> > > about STONITH.
> >
> > I see. Makes sense, but on the other hand STONITH (more general:
> > fencing) might fail, as LMB points out in one of the other mails.
> >
> > -> We should probably _not_ offer a timeout here: as soon as
> > "on-disconnect freeze_io;" is set, IO stays frozen forever,
> > or until it gets a "drbdadm resume-io r0" from the cluster manager.
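> >
> > Spelled out as config (syntax as proposed in this thread, none of it
> > implemented yet; exact placement in the config file aside):
> >
> >     resource r0 {
> >         on-disconnect freeze_io;   # on connection loss, block IO
> >                                    # completions indefinitely
> >     }
> >
> > and the cluster manager unfreezes N1 explicitly with:
> >
> >     drbdadm resume-io r0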
> >
> > > Maybe we want to introduce this functionality as a new wire
> > > protocol, or only in proto C.
> >
> > I see it controlled by the
> >
> > "on-disconnect freeze_io;" option.
> >
> > For N2 we need a "drbdadm fence-off r0" command and for N1 we need
> > a "drbdadm resume-io r0".
> >
> > * The fenced bit gets cleared when the resync is finished.
> > * A node refuses to become primary while the fenced bit is set.
> > * "drbdadm -- --do-what-I-say primary r0" overrules (and clears?)
> >   the fenced bit.
> >
> > To be defined: what should we do with the fenced bit at node startup?
> > (At least display it in the startup user dialog.)
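> >
> > As a shell-transcript sketch of the bit's life cycle (command names
> > as proposed above, nothing of this exists yet):
> >
> >     # N2 learns from the cluster manager that it is outdated:
> >     drbdadm fence-off r0                   # sets the fenced bit
> >
> >     # while the bit is set, promotion is refused:
> >     drbdadm primary r0                     # -> refused
> >
> >     # explicit operator override (clears the bit?):
> >     drbdadm -- --do-what-I-say primary r0
> >
> >     # otherwise the bit is cleared automatically when the resync
> >     # is finished.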
>
> I would like to introduce an additional Node state for the o_state:
> Dead. it is never "recognized" internally, but can be set by the
> operator or cluster manager. basically, if we go to WhatEver/Unknown,
> we don't accept anything (since we don't want to risk split brain).
> some higher authority can and needs to resolve this, telling us the peer
> is dead (after a successful stonith, when we are Secondary and shall be
> promoted).
>
>
> now we have this:
> P/S --- S/P
> P/? -:- S/?
>
> A)
> if this is in fact (from the pov of heartbeat)
> P/? -.. XXX
> we stonith it (just to be sure) and tell it "peer dead"
> P/D -..
> (and there it resumes).
>
> B)
> if this is in fact (from the pov of heartbeat)
> P/? XXX S/?
> - we do nothing
> (blocks until network is fixed again)
> - we tell S that it is outdated,
> then tell P to resume
> - or we make it (by STONITH) into either A or C
>
> C)
> if this is in fact (from the pov of heartbeat)
> XXX ..- S/?
> we stonith it (just to be sure) and tell it "peer dead"
> XXX ..- S/D
> (and there it accepts to be promoted again).
>
>
> similar after bootup:
> we refuse to be promoted to Primary from Secondary/Unknown,
> unless we get an explicit "peer dead" confirmation from someone.
>
> does that make any sense?
>
I like it a lot!
Thus we will not call it "drbdadm resume-io r0" but
"drbdadm peer-dead r0".
I think the assertion that the peer is dead
(short: "peer-dead") is a lot easier to understand than
a "resume-io" command.
Also the question in the startup user dialog:
  "Is the peer dead?"
is easier to get right...
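
For case A above, the cluster manager's side would then be as simple as
(a sketch only; the command does not exist yet):

    # on N1, once heartbeat has successfully fenced N2:
    drbdadm peer-dead r0    # drbd stops waiting for acks, IO resumes

    # and in case C, on N2 before takeover:
    drbdadm peer-dead r0
    drbdadm primary r0      # accepted again now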
-Philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :