[Drbd-dev] Another drbd race

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 7 12:13:43 CEST 2004


On Tue, Sep 07, 2004 at 11:39:29AM +0200, Philipp Reisner wrote:
> On Saturday 04 September 2004 12:00, Lars Ellenberg wrote:
> > On Sat, Sep 04, 2004 at 11:48:14AM +0200, Lars Marowsky-Bree wrote:
> > > Hi,
> > >
> > > lge and I discussed a 'new' drbd race condition yesterday and also
> > > touched on its resolution.
> > >
> > > Scope: in a split-brain, drbd might confirm writes to the clients and
> > > might, on a subsequent failover, lose transactions which _have been
> > > confirmed_. This is not acceptable.
> > >
> > > Sequence:
> > >
> > > Step  N1  Link    N2
> > > 1     P   ok      S
> > > 2     P   breaks  S    node1 notices, goes into stand alone,
> > >                        stops waiting for N2 to confirm.
> > > 3     P   broken  S    S notices, initiates fencing
> > > 4     x   broken  P    N2 becomes primary
> > >
> > > Writes which have been done in between steps 2-4 will have been confirmed
> > > to the higher layers, but are not actually available on N2. This is data
> > > loss; N2 is still consistent, but has lost confirmed transactions.
> > >
> > > Partially, this is solved by the Oracle-requested "only ever confirm if
> > > committed to both nodes", but of course then if it's not a broken link,
> > > but N2 really went down, we'd be blocking on N1 forever, which we don't
> > > want to do for HA.
> > >
> > > So, here's the new sequence to solve this:
> > >
> > > Step  N1      Link  N2
> > > 1     P       ok    S
> > > 2     P(blk)  ok    X       P blocks waiting for acks; heartbeat
> > >                             notices that it has lost N2, and initiates
> > >                             fencing.
> > > 3     P(blk)  ok    fenced  heartbeat tells drbd on N1 that yes, we
> > >                             know it's dead, we fenced it, no point
> > >                             waiting.
> > > 4     P       ok    fenced  Cluster proceeds to run.
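
to make the rule at steps 2-4 explicit, a minimal sketch in C (all names
are made up, this is not actual drbd code):

  /* minimal sketch of the completion rule in the sequence above;
   * all names are made up, this is not actual drbd code. */
  #include <stdbool.h>

  struct write_req {
      bool local_done;   /* our own disk write has completed */
      bool peer_acked;   /* the peer has acknowledged the write */
  };

  struct peer_info {
      bool connected;    /* connection to the peer is up */
      bool known_fenced; /* heartbeat told us: the peer has been fenced */
  };

  /* may this write be reported as completed to the upper layers?
   * yes if the peer acked it, or if a higher authority confirmed that
   * the peer has been fenced: then no ack will ever arrive, and the
   * fenced peer will not take over before it has been resynced, so no
   * confirmed write can be lost. otherwise keep blocking. */
  bool may_complete(const struct write_req *r, const struct peer_info *p)
  {
      if (!r->local_done)
          return false;
      if (r->peer_acked)
          return true;
      if (!p->connected && p->known_fenced)
          return true;
      return false;
  }
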
> > >
> > > Now, in this super-safe mode, if now N1 also fails after step 3 but
> > > before N2 comes back up and is resynced, we need to make sure that N2
> > > does refuse to become primary itself. This will probably require
> > > additional magic in the cluster manager to handle correctly, but N2
> > > needs an additional flag to prevent this from happening by accident.
> > >
> > > Lars?
> >
> > I think we can do this detection already with the combination of the
> > Consistent and Connected flags as well as the HaveBeenPrimary flag. Only
> > the logic needs to be built in.
> >
> 
> I do not want to "misuse" the Consistent Bit for this.
> 
> !Consistent  .... means that we are in the middle of a sync.
>                    = data is not usable at all.
>  Fenced      .... our data is 100% okay, but not the latest copy.

let's call it "Outdated"
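
to make the distinction explicit, a rough sketch (names are not final,
not actual drbd code):

  /* sketch only, names are not final */
  enum data_state {
      Inconsistent,  /* in the middle of a sync: data not usable at all */
      Outdated,      /* data is 100% okay in itself, but known not to be
                      * the latest copy; must not be promoted to Primary */
      UpToDate       /* consistent and, as far as we know, the latest copy */
  };
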

my idea is that a crashed Secondary will come up as neither Primary nor
Connected, so it can assume it is outdated. (similar to the choice about
wfc-degr...)

we can only possibly lose write transactions at the very moment we
promote a Secondary to Primary. until we do that, and as long as the
harddisk the transactions have been written to is still physically
intact, the data is still there, though maybe not available.

we can try to make sure that we never promote a Secondary that possibly
(or knowingly) is outdated.

see below.

> 
> 
> > Most likely, right after connection loss the Primary should block for a
> > configurable (default: infinity?) amount of time before giving end_io
> > events back to the upper layer.
> > We then need to be able to tell it to resume operation (we can do this
> > as soon as we have taken precautions to prevent the Secondary from
> > becoming Primary without being forced or resynced first).
> >
> > Or, if the cluster decides to do so, the Secondary has time to STONITH
> > the Primary (while that is still blocking) and take over.
> >
> > I want to include a timeout, so the cluster manager doesn't need to
> > know about a "peer is dead" notification; it only needs to know about
> > STONITH.
> 
> I see. Makes sense, but on the other hand STONITH (more general:
> FENCING) might fail, as LMB points out in one of the other mails.
> 
> -> We should probably _not_ offer a timeout here: as soon as
>    "on-disconnect freeze_io;" is set, it is frozen forever,
>    or until it gets a "drbdadm resume-io r0" from the cluster manager.
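
spelled out, that "freeze forever until resume-io" would look roughly
like this (sketch only, names made up, not actual drbd code):

  /* minimal sketch of "freeze until resume-io", deliberately without a
   * timeout; names made up, not actual drbd code. */
  #include <pthread.h>
  #include <stdbool.h>

  struct io_freeze {
      pthread_mutex_t lock;
      pthread_cond_t  resumed;
      bool            frozen;   /* set on connection loss */
  };

  /* called in the request path while "on-disconnect freeze_io;" is in
   * effect: block for as long as it takes, there is no timeout. */
  void wait_until_resumed(struct io_freeze *f)
  {
      pthread_mutex_lock(&f->lock);
      while (f->frozen)
          pthread_cond_wait(&f->resumed, &f->lock);
      pthread_mutex_unlock(&f->lock);
  }

  /* triggered by the cluster manager, e.g. via "drbdadm resume-io r0" */
  void resume_io(struct io_freeze *f)
  {
      pthread_mutex_lock(&f->lock);
      f->frozen = false;
      pthread_cond_broadcast(&f->resumed);
      pthread_mutex_unlock(&f->lock);
  }
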
> 
> > Maybe we want to introduce this functionality as a new wire protocol,
> > or only in proto C.
> >
> 
> I see it controlled by the
> 
> "on-disconnect freeze_io;" option.
> 
> For N2 we need a "drbdadm fence-off r0" command and for N1 we need 
> a "drbdadm resume-io r0".
> 
> * The fenced bit gets cleared when the resync is finished.
> * A node refuses to become primary when the fenced bit is set.
> * "drbdadm -- --do-what-I-say primary r0" overrules (and clears?)
>   the fenced bit.
> 
> To be defined: What should we do at node startup with the fenced bit?
>                (At least display it in the user dialog.)
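
those rules, spelled out in a minimal sketch (names made up, not actual
drbd code):

  /* sketch of the proposed fenced bit rules; names made up,
   * not actual drbd code. */
  #include <stdbool.h>

  struct drbd_flags {
      bool fenced;   /* a.k.a. Outdated: data is okay, but not the latest */
  };

  /* "drbdadm fence-off r0" on the disconnected Secondary */
  void fence_off(struct drbd_flags *f)
  {
      f->fenced = true;
  }

  /* the fenced bit gets cleared when the resync is finished */
  void resync_finished(struct drbd_flags *f)
  {
      f->fenced = false;
  }

  /* a node refuses to become primary while the fenced bit is set,
   * unless "drbdadm -- --do-what-I-say primary r0" overrules it */
  bool may_become_primary(struct drbd_flags *f, bool do_what_i_say)
  {
      if (do_what_i_say) {
          f->fenced = false;   /* overrule (and clear?) the fenced bit */
          return true;
      }
      return !f->fenced;
  }
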

I would like to introduce an additional Node state for the o_state:
Dead. it is never "recognized" internally, but can be set by the
operator or cluster manager. basically, if we go to WhatEver/Unknown,
we don't accept anything (since we don't want to risk split brain).
some higher authority can and needs to resolve this, telling us the peer
is dead (after a successful stonith, when we are Secondary and shall be
promoted).


now we have this:
  P/S --- S/P
  P/? -:- S/?
  
 A)
  if this is in fact (from the pov of heartbeat)
  P/? -.. XXX
    we stonith N2 (just to be sure) and tell N1 "peer dead"
  P/D -..   
    (and there it resumes).

 B)
  if this is in fact (from the pov of heartbeat)
  P/? XXX S/?
    - we do nothing
      (P blocks until the network is fixed again)
    - or we tell S that it is outdated,
      then tell P to resume
    - or we make it (by STONITH) into either A or C
    
 C)
  if this is in fact (from the pov of heartbeat)
  XXX ..- S/?
    we stonith N1 (just to be sure) and tell N2 "peer dead"
  XXX ..- S/D
    (and there it accepts being promoted again).


similar after bootup:
  we refuse to be promoted from Secondary/Unknown to Primary,
  unless we get an explicit "peer dead" confirmation from someone.
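
in cluster manager terms, the cases A, B, C and the bootup rule come
down to roughly this (sketch only; names made up, not actual heartbeat
or drbd code):

  /* sketch of the heartbeat side decisions for cases A, B, C above;
   * names made up, not actual heartbeat or drbd code. */
  #include <stdbool.h>
  #include <stdio.h>

  void decide(bool n1_alive /* was Primary */, bool n2_alive /* was Secondary */)
  {
      if (n1_alive && !n2_alive) {
          /* case A: stonith N2 just to be sure, then tell N1 "peer dead"
           * so that it resumes io */
          printf("stonith N2; on N1: drbdadm resume-io r0\n");
      } else if (!n1_alive && n2_alive) {
          /* case C: stonith N1 just to be sure, then tell N2 "peer dead"
           * so that it accepts promotion to Primary */
          printf("stonith N1; on N2: tell it the peer is dead\n");
      } else if (n1_alive && n2_alive) {
          /* case B: both are alive, only the drbd link is broken.
           * either do nothing (N1 stays frozen), or mark N2 outdated and
           * resume N1, or stonith one node to turn this into A or C. */
          printf("on N2: drbdadm fence-off r0; on N1: drbdadm resume-io r0\n");
      }
      /* after a reboot, a node in Secondary/Unknown refuses promotion
       * until it gets an explicit "peer dead" from here */
  }
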

does that make any sense?

  	lge

