[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Sat Sep 13 22:44:56 CEST 2008

On Sat, Sep 13, 2008 at 06:55:05PM +0200, Lars Marowsky-Bree wrote:
> On 2008-09-13T14:52:53, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> 
> > so.
> > what you are suggesting is
> > 
> > when drbd loses replication link
> >   primary
> >      freezes and calls out to userland,
> >        telling heartbeat that the peer has "failed",
> >        in which case heartbeat would stop drbd on the secondary.
> >      either receives "secondary was stopped",
> >         maybe stores to meta data "_I_ am ahead of peer",
> >             (useful for cluster wide crash/reboot later)
> >         and unfreezes
> >      or is being stopped itself
> >        (which would result in the node being self fenced, as the fs on
> >         top of drbd cannot be unmounted as drbd is frozen,...)
> >      or is even being shot as result of a cluster partition.
> > 
> >      so either primary continues to write,
> >      or it will soon look like a crashed primary.
> > 
> >   secondary
> >     sets a flag "primary may be ahead of me",
> >     then waits for
> >     either being stopped, in which case
> >       it would save to meta data "primary _IS_ ahead of me"
> >     or being told that the Primary was stopped
> >       when it would clear that flag again,
> >         maybe store to meta data "_I_ am ahead of peer"
> >       and then most likely soon after be promoted.
> 
> Okay, I think I have found a good name for the flag which I mean, and
> that should allow me to rephrase more clearly, and possibly simplify
> further. Retry:
> 
> Unconnected secondary starts up with "peer state dirty", which is
> basically identical to "(locally) unsynced/inconsistent", but I think
> it's easier to explain when I call it "peer dirty".

bad choice.
it has a meaning.
we have
uptodate (necessarily consistent),
consistent (not necessarily uptodate),
outdated (consistent,
          but we know a more recent version existed at some point)
inconsistent.

with "dirty" I think of inconsistent.
that is something different than outdated.
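
a minimal sketch of the four states listed above, in Python purely for
illustration. the names `DiskState` and `may_promote_unconnected` are
hypothetical and not taken from the drbd code, and allowing "consistent"
to be promoted is an assumption of this sketch. the point is only that
"outdated" and "inconsistent" are distinct states, and both must refuse
promotion.

```python
from enum import Enum


class DiskState(Enum):
    # hypothetical names mirroring the list in this mail
    INCONSISTENT = 0  # data is not usable as a whole
    OUTDATED = 1      # consistent, but a more recent version existed
    CONSISTENT = 2    # consistent, not necessarily up to date
    UPTODATE = 3      # necessarily consistent, and the latest


def may_promote_unconnected(state: DiskState) -> bool:
    """Assumed policy: refuse promotion if data is known stale or unusable."""
    return state in (DiskState.CONSISTENT, DiskState.UPTODATE)
```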

> After it connects (and possibly resynchronizes), it clears the
> "peer dirty" flag locally. (Assume that both sides are secondary at
> this point; that'd be the default during a cluster start-up.)
> 
> When one side gets promoted, the other side sets the "peer dirty" flag
> locally.  When it demotes, both sides clear it. Basically, each side
> gets to clear it when it notices that the peer is demoted.  So far, so
> good.
> 
> Scenario A - the replication link goes down:
> 
> - Primary:
>   - Freezes IO.
>   - Calls out to user-space to "fail" the peer.
>   - Gets confirmation that peer is stopped (via RA notification).
>   - Resumes IO.
> 
> - Secondary:
>   - Simply gets stopped.
>   - It'll assume "peer dirty" anyway, until it reconnects and
>     resyncs.

how can it possibly reconnect and resync,
if it is stopped?

> Scenario B - primary fails:
> 
> - Primary:
>   - Is dead. ;-)
> 
> Secondary:
>   - Gets confirmation that peer is stopped.
>   - Clears inconsistent flag (capable to resume IO).

you still ignore _my_ scenario 2,
I fail to see why you think you cover it.

right now, without dopd (and this "all new dopd in higher levels with
notifications and stuff" does not exist yet, either),
this is possible:

situation 2: "outdate needed or data jumps back in time"

    replication link breaks
    primary keeps writing
        (which means secondary has now stale data)
    primary crashes
    heartbeat promotes secondary to primary
    and goes online with stale data.

variation: instead of primary crash, cluster crash.
    cluster reboot, replication link still broken.
    how do we prevent heartbeat from choosing the "wrong" node for promotion?

dopd handles both.
how does your proposal?
by stopping the secondary when the replication link broke?
but that must not happen. how could it then possibly resync, ever?
and it won't work for the variation with the cluster crash.
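
to make situation 2 concrete, here is a toy model of it in Python. it is
a sketch under assumptions, not drbd or heartbeat code: `Node`,
`run_situation_2` and the node names are made up for this example, and
the "outdated" attribute stands in for a flag kept in meta data.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = []         # writes applied on this node
        self.outdated = False  # stand-in for a flag in on-disk meta data


def run_situation_2(outdate_on_link_loss):
    """Return the data the cluster goes online with, or None if refused."""
    primary, secondary = Node("alpha"), Node("bravo")

    # link is up: the write reaches both nodes
    for node in (primary, secondary):
        node.data.append("w1")

    # replication link breaks
    if outdate_on_link_loss:
        secondary.outdated = True  # the secondary is told, by some other path

    primary.data.append("w2")      # primary keeps writing

    # primary crashes; the cluster manager tries to promote the secondary
    if secondary.outdated:
        return None                # promotion refused, no stale data online
    return secondary.data          # online with stale data: "w2" is lost
```

without the outdate step the call returns only the first write, which
is exactly "data jumps back in time".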

> Scenario C - secondary fails:
> 
> Primary:
> - Same as A, actually, from the point of view of the primary.
> 
> Secondary:
> - Either gets fenced, or stopped.
> 
> 
> (Note that A/B/C could actually work for active/active too, as long as
> there's a way to ensure that only one side calls out to fail its peer,
> and the other one - for the sake of this scenario - behaves like a
> secondary.)
> 
> > some questions:
> >   wouldn't that "peer has failed" first trigger a monitor?
> 
> No; it'd translate to a direct stop.
> 
> >   wouldn't that mean that on monitor, a not connected secondary would
> >   have to report "failed", as otherwise it would not get stopped?
> >   wouldn't that prevent normal failover?
> 
> Monitoring definitions are a slightly different matter. The result of a
> monitor is not the same as the ability/preference to become master.
> Indeed a failed resource will never get promoted, but a happy resource
> needn't call crm_master and thus not become promotable.
> 
> I think "monitor" would refer exclusively to local health - local
> storage read/writable, drbd running, etc. 
> 
> >   if not,
> >   wouldn't heartbeat try to restart the "failed" secondary?
> >   what would happen?
> 
> It might try to restart. But if a secondary gets restarted, it'll know
> from the environment variables that a peer exists; if it can't connect
> to that, it should fail the start - alternatively, it'd be up and
> running, but have "outdated/peer is dirty" set anyway, and so never
> announce its ability to "promote".
> 
> >   what does a secondary do when started, and it finds the
> >     "primary IS ahead of me" flag in meta data?
> >     refuse to start even as slave?
> >       (would prevent it from ever being resync'ed!)
> >     start as slave, but refuse to be promoted?
> 
> The latter.
> 
> > problem: secondary crash.
> >    secondary reboots,
> >    heartbeat rejoins the cluster.
> >    
> >    replication link is still broken.
> > 
> >    secondary does not have "primary IS ahead of me" flag in meta data
> >    as because of the crash there was no way to store that.
> 
> >    would heartbeat try to start drbd (slave) here?
> >    what would trigger the "IS ahead of me" flag get stored on disk?
> 
> See above; it would _always_ come up with the assumption that "peer is
> dirty",

lets call it "peer may be more recent".

> and thus refuse to promote. No need to store anything on disk;
> it is the default assumption.

then you can never go online after cluster crash,
unless all drbd nodes come up _and_ can establish connection.

no availability does match the problem description
"don't go online with stale data."
but it is not exactly what we want.

I need the ability to store on disk that "_I_ am ahead of peer"
if I know for sure, so I can be promoted after crash/reboot.

> >    if for some reason policy engine now figures the master should rather
> >    run on the just rejoined node, how can that migration be prevented?
> 
> That's a different discussion, but: the ability (and preference for) to
> become primary is explicitly set by the RA through the call to
> "crm_master".
> 
> If it is unable to become master, it would call "crm_master -D"; it'll
> then _never_ be promoted.

for that, it would need to know first that it is outdated.  so it _is_
the same problem to some degree, as it depends on that solution.

> > I'm still not convinced that this method
> > covers as many scenarios as dopd, as well as dopd.
> 
> I think so. At least my proposal is becoming more concise, which is good
> for review ;-)

this time you made a step backwards, as you seem to think that drbd does
not need to store any information about being outdated.

to again point out what problem we are trying to solve:
  whenever a secondary is about to be promoted,
  it needs to be "reasonably" certain that it has the most recent data,
  otherwise it would refuse.
  it does not matter whether the promotion attempt happens
  right after connection loss,
  or three and a half days, two cluster crashes
  and some node reboots later.

  as it is almost impossible to be certain that you have the most recent
  data, but it is very well possible to know that you are outdated (as
  that does not change without a resync),
  the dopd logic revolves around "outdate".
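
a sketch of why the flag has to be persistent, again in Python and again
only an illustration. the JSON file is a stand-in for drbd meta data, and
`MetaData`, `outdate`, `resync_finished`, `promotion_allowed` and `demo`
are hypothetical names. the assumption encoded here: a node with no
stored flag is not known to be outdated, and only a resync clears the
flag.

```python
import json
import os
import tempfile


class MetaData:
    """Stand-in for on-disk meta data (a JSON file, purely illustrative)."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"outdated": False}  # nothing stored: not known outdated
        with open(self.path) as f:
            return json.load(f)

    def store(self, outdated):
        with open(self.path, "w") as f:
            json.dump({"outdated": outdated}, f)


def outdate(md):
    md.store(True)


def resync_finished(md):
    md.store(False)  # the only event that clears "outdated"


def promotion_allowed(md):
    return not md.load()["outdated"]


def demo():
    with tempfile.TemporaryDirectory() as d:
        md = MetaData(os.path.join(d, "meta.json"))
        before = promotion_allowed(md)
        outdate(md)
        # a fresh object on the same file models a crash/reboot
        after_reboot = promotion_allowed(MetaData(md.path))
        resync_finished(md)
        after_resync = promotion_allowed(md)
        return before, after_reboot, after_resync
```

the promotion check gives the same answer right after connection loss
and after any number of reboots, because it reads only what was stored.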

we thoroughly thought about how to solve it.
the result was dopd.
any solution to replace dopd must at least
cover as many scenarios as well as dopd.

the best way to replace dopd would be to find a more "high level"
mechanism for a surviving Primary to actively signal a surviving
Secondary to outdate itself (or get feedback why that was not possible),
and for a not-connected Secondary which is about to be promoted
to ask its peer to outdate itself, which may be refused as it may be
primary.
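
the two requests described above can be sketched as follows. this is a
hypothetical model in Python, not an existing interface of the crm or of
drbd: `Peer`, `handle_outdate_request` and `try_promote` are invented for
the example. it covers only the two outcomes named in this mail (peer
outdated itself, or peer refused because it is primary); what to do when
the peer cannot be reached at all is deliberately left out.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


class Peer:
    def __init__(self, role):
        self.role = role
        self.outdated = False

    def handle_outdate_request(self):
        """Return True if this node outdated itself, False if it refused."""
        if self.role is Role.PRIMARY:
            return False  # a primary must not outdate itself
        self.outdated = True
        return True


def try_promote(local, ask_peer_to_outdate):
    """Promote `local` only if it is not outdated and the peer agreed."""
    if local.outdated:
        return False
    if not ask_peer_to_outdate():
        return False  # peer refused, so it may be primary
    local.role = Role.PRIMARY
    return True
```

a surviving primary uses the same request in the other direction: it
calls the secondary's `handle_outdate_request()` and unfreezes once it
has the answer.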

if you want to solve it differently,
it becomes a real mess of fragile complex hackwork and assumptions.

if it is possible to express
"hello crm, something bad has happened,
would you please notify my other
clones/masters/slaves/whatever the term is
that I am still alive, and about to continue to change the data,
and tell me when that is done, so I can unfreeze",
that would take it half way. to fully replace dopd,
we'd need a way to communicate back.

if it is possible for the master to tell the crm that the slave has
failed and should therefore be stopped, then this should not be that
difficult.

alternatively, if we can put some state into the cib
(in current drbd e.g. the "data generation uuids"),
that might work as well.

"failing", i.e. stopping, the slave
just because of a replication link hiccup
is no solution.

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed