[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Lars Marowsky-Bree lmb at suse.de
Sun Sep 14 00:28:51 CEST 2008

On 2008-09-13T22:44:56, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> bad choice.
> it has a meaning.
> we have
> uptodate (necessarily consistent),
> consistent (not necessarily uptodate)
> outdated (consistent,
>           but we know a more recent version existed at some point)
> inconsistent.

Ok, ok. I'll call it "the flag" then for the time being ;-)

> > After it connects (and possibly resynchronizes), it clears the
> > "peer dirty" flag locally. (Assume that both sides are secondary at
> > this point; that'd be the default during a cluster start-up.)
> > 
> > When one side gets promoted, the other side sets the "peer dirty" flag
> > locally.  When it demotes, both sides clear it. Basically, each side
> > gets to clear it when it notices that the peer is demoted.  So far, so
> > good.
> > 
> > Scenario A - the replication link goes down:
> > 
> > - Primary:
> >   - Freezes IO.
> >   - Calls out to user-space to "fail" the peer.
> >   - Gets confirmation that peer is stopped (via RA notification).
> >   - Resumes IO.
> > 
> > - Secondary:
> >   - Simply gets stopped.
> >   - It'll assume "peer dirty" anyway, until it reconnects and
> >     resyncs.
> how can it possibly reconnect and resync,
> if it is stopped?

I meant "eventually", i.e. at some point the admin is going to fix it,
and then it'll be able to reconnect and resync, and clear the flag.

The emphasis is on the first part of the sentence - as the flag is set
by default on start-up anyway, the secondary can "simply" be stopped w/o
needing to write anything to disk.
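
As a rough illustration of the flag handling and of Scenario A as
described so far, here is a minimal Python sketch. It is not DRBD or
Pacemaker code: the Node class and all function names are hypothetical,
and the call-out to "fail" the peer plus the RA notification are
collapsed into a single direct stop() call.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of one DRBD node; not real DRBD or Pacemaker code."""
    name: str
    running: bool = False
    role: str = "secondary"
    flag: bool = True        # "the flag": the peer may be more recent
    io_frozen: bool = False

    def start(self) -> None:
        # Start-up always assumes the flag is set; nothing is read from disk.
        self.running = True
        self.role = "secondary"
        self.flag = True

    def stop(self) -> None:
        # No disk write needed: the next start() sets the flag again anyway.
        self.running = False
        self.role = "secondary"

    def can_promote(self) -> bool:
        return self.running and not self.flag


def connect_and_resync(a: Node, b: Node) -> None:
    # Both sides are secondary here; after connecting they clear the flag.
    a.flag = False
    b.flag = False


def promote(node: Node, peer: Node) -> None:
    if not node.can_promote():
        raise RuntimeError(f"{node.name} refuses to promote: flag is set")
    node.role = "primary"
    peer.flag = True         # the other side notes that its peer is primary


def replication_link_down(primary: Node, secondary: Node) -> None:
    primary.io_frozen = True     # freeze IO
    secondary.stop()             # stands in for the call-out to "fail" the
                                 # peer and the confirming RA notification
    primary.io_frozen = False    # resume IO only after that confirmation
```

If N1 and N2 are started, connected, and N1 is promoted, then after
replication_link_down(n1, n2) a later n2.start() leaves
n2.can_promote() false until connect_and_resync() runs again.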

> you still ignore _my_ scenario 2,
> I fail to see why you think you cover it.

I thought it is covered. I'll try again.

> right now, without dopd, and this "all new dopd in higher levels with
> notifications and stuff" does not exist yet, either
> this is possible:

Well, of course I'm describing the target scenario, not the current one.
I entirely agree that that is possible right now.

> situation 2: "outdate needed or data jumps back in time"
>     replication link breaks

That's the very first scenario I described!

>     primary keeps writing
>         (which means secondary has now stale data)

First, it wouldn't keep writing in the described approach, but freeze,
and only resume to write after it has been notified that the peer has
been stopped.

>     primary crashes
>     heartbeat promotes secondary to primary
>     and goes online with stale data.

Second, even if Pacemaker would restart the secondary (which was stopped
due to the failure), the secondary would be unable to promote as "the
flag" would be set by default on start-up.

I really believe that the approach I described covers this.

> variation: instead of primary crash, cluster crash.
>     cluster reboot, replication link still broken.
>     how do we prevent heartbeat from choosing the "wrong" node for promotion?

This scenario is indeed not perfectly handled by my approach as
described: it does handle that the "wrong" secondary doesn't get
promoted, but it would indeed prevent _both_ sides from being promoted,
which is not good.

First, the theoretical response to this is that replication link down
plus crash of two nodes actually constitutes a triple failure, and thus
not one we claim the cluster protects against. ;-) For some customers,
manual intervention here would be acceptable.

But second, a possible solution is to write a persistent "I was primary"
flag to the meta-data. On start, this would then set crm_master's
preference to non-zero value (say, 1), which would allow the node to be
promoted. This might be a tunable operation.
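
A minimal sketch of that idea in Python, under loud assumptions: the
JSON file merely stands in for DRBD's on-disk meta-data, the function
names and the "tunable" parameter are hypothetical, and the returned
number is only the value a resource agent might hand to crm_master (no
cluster tool is actually called here). When the marker would be cleared
again is not specified above, so the sketch does not model it.

```python
import json
import os
import tempfile


def record_was_primary(meta_path: str, was_primary: bool) -> None:
    """Persist the hypothetical "I was primary" marker."""
    with open(meta_path, "w") as fh:
        json.dump({"was_primary": was_primary}, fh)


def startup_master_preference(meta_path: str, tunable_enabled: bool = True) -> int:
    """Master preference to set on start: 1 if this node was primary, else 0."""
    if not tunable_enabled or not os.path.exists(meta_path):
        return 0
    with open(meta_path) as fh:
        was_primary = json.load(fh).get("was_primary", False)
    return 1 if was_primary else 0


def demo() -> tuple:
    """Cluster crash with the link down: only the former primary is promotable."""
    with tempfile.TemporaryDirectory() as d:
        old_primary = os.path.join(d, "n1.meta")
        old_secondary = os.path.join(d, "n2.meta")
        record_was_primary(old_primary, True)
        record_was_primary(old_secondary, False)
        return (startup_master_preference(old_primary),
                startup_master_preference(old_secondary))
```

Here demo() yields a preference of 1 for the former primary and 0 for
the former secondary, so only the former primary can be promoted while
the link is still down.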

> dopd handles both.
> how does your proposal?
> by stopping the secondary when the replication link broke?

That's what I explained, yes.

> but that must not happen. how could it then possibly resync, ever?

Pacemaker can be configured to restart it too, which would attempt a
reconnect (or even attempt the reconnect periodically, if the RA would
fail to start if unable to connect to the peer, but that might not even
be needed - restarting it once and keeping it running is sufficient).

Further, I might wish to actually stop the secondary _to be able to move
it to another node_ (which might be able to reconnect & resync).

> > See above; it would _always_ come up with the assumption that "peer is
> > dirty",
> let's call it "peer may be more recent".

OK. I'm still calling it "the flag" because it's easier to type ;-)

> > and thus refuse to promote. No need to store anything on disk;
> > it is the default assumption.
> then you can never go online after cluster crash,
> unless all drbd nodes come up _and_ can establish connection.

See above for one possible solution.

Okay, now you're going to propose the following scenario:

- Primary N1 crashes
- Secondary N2 gets promoted
- Cluster crash
- Replication link down
- Both nodes N1+N2 up

With the extension I propose above, both sides would set the same master
preference, while we'd obviously want N2 promoted, not N1. But then,
dopd wouldn't help this. Instead of writing 1 though, they could use one
of the generation counters (primary transitions seen?), which would be
n+1 for N2 and cause N2 to be (correctly) promoted.
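
To make the difference concrete, a small Python sketch; everything in
it is hypothetical. choose_master() is a simplified stand-in for the
cluster's choice that treats a tie as "no safe choice" (a real cluster
manager may resolve ties differently), and n is an arbitrary example
value for the generation counter.

```python
from typing import Dict, Optional


def choose_master(preferences: Dict[str, int]) -> Optional[str]:
    """Return the node with the strictly highest positive preference.

    Returns None if no node has a positive preference or if the highest
    preference is shared, i.e. there is no unambiguous candidate.
    """
    if not preferences:
        return None
    best = max(preferences.values())
    winners = [node for node, pref in preferences.items() if pref == best]
    if best <= 0 or len(winners) != 1:
        return None
    return winners[0]


n = 7  # arbitrary example: primary transitions seen by N1 before it crashed

# Constant "I was primary" preference: both nodes were primary at some point.
constant_prefs = {"N1": 1, "N2": 1}

# Generation counter as preference: N2 saw one more primary transition.
counter_prefs = {"N1": n, "N2": n + 1}
```

With constant_prefs the two nodes are indistinguishable and
choose_master() returns None; with counter_prefs it returns "N2".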

(Of course I can construct a sequence of failures which would break even
that, to which I'd reply that they really should simply use the same
bonded interfaces for both their cluster traffic _and_ the replication,
to completely avoid this problem ;-)

> no availability does match the problem description
> "don't go online with stale data."
> but it is not exactly what we want.

Depends on the scenario, but I think my above scenario works fine.

> I need the ability to store on disk that "_I_ am ahead of peer"
> if I know for sure, so I can be promoted after crash/reboot.

Ok, I see your point, and that is I think what I proposed above.

> > I think so. At least my proposal is becoming more concise, which is good
> > for review ;-)
> this time you made a step backwards, as you seem to think that drbd does
> not need to store any information about being outdated.

I actually still think this is so, yes.

> to again point out what problem we are trying to solve:
>   whenever a secondary is about to be promoted,
>   it needs to be "reasonably" certain that it has the most recent data,
>   otherwise it would refuse.
>   it does not matter whether the promotion attempt happens
>   right after connection loss,
>   or three and a half days, two cluster crashes
>   and some node reboots later.

Right, and agreed.

>   as it is almost impossible to be certain that you have the most recent
>   data, but it is very well possible to know that you are outdated (as
>   that does not change without a resync),
>   the dopd logic revolves around "outdate".

No disagreement there. I'm not saying dopd doesn't solve the problem.
I'm just trying to find a solution which solves it without needing dopd,
but which can instead leverage that Pacemaker is quite a bit smarter
than heartbeat-v1; hence my proposal above.

> we thoroughly thought about how to solve it.
> the result was dopd.
> any solution to replace dopd must at least
> cover as many scenarios as well as dopd.

Of course. I'm not disagreeing.

> the best way to replace dopd would be to find a more "high level"
> mechanism for a surviving Primary to actively signal a surviving
> Secondary to outdate itself (or get feedback why that was not possible),

Restarting it does that in my proposal (it would possibly come back up
with 'the flag' set by default) - and it does get active feedback that
the peer was stopped.

Indeed, it would NOT get feedback if that was not possible - that is a
new requirement. But that's impossible (okay, okay, "unlikely"), as
failure to stop would trigger the recovery escalation and eventually
stonith the former peer. Of course, if that fails _too_, but then I
think we've arrived at so many failures that "freeze" is an acceptable

> and for a not-connected Secondary which is about to be promoted
> to ask its peer to outdate itself, which may be refused as it may be
> primary.

I don't see the need for this second requirement. First, a not-connected
secondary in my example would never promote (unless it was primary
before, with the extension); second, if a primary exists, the cluster
would never promote a second one (that's protected against).
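
Put as a single rule, those two points might look like the following
Python sketch. The function and its parameters are hypothetical names
for the states discussed in this thread, not an existing DRBD or
Pacemaker interface.

```python
def may_promote(flag_set: bool, was_primary: bool, primary_exists: bool) -> bool:
    """Decide whether a secondary may be promoted under the proposal.

    flag_set:       "the flag" (peer may be more recent) is still set
    was_primary:    the persistent "I was primary" extension applies
    primary_exists: the cluster already runs a primary for this resource
    """
    if primary_exists:
        # The cluster never promotes a second primary.
        return False
    if not flag_set:
        # Connected and resynced: the flag has been cleared.
        return True
    # Not connected: only a former primary may come back up as primary.
    return was_primary
```

A not-connected secondary that was never primary gives
may_promote(True, False, False) == False, so it has no reason to ask
its peer to outdate itself.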

> if you want to solve it differently,
> it becomes a real mess of fragile complex hackwork and assumptions.

Please, don't call something that I thought a lot about "a mess and
hackwork" - that is sort of too easy to take personally. It is sufficient
to point out why it doesn't work ;-) I'm not saying I got it right; I
just _think_ I got it right.

And that doesn't mean that I'm disagreeing that dopd also solves it. But
I thought the intention was to try and get rid of it. My goal is to make
the entire setup less complex, which means cutting out as much as
possible with the intent of making it less fragile and easier to set up.

(And of course, to find out if there are things which are missing in the
m/s concept which might need to be introduced to achieve the former.)

> if it is possible to express
> "hello crm, something bad has happened,
> would you please notify my other
> clones/masters/slaves/whatever the terminus
> that I am still alive, and about to continue to change the data."
> and, tell me when that is done, so I can unfreeze",
> that would take it half way.

That is quite easily doable.

> to fully replace dopd, we'd need a way to communicate back.
> if it is possible for the master to tell the crm that the slave has
> failed and should therefore be stopped, then this should not be that
> difficult.

Right, exactly.

> alternatively, if we can put some state into the cib
> (in current drbd e.g. the "data generation uuids"),
> that might work as well.

Yes, that's somewhat what I proposed to do to solve the N1 versus N2
scenario.

> "failing", i.e. stopping, the slave
> just because a replication link hiccup
> is no solution.

Depends, as above - it could get restarted (possibly elsewhere) and then
try to reconnect, but the primary would be _sure_ that the secondary
will not be promoted.

To be frank, the _real_ problem we're solving here is that drbd and the
cluster layer are (or at least can be) disconnected, and we are trying to
figure out how much of that is needed and how to achieve it. If all
meta-data communication were taken out of drbd's data channel and
instead routed through user-space and the cluster layer, none of this
could happen, and the in-kernel implementation could probably be
simplified quite a bit. But that strikes me as quite a stretch goal for
the time being.

The easiest way to achieve this in 99% of all real-world cases with the
current code probably is to set up a bonded interface and route both the
cluster traffic and drbd across it. The likelihood of the two diverging
then approaches epsilon. I sometimes wonder if that would not be the
ultimately smarter thing to recommend than trying to implement complex
code. ;-)


Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
