[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Lars Ellenberg lars.ellenberg at linbit.com
Sun Sep 14 16:31:36 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Sun, Sep 14, 2008 at 12:28:51AM +0200, Lars Marowsky-Bree wrote:
> > > After it connects (and possibly resynchronizes), it clears the
> > > "peer dirty" flag locally. (Assume that it both sides are secondary at
> > > this point; that'd be the default during a cluster start-up.)
> > > 
> > > When one side gets promoted, the other side sets the "peer dirty" flag
> > > locally.  When it demotes, both sides clear it. Basically, each side
> > > gets to clear it when it notices that the peer is demoted.  So far, so
> > > good.
> > > 
> > > Scenario A - the replication link goes down:
> > > 
> > > - Primary:
> > >   - Freezes IO.
> > >   - Calls out to user-space to "fail" the peer.
> > >   - Gets confirmation that peer is stopped (via RA notification).
> > >   - Resumes IO.
> > > 
> > > - Secondary:
> > >   - Simply gets stopped.
> > >   - It'll assume "peer dirty" anyway, until it reconnects and
> > >     resyncs.
> > 
> > how can it possibly reconnect and resync,
> > if it is stopped?
> 
> I meant "eventually", ie sometime the admin is going to fix it and then
> it'll be able to reconnect and resync, and clear the flag.

admin intervention required for a network hiccup.
not an option.

> > right now, without dopd (and this "all new dopd in higher levels with
> > notifications and stuff" does not exist yet either),
> > this is possible:
> 
> Well, of course I'm describing the target scenario, not the current one.
> I entirely agree that that is possible right now.

sure. but we have dopd. and it covers this.
master/slave notifications alone, as was your original proposal,
certainly cannot, as you have meanwhile noticed.
your current, combined proposal, involving the notifications for some
part and calling out to "fail" a node, i.e. stopping the secondary
because of a network hiccup, is worse than dopd.

you are trying to convince me to stay with dopd ;)
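
for reference, this is roughly how dopd is wired up today (a sketch from
memory; option names and paths differ a bit between drbd and heartbeat
versions):

    # /etc/ha.d/ha.cf -- have heartbeat run and authorize dopd
    respawn hacluster /usr/lib/heartbeat/dopd
    apiauth dopd gid=haclient uid=hacluster

    # drbd.conf -- a disconnected Primary calls out to dopd to outdate
    # the peer ("resource-and-stonith" would additionally freeze IO
    # until the handler returns)
    resource r0 {
      disk     { fencing resource-only; }
      handlers {
        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
      }
    }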

> > situation 2: "outdate needed or data jumps back in time"
> > 
> >     replication link breaks
> 
> That's the very first scenario I described!
> 
> >     primary keeps writing
> >         (which means secondary has now stale data)
> 
> First, it wouldn't keep writing in the described approach, but freeze,
> and only resume to write after it has been notified that the peer has
> been stopped.
> 
> >     primary crashes
> >     heartbeat promotes secondary to primary
> >     and goes online with stale data.
> 
> Second, even if Pacemaker would restart the secondary (which was stopped
> due to the failure), the secondary would be unable to promote as "the
> flag" would be set by default on start-up.
> 
> I really believe that the approach I described covers this.

and needs admin intervention to start the secondary again,
just because some switch reset.

> > variation: instead of primary crash, cluster crash.
> >     cluster reboot, replication link still broken.
> >     how do we prevent heartbeat from choosing the "wrong" node for promotion?
> 
> This scenario is indeed not perfectly handled by my approach as
> described: it does ensure that the "wrong" secondary doesn't get
> promoted, but it would also prevent _both_ sides from being promoted,
> which is not good.
> 
> First, the theoretical response to this is that replication link down
> plus crash of two nodes actually constitutes a triple failure, and thus
> not one we claim the cluster protects against. ;-) For some customers,
> manual intervention here would be acceptable.

dopd handles it, your proposal does not.
dopd is already there.
dopd is simpler.
dopd wins.

> But second, a possible solution is to write a persistent "I was primary"
> flag to the meta-data.

already there.

> On start, this would then set crm_master's
> preference to non-zero value (say, 1), which would allow the node to be
> promoted. This might be a tunable operation.

absolutely not.  for a real primary crash, the secondary was promoted.
cluster crash.  both have their "I was primary" flag set.
it's not that simple.

you must not focus on one scenario.  you need to keep them all in mind
when designing a replacement for dopd.
it is not that simple, after all, right?
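
(for the record, the mechanics you describe would presumably boil down to
the RA calling something like this -- a sketch, crm_master flags from
memory:

    # inside the drbd RA (illustrative):
    crm_master -l reboot -v 1    # announce "I may be promoted here"
    crm_master -l reboot -D      # or: withdraw that preference again

and, as the scenario above shows, a constant "1" on both nodes does not
let pacemaker pick the right one.)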

> > dopd handles both.
> > how does your proposal?
> > by stopping the secondary when the replication link broke?
> 
> That's what I explained, yes.
> 
> > but that must not happen. how could it then possibly resync, ever?
> 
> Pacemaker can be configured to restart it too, which would attempt a
> reconnect (or even attempt the reconnect periodically, if the RA would
> fail to start if unable to connect to the peer, but that might not even
> be needed - restarting it once and keeping it running is sufficient).

drbd attempts to reconnect on its own.
dopd does not need to restart it.
dopd wins.

> Further, I might wish to actually stop the secondary _to be able to move
> it to another node_ (which might be able to reconnect & resync).

The typical deployment with DRBD is still
two nodes with directly attached storage each.
if you are arguing for floating peers around SANs,
that is a rare special case.
if you are arguing for "cold spare" drbd nodes with DAS
(which is an even rarer special case), do you really think that getting
rid of dopd is worth a full sync every time we have a network hiccup?

> > > See above; it would _always_ come up with the assumption that "peer is
> > > dirty",
> > 
> > lets call it "peer may be more recent".
> 
> OK. I'm still calling it "the flag" because it's easier to type ;-)

and, btw, drbd already does so, and will call out to dopd if you try
to promote an unconnected Secondary -- unless it already knows that it
itself is outdated, which makes it refuse right away.
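
to illustrate, with drbd 8 and the output paraphrased:

    # on an unconnected Secondary
    drbdadm primary r0   # triggers the outdate-peer handler first;
                         # refused right away if we are already Outdated
    drbdadm dstate r0    # shows something like "Outdated/DUnknown"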

> > > and thus refuse to promote. No need to store anything on disk;
> > > it is the default assumption.
> > 
> > then you can never go online after cluster crash,
> > unless all drbd nodes come up _and_ can establish connection.
> 
> See above for one possible solution.

I'm just pointing out that you started out claiming
 "I actually think that dopd is the real hack, and drbd instead should
  listen to the notifications we provide, and infer the peer state by that
  means ..."
I accused you of handwaving, and you said no, it is all perfectly clear.

now, when it comes to filling in those dots,
you need to reconsider again and again,
while dopd is already there.
and even if it is a "hack", it does a pretty good job.

> Okay, now you're going to propose the following scenario:
> 
> - Primary N1 crashes
> - Secondary N2 gets promoted
> - Cluster crash
> - Replication link down
> - Both nodes N1+N2 up

good.
I see you are starting to look at the whole picture ;)

> With the extension I propose above, both sides would set the same master
> preference, while we'd obviously want N2 promoted, not N1. But then,
> dopd wouldn't help this. Instead of writing 1 though, they could use one
> of the generation counters (primary transitions seen?), which would be
> n+1 for N2 and cause N2 to be (correctly) promoted.

btw,
there are no generation counters anymore.
there is a reason for that: they are not reliable.
drbd 8 compares a history of generation UUIDs.  while still not
perfect, that is much more reliable than generation counters.
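
(if you want to look at those UUIDs, this should work with drbd 8; exact
option names from memory:

    drbdadm get-gi r0    # terse: current, bitmap and two history UUIDs, plus flags
    drbdadm show-gi r0   # the same information, with explanatory text
)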

> (Of course I can construct a sequence of failures which would break even
> that, to which I'd reply that they really should simply use the same
> bonded interfaces for both their cluster traffic _and_ the replication,
> to completely avoid this problem ;-)

it does not need to be perfect.
it just needs to be as good as "the hack" dopd.
and preferably as simple.

> > no availability does, strictly speaking, match the problem description
> > "don't go online with stale data."
> > but it is not exactly what we want.
> 
> Depends on the scenario, but I think my above scenario works fine.
> 
> > I need the ability to store on disk that "_I_ am ahead of peer"
> > if I know for sure, so I can be promoted after crash/reboot.
> 
> Ok, I see your point, and that is I think what I proposed above.
> 
> > > I think so. At least my proposal is becoming more concise, which is good
> > > for review ;-)
> > this time you made a step backwards, as you seem to think that drbd does
> > not need to store any information about being outdated.
> 
> I actually still think this is so, yes.
> 
> > to again point out what problem we are trying to solve:
> >   whenever a secondary is about to be promoted,
> >   it needs to be "reasonably" certain that it has the most recent data,
> >   otherwise it would refuse.
> >   it does not matter whether the promotion attempt happens
> >   right after connection loss,
> >   or three and a half days, two cluster crashes
> >   and some node reboots later.
> 
> Right, and agreed.
> 
> >   as it is almost impossible to be certain that you have the most recent
> >   data, while it is very well possible to know that you are outdated (as
> >   that does not change without a resync),
> >   the dopd logic revolves around "outdate".
> 
> No disagreement there. I'm not saying dopd doesn't solve the problem.
> I'm just trying to find a solution which solves it without needing dopd,
> but which can instead leverage that Pacemaker is quite a bit smarter
> than heartbeat-v1; hence my proposal above.
> 
> > we thoroughly thought about how to solve it.
> > the result was dopd.
> > any solution to replace dopd must at least
> > cover as many scenarios, and cover them as well as dopd does.
> 
> Of course. I'm not disagreeing.
> 
> > the best way to replace dopd would be to find a more "high level"
> > mechanism for a surviving Primary to actively signal a surviving
> > Secondary to outdate itself (or get feedback why that was not possible),
> 
> Restarting it does that in my proposal (it would possibly come back up
> with 'the flag' set by default) - and it does get active feedback that
> the peer was stopped.
> 
> Indeed, it would NOT get feedback if that was not possible - that is a
> new requirement. But that's impossible (okay, okay, "unlikely"), as
> failure to stop would trigger the recovery escalation and eventually
> stonith the former peer. Of course, that could fail _too_, but then I
> think we've arrived at so many failures that "freeze" is an acceptable
> response.
> 
> > and for a not-connected Secondary which is about to be promoted
> > to ask its peer to outdate itself, which may be refused as it may be
> > primary.
> 
> I don't see the need for this second requirement. First, a not-connected
> secondary in my example would never promote (unless it was primary
> before, with the extension); second, if a primary exists, the cluster
> would never promote a second one (that's protected against).

we have two unconnected secondaries.
for this example, let's even assume they are equivalent,
i.e. have the very same data generation.
we promote one of them, so the other needs to be outdated.
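
on the drbd level, "outdating" the other one amounts to something like

    drbdadm outdate r0   # on the node that must not be promoted;
                         # a later resync clears the Outdated state again

which is essentially what dopd asks the peer to do.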

> > if you want to solve it differently,
> > it becomes a real mess of fragile complex hackwork and assumptions.
> 
> Please, don't call something that I thought a lot about "a mess and
> hackwork" - that is sort of too easy to take personally.

I don't mean it that way, you know that.
but, while we are at that level:
 "you said hack first" ;)

> It is sufficient to point out why it doesn't work ;-) I'm not saying I
> got it right; I just _think_ I got it right.

and you never say what you think ;^)

> And that doesn't mean that I'm disagreeing that dopd also solves it. But
> I thought the intention was to try and get rid of it.

the intention is to replace it with something as effective,
hopefully less dependent on low level infrastructure,
and possibly simpler.
so far, I think dopd is still simpler and more effective.

> My goal is to make the entire setup less complex, which means cutting
> out as much as possible with the intent of making it less fragile and
> easier to set up.

I fully agree to that goal.

> (And of course, to find out if there are things which are missing in the
> m/s concept which might need to be introduced to achieve the former.)
> 
> > if it is possible to express
> > "hello crm, something bad has happened,
> > would you please notify my other
> > clones/masters/slaves/whatever the term is
> > that I am still alive, and about to continue to change the data,
> > and tell me when that is done, so I can unfreeze",
> > that would take it half way.
> 
> That is quite easily doable.
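
just so we mean the same thing: I imagine the RA side of such a
notification looking roughly like this (a sketch; the
OCF_RESKEY_CRM_meta_notify_* variable names are from memory, check the
pacemaker documentation):

    # drbd RA, notify action (illustrative):
    drbd_notify() {
        local n_type="$OCF_RESKEY_CRM_meta_notify_type"       # pre | post
        local n_op="$OCF_RESKEY_CRM_meta_notify_operation"    # start|stop|promote|demote
        case "$n_type-$n_op" in
        post-stop)
            # the peer instance has been stopped;
            # a frozen Primary could resume IO now
            ;;
        esac
        return 0   # OCF_SUCCESS
    }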
> 
> > to fully replace dopd, we'd need a way to communicate back.
> > 
> > if it is possible for the master to tell the crm that the slave has
> > failed and should therefore be stopped, then this should not be that
> > difficult.
> 
> Right, exactly.
> 
> > alternatively, if we can put some state into the cib
> > (in current drbd e.g. the "data generation uuids"),
> > that might work as well.
> 
> Yes, that's somewhat what I proposed to do to solve the N1 versus N2
> scenario.

that is something I mentioned as a result of a linbit internal
discussion, in my mail dated Sat, 13 Sep 2008 03:01:11 +0200.
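
(putting that state into the cib could be as simple as the RA doing
something like

    # illustrative only; option spellings differ between pacemaker versions
    crm_attribute -t nodes -U $(uname -n) \
        -n drbd-r0-current-uuid -v "$current_uuid"

i.e. a persistent per-node attribute, from which promotion scores could
then be derived.)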

> > "failing", i.e. stopping, the slave
> > just because a replication link hickup
> > is no solution.
> 
> Depends, as above - it could get restarted (possibly elsewhere) and then
> try to reconnect, but the primary would be _sure_ that the secondary
> will not be promoted.
> 
> To be frank, the _real_ problem we're solving here is that drbd and the
> cluster layer are (or at least can be) disconnected, and trying to
> figure out how much of that is needed and how to achieve it.

that is good. that's why I'm here, writing back,
so you can sharpen your proposal against my rejections
until it is sharp enough.

> If all meta-data communication were taken out of drbd's data channel
> but instead routed through user-space and the cluster layer, none of
> this could happen, and the in-kernel implementation could probably be
> quite simplified. But that strikes me as quite a stretch goal for the time
> being.

you can have that: don't use drbd then, use md.
but there was a reason that you did not.
md is not really suitable for that purpose.
right.

> The easiest way to achieve this in 99% of all real-world cases with the
> current code probably is to setup a bonded interface and route both the
> cluster traffic and drbd across it. The likelihood of the two diverging
> then approaches epsilon. I sometimes wonder if that would not be the
> ultimately smarter thing to recommend than trying to implement complex
> code. ;-)

we do recommend using a bonded replication link.
and, dopd is simple, and it is implemented.
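
for reference, a minimal bonded link is not much work either (details are
distro specific; roughly):

    # /etc/modprobe.conf (or modprobe.d/) -- illustrative
    alias bond0 bonding
    options bonding mode=active-backup miimon=100

    # bring it up and enslave the two replication NICs
    ifconfig bond0 10.0.42.1 netmask 255.255.255.0 up
    ifenslave bond0 eth1 eth2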

please, don't give up yet.
for starting out with three dots,
your proposal is amazingly good already.
it just needs to be simplified some more,
and maybe get rid of the spurious restarts it requires.
think about it some more, and I'll finally give in.

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed


