[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Lars Marowsky-Bree lmb at suse.de
Sun Sep 14 18:10:50 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 2008-09-14T16:31:36, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> > I meant "eventually", ie sometime the admin is going to fix it and then
> > it'll be able to reconnect and resync, and clear the flag.
> admin intervention required for a network hiccup.
> not an option.

Depends on how the cluster is configured. A restart can happen
automatically, too. And, with floating peers implemented right, possibly
on another node - which some customers might like.

> > Well, of course I'm describing the target scenario, not the current one.
> > I entirely agree that that is possible right now.
> sure. but we have dopd. and it covers this.
> master/slave notifications alone, as was your original proposal,
> certainly cannot, as you meanwhile noticed.
> your current, combined proposal involving the notifications for some
> part, and calling out to "fail" a node, i.e.  stop the secondary because
> of a network hiccup, is worse than dopd.

I'm not sure it is. It achieves what dopd does w/o needing dopd, but by
interacting with the cluster layer only and, I think, by simplifying
drbd's logic - that's a win.

And yes, for the cluster to notice that something needs to be done about
the replication link going down, a call-out of some form is _required_.
That's internal state of drbd the cluster naturally doesn't have access
to, so it must be communicated somehow. However, dopd still remains
internal and unknown to the cluster; so the cluster's policy system
can't help with the recovery. Exposing this might have some charm,
described further below.

The notifications provide each clone (or m/s) instance with the
cluster's state about the peers and intended/completed state changes, so
I think those are useful.
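
(For reference: that only works if notifications are switched on for the
m/s resource; the RA then sees the cluster's view in its notify
environment. A rough, illustrative sketch - the names and the exact
variable set are from memory, so treat them as approximate:)

    # enable notifications on the master/slave resource
    # (crm shell syntax; resource names purely illustrative)
    crm configure ms ms_drbd0 drbd0 \
        meta master-max=1 clone-max=2 notify=true

    # the RA's "notify" action then receives the cluster's view as
    # OCF_RESKEY_CRM_meta_notify_* environment variables, e.g.:
    #   OCF_RESKEY_CRM_meta_notify_type           pre / post
    #   OCF_RESKEY_CRM_meta_notify_operation      start / stop / promote / demote
    #   OCF_RESKEY_CRM_meta_notify_promote_uname  node(s) about to be promoted
    #   OCF_RESKEY_CRM_meta_notify_master_uname   node(s) currently master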

> you try to convince me to stay with dopd ;)

Well, yes; that is one possible result of exploring the other
alternatives - at least then we'll all agree on and understand why that
is the case. Or we might even identify scenarios where dopd is needed,
and others where a different approach is recommended ...

> > Second, even if Pacemaker would restart the secondary (which was stopped
> > due to the failure), the secondary would be unable to promote as "the
> > flag" would be set by default on start-up.
> > 
> > I really believe that the approach I described covers this.
> 
> and needs admin intervention to start the secondary again,
> just because some switch reset.

How so? In fact, the default response by Pacemaker to a failed resource
is to _restart_ it. No admin intervention required. 

But that's tunable; it could be set to restart it only N times within M
seconds, or to fail over to a different node, etc. - that strikes me as
powerful enough.

> > First, the theoretical response to this is that replication link down
> > plus crash of two nodes actually constitutes a triple failure, and thus
> > not one we claim the cluster protects against. ;-) For some customers,
> > manual intervention here would be acceptable.
> 
> dopd handles it, your proposal does not.
> dopd is already there.
> dopd is simpler.
> dopd wins.

I disagree. I was merely pointing out the fact that we can always
construct failure sequences which are not satisfactorily solved. 

For example, in your scenario, if after the cluster crash only the old
secondary comes back, I am _sure_ there is someone out there who'd
rather continue serving data with possibly a few transactions missing
than not serve at all - which would then require an admin to step in and
clear the outdated flag.

(And no, I have no answer to this case ;-)

> > But second, a possible solution is to write a persistent "I was primary"
> > flag to the meta-data.
> already there.

Perfect ;-) Then nothing further is needed.

> > On start, this would then set crm_master's
> > preference to non-zero value (say, 1), which would allow the node to be
> > promoted. This might be a tunable operation.
> absolutely not.  for a real primary crash, the secondary was promoted.
> cluster crash.  both have their "I was primary" flag set.
> it's not that simple.

I know that. As I pointed out later, describing _exactly this
scenario_, they could instead use the "number of primary transitions
seen" (acknowledging the shortcomings of generation counters, though
they would be quite reasonable here), which would make Pacemaker prefer
the more "recent" node.

But actually, in this case, neither side would have the "outdated" flag
set either, so we're actually discussing something not quite related to
dopd anyway, aren't we?
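
(For completeness, the start-time half of that "preference 1" idea would
only be a couple of lines in the RA - a sketch, nothing more; the
flag-reading helper is invented, crm_master is the usual promotion-score
wrapper:)

    # hedged sketch for the RA "start" action
    if drbd_was_primary r0; then        # invented helper: reads the
                                        # persistent "I was primary" flag
        crm_master -Q -l reboot -v 1    # non-zero score: allow promotion here
    else
        crm_master -l reboot -D         # no preference: don't promote this node
    fi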

> > > but that must not happen. how could it then possibly resync, ever?
> > Pacemaker can be configured to restart it too, which would attempt a
> > reconnect (or even attempt the reconnect periodically, if the RA would
> > fail to start if unable to connect to the peer, but that might not even
> > be needed - restarting it once and keeping it running is sufficient).
> 
> drbd attempts to reconnect on its own.

Exactly. Which is why it would do that after a restart, too.

> dopd does not need to restart it.
> dopd wins.

"restarting it" is merely a way of achieving the dopd functionality w/o
needing dopd. _Of course_ dopd can already do that. If you're going to
critique my proposal on the basis that it doesn't do more than dopd, of
course you're going to be right. ;-)

Anyway, I can actually point out a few cases: the restart might happen
on another node, which dopd can't achieve.  The (possibly forced)
restart might clear up some internal state within drbd (or the whole
system) which might allow it to reconnect.

The former is my pet project of floating peers, but the latter is not
that unlikely, either. Many errors are transient and are solved by a
restart (if not even a reboot).

> The typical deployment with DRBD is still
> two nodes with direct attached storage each.

Yes, I understand that. I think it likely accounts for 95% of all our
customer deployments, if not 98%. But that is partly because drbd is
currently hard to extend to other scenarios, not because there is no
need for them.

> if you are arguing floating peers around SANs,
> then that is a rare special case.

Yes, of course I know. I'm not sure it would stay as rare as that,
though. But it would enable drbd to be used in rather interesting and
more ... lucrative deployments, too.

> if you are arguing "cold spare" drbd nodes with DAS,
> (which is an even more rare special case) you really think that getting
> rid of dopd was worth a full sync every time we have a network hiccup?

Well, first, this is not that rare in demand, though of course not
easily possible today. Some customers with bladecenters would quite like
this. 

Second, no, of course not every time. Pacemaker could be configured to
try up to, say, 3 local restarts within 24 hours before doing a
fail-over to another node - and suddenly the deployment sounds a bit
more attractive ...
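
(In Pacemaker terms that is just resource meta-attributes; roughly, and
with purely illustrative names - the 24h window maps to failure-timeout:)

    # sketch only: up to 3 local restart attempts within 24 hours,
    # after that the resource is moved to another node
    crm configure primitive drbd0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=30s \
        meta migration-threshold=3 failure-timeout=86400s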

For these last two "rare" scenarios, the fail-over might not just be
caused by the replication link going down, but also by the node losing
its storage or the connection to it, in which case a fail-over is quite
desirable.

> > See above for one possible solution.
> 
> I'm just pointing out that you started out claiming
>  "I actually think that dopd is the real hack, and drbd instead should
>   listen to the notifications we provide, and infer the peer state by that
>   means ..."
> I accused you of handwaving, and you said no, it is all perfectly clear.

Well, I admit to having been wrong on the "perfectly clear". I thought
it was clear, and the elaborate discussion is quite helpful.

And calling dopd the real hack might have been offensive, for which I
apologize. But I'd still like to understand whether we could do without
it, and possibly even achieve more.

> now, when it comes to fill in those dots,
> you need to reconsider again and again.

Right. That tends to happen during a dialogue - it would make me look
rather silly if I ignored new insights, wouldn't it? ;-)

> while dopd is already there.

Yes, it's there for heartbeat, but it is not there at all for openAIS,
and I don't think it works well with the m/s resources (which I'm also
trying to improve here). So I'm looking at how we could achieve this for
the future.

(Personally, I consider heartbeat as the cluster comm layer as dead as
you think drbd-0.7 to be; it won't be around on SLE11, for example. So
we really need to find out how we could merge the two well.)

Since it relies "only" on Pacemaker, this would of course continue
working on top of the heartbeat comm layer, too.

> and even if it is a "hack", it does a pretty good job.

True.

> good.
> I see you start to look at the whole picture ;)

I'm always happy to learn ;-)

> btw.
> there are no generation counters anymore.
> there is a reason for that: they are not reliable.
> drbd8 does compare a history of generation UUIDs.  while still not
> perfect, it is much more reliable than generation counters.

Good to know.

But, even if this is somewhat unrelated to the "outdated" discussion,
how would you translate this to the "N1+N2" (i.e., two former primaries)
recovery scenario? Compared to heartbeat v1, at least Pacemaker would
allow you to declare your preference for becoming primary, but that
needs to be numeric (higher integer wins). Maybe worth a separate
thread, but I could see each instance pushing its "UUID" into the CIB on
start, and then (in the post-start notification) the side which finds
the other side's UUID not in its history would declare itself unable to
become master. (Similar to what you discussed with Andreas Kurz, but
applied to "start" only, not to every monitor.)

> > (Of course I can construct a sequence of failures which would break even
> > that, to which I'd reply that they really should simply use the same
> > bonded interfaces for both their cluster traffic _and_ the replication,
> > to completely avoid this problem ;-)
> it does not need to be perfect.
> it just needs to be as good as "the hack" dopd.
> and preferably as simple.

To be honest, simply using the same links would be simpler. 

(Tangent: we have a similar issue for example with OCFS2 and the DLM,
and trying to tell the cluster that A can no longer talk to C is an icky
problem. There's no m/s state as with drbd, but the topology complexity
of N>2 makes up for this :-/)

On the other hand, that wouldn't clear up the cases where the
replication link is down because of some internal state hiccup, which
the approach outlined might help with.

> > I don't see the need for this second requirement. First, a not-connected
> > secondary in my example would never promote (unless it was primary
> > before, with the extension); second, if a primary exists, the cluster
> > would never promote a second one (that's protected against).
> we have two unconnected secondaries.
> for this example, lets even assume they are equivalent,
> have the very same data generation.
> we promote one of them. so the other needs to be outdated.

"No problem" ;-)

Pacemaker will deliver an "I am about to promote your peer" notification,
promote the peer, and then a "I just promoted the peer" notification.
So, it can use that notification to update its knowledge that the peer
is now ahead of it.
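
(In RA pseudo-shell, roughly - the connection check is an invented
helper, the outdate call is plain drbd8 drbdadm:)

    # hedged sketch for the m/s RA's notify action
    if [ "$OCF_RESKEY_CRM_meta_notify_type" = "post" ] &&
       [ "$OCF_RESKEY_CRM_meta_notify_operation" = "promote" ] &&
       [ "$OCF_RESKEY_CRM_meta_notify_promote_uname" != "$(uname -n)" ]; then
        # the peer has just been promoted; if we cannot reach it, our data
        # may now be behind - mark it so we never promote stale data later
        if ! drbd_is_connected r0; then    # invented helper
            drbdadm outdate r0
        fi
    fi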

>  "you said hack first" ;)

Whom are you calling a hack!?!?!?! ;-)

> > My goal is to make the entire setup less complex, which means cutting
> > out as much as possible with the intent of making it less fragile and
> > easier to setup.
> I fully agree to that goal.

That's good, so at least we can figure it out from here ... And yes,
dopd is simple.

> > To be frank, the _real_ problem we're solving here is that drbd and the
> > cluster layer are (or at least can be) disconnected, and trying to
> > figure out how much of that is needed and how to achieve it.
> that is good. that's why I'm here, writing back,
> so you can sharpen your proposal on my rejection
> until it's sharp enough.

It provides a great excuse to get away from more boring work, too. ;-)

> > If all meta-data communication were taken out of drbd's data channel
> > but instead routed through user-space and the cluster layer, none of
> > this could happen, and the in-kernel implementation probably quite
> > simplified. But that strikes me as quite a stretch goal for the time
> > being.
> you can have that. don't use drbd then. use md.
> but there was a reason that you did not.
> that's not really suitable for that purpose.
> right.

Right. But I also know you're looking at the future with the new
replication framework etc, and maybe we want to reconsider this then.
After all, we now _do_ have a "standard" for reliable cluster comms,
called openAIS, which works across Oracle/RHT/GFS2/OCFS2/Novell/etc; we
have more powerful network block devices (iSCSI ...); so it might make
sense to leverage that, combine it with the knowledge of drbd's state
machine for replication, and make drbd-9 ;-) But yes, that's clearly a
longer reach than what we're trying to discuss here. It always helps to
look towards the future, though.

> please, don't give up yet.

I've been with this project for almost 9 years. I'm too stubborn to give
up ;-)

> for starting out with three dots,
> your proposal is amazingly good already.
> it just needs to simplify some more,

Simpler is always good.

> and maybe get rid of required spurious restarts.

The restart of the secondary is not just "spurious" though. It might
actually help "fix" (or at least "reset") things. Restarts are amazingly
simple and effective.

For example, if the link broke due to some OS/module issue, the stop
might fail, and the node would actually get fenced, and reconnect
"fine". Or the stop might succeed, and the reinitialization on 'start'
is sufficient to clear things up. 

This might seem like "voodoo" and hand-waving, but Gray & Reuter quote a
ratio of soft/transient to hard errors in software of about 100:1 -
that is, restarting solves a _lot_ of problems. (That is also why
STONITH happens to be so effective in practice, while it is crude in
theory.)

It also moves the policy decision to the, well, Policy Engine, where a
number of other recovery actions could be triggered - including those
"rare cases".


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



