[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Lars Ellenberg lars.ellenberg at linbit.com
Sun Sep 14 21:02:20 CEST 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Sun, Sep 14, 2008 at 06:10:50PM +0200, Lars Marowsky-Bree wrote:

<snipped some parts where we most likely agree,
 or where it is "only opinion"/>

> Anyway, I can actually point out a few cases: the restart might happen
> on another node, which dopd can't achieve.  The (possibly forced)
> restart might clear up some internal state within drbd (or the whole
> system) which might allow it to reconnect.
> 
> The former is my pet project of floating peers, but the latter is not
> that unlikely, either. Many errors are transient and are solved by a
> restart (if not even a reboot).
> 
> > The typical deployment with DRBD is still
> > two nodes with direct attached storage each.
> 
> Yes. I understand that. I think it likely accounts for 95% of all our
> customer deployments, if not 98%. This is partly because drbd is hard to
> extend to other scenarios right now though, not because there would not
> be a need for it.
> 
> > if you are arguing floating peers around SANs,
> > then that is a rare special case.
> 
> Yes, of course I know. I'm not sure it would stay as rare as that,
> though. But it would enable drbd to be used in rather interesting and
> more ... lucrative deployments, too.

while drbd is able to enhance SANs,
it is actually out there to replace them ;)

> > if you are arguing "cold spare" drbd nodes with DAS,
> > (which is an even rarer special case) do you really think that getting
> > rid of dopd was worth a full sync every time we have a network hiccup?
> 
> Well, first, this is not that rare in demand, though of course not
> easily possible today. Some customers with bladecenters would quite like
> this. 
> 
> Second, no, of course not every time. Pacemaker could be configured to
> try up to, say, 3 local restarts within 24 hours before doing a
> fail-over to another node - and suddenly the deployment sounds a bit
> more attractive ...
> 
> For these last two "rare" scenarios, the fail-over might not just be
> caused by the replication link going down, but also by the node losing
> its storage or the connection to it, in which case a fail-over is quite
> desirable.
> 
> > > See above for one possible solution.
> > 
> > I'm just pointing out that you started out claiming
> >  "I actually think that dopd is the real hack, and drbd instead should
> >   listen to the notifications we provide, and infer the peer state by that
> >   means ..."
> > I accused you of handwaving, and you said no, it is all perfectly clear.
> 
> Well, I admit to having been wrong on the "perfectly clear". I thought
> it was clear, and the elaborate discussion is quite helpful.
> 
> And calling dopd the real hack might have been offensive, for which I
> apologize. But I'd still like to understand if we could do without it, and
> possibly even achieve more.
> 
> > now, when it comes to filling in those dots,
> > you need to reconsider again and again.
> 
> Right. That tends to happen during a dialogue - it would make me look
> rather silly if I ignored new insights, wouldn't it? ;-)
> 
> > while dopd is already there.
> 
> Yes, it's there for heartbeat, but it is not there at all for openAIS,

I have been told someone is working on this, though.
and it's not too difficult.
but yes, as I mentioned earlier in this thread,
I'm happy to replace the method of communication (dopd)
with something more "high level", like a combination of "crm fail"
commands and notification events, if it does not get too,
how shall I put it, "artificial".
but we are getting closer already.

> and I don't think it works well with the m/s resources (which I'm also
> trying to improve here). So I'm looking at how we could achieve this for
> the future.
> 
> (Personally, I consider heartbeat as the cluster comm layer as dead as
> you think drbd-0.7 to be; it won't be around on SLE11, for example. So
> we really need to find out how we could merge the two well.)

> This would, as it relies "only" on Pacemaker, continue working on top of
> the heartbeat comm-layer of course too.
> 
> > and even if it is a "hack", it does a pretty good job.
> 
> True.
> 
> > good.
> > I see you start to look at the whole picture ;)
> 
> I'm always happy to learn ;-)
> 
> > btw.
> > there are no generation counters anymore.
> > there is a reason for that: they are not reliable.
> > drbd8 does compare a history of generation UUIDs.  while still not
> > perfect, it is much more reliable than generation counters.
> 
> Good to know.
> 
> But, even if this is somewhat unrelated to the "outdated" discussion,
> how would you translate this to the "N1+N2" (ie, two former primaries)
> recovery scenario?

what we currently do:

- Primary N1 crashes
- Secondary N2 gets promoted
         * at which point it knows it is ahead of N1,
           and stores that fact in its meta data *
- Cluster crash
- Replication link down
- Both nodes N1+N2 up

- N1 knows it is coming up after a primary crash,
  so if asked to be promoted,
  it first tries (via dopd) to outdate N2
- N2 knows it is coming up after a primary crash,
  and that it has newer data than N1

in addition, because of how wfc-timeout and degr-wfc-timeout work,
the drbd initscript would use wfc-timeout on N1 (which is "forever" by
default), but degr-wfc-timeout on N2 (which is a finite time by default)

so yes, N2 would be promoted, because
 * it knows that it is ahead of N1,
 * N1 would wait for the connection (much longer) before even continuing
   the boot process.
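
for reference, those knobs live in the startup section of drbd.conf;
the values below are only an example, not a recommendation:

  resource r0 {
    startup {
      wfc-timeout       0;    # N1's case: 0 means wait "forever" for the peer
      degr-wfc-timeout 120;   # N2's case: it was degraded before the crash,
                              # so it only waits this long before booting on
    }
  }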

in case both wfc-timeouts are set to "very short",
then to deal with this situation correctly,
a node (N2) that knows it is ahead of the other must refuse to be "outdated"
by the node it knows to be outdated (N1), even if that node (N1) does not
yet know it is outdated, and even if N2 is not currently primary
(i.e. before being promoted).
I believe we already implement that correctly, but I need to double check.
  
> Compared to heartbeat v1, at least Pacemaker would
> allow you to declare your preference for becoming primary, but that
> needs to be numeric (higher integer wins). Maybe worth a separate
> thread, but I could see them pushing their "UUID" into the CIB on start,
> and then (in post-start notification) the side which finds the other
> side's UUID not in its history would declare itself unable to become
> master. (Similar to what you discussed with Andreas Kurz, but just
> applies to "start" and not every monitor.)
> 
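
fwiw, a very rough sketch of what you describe, as it could look in the
RA (untested; the attribute name, the comparison, and the exact tool
options are all assumptions on my part, and vary with the Pacemaker
version):

  # in the RA's "start" action: publish our generation identifiers
  publish_gi() {
      attrd_updater -n "drbd-gi-${OCF_RESOURCE_INSTANCE}" \
                    -v "$(drbdadm get-gi r0)"
  }

  # in the post-start notification: compare the peer's published current
  # UUID against our own history; if we don't know it, refuse mastership
  check_peer_gi() {
      peer_node=$1          # peer uname, e.g. taken from the notify environment
      peer_gi=$(crm_attribute -t status -N "$peer_node" \
                -n "drbd-gi-${OCF_RESOURCE_INSTANCE}" -G -q)
      peer_current_uuid=${peer_gi%%:*}
      if ! drbdadm get-gi r0 | grep -q -- "$peer_current_uuid"; then
          crm_master -D    # drop our master preference: never promote here
      fi
  }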
> > > (Of course I can construct a sequence of failures which would break even
> > > that, to which I'd reply that they really should simply use the same
> > > bonded interfaces for both their cluster traffic _and_ the replication,
> > > to completely avoid this problem ;-)
> > it does not need to be perfect.
> > it just needs to be as good as "the hack" dopd.
> > and preferably as simple.
> 
> To be honest, simply using the same links would be simpler. 

then we are back to "true" split brain scenarios.
and discussing quorum in a two-node cluster.

sure that would be simpler.
but it would cause either no availability
or data divergence every time that link breaks.

> (Tangent: we have a similar issue for example with OCFS2 and the DLM,
> and trying to tell the cluster that A can no longer talk to C is an icky
> problem. There's no m/s state as with drbd, but the topology complexity
> of N>2 makes up for this :-/)
> 
> On the other hand, that wouldn't clear up the cases where the
> replication link is down because of some internal state hiccup, which
> the approach outlined might help with.
> 
> > > I don't see the need for this second requirement. First, a not-connected
> > > secondary in my example would never promote (unless it was primary
> > > before, with the extension); second, if a primary exists, the cluster
> > > would never promote a second one (that's protected against).
> > we have two unconnected secondaries.
> > for this example, let's even assume they are equivalent,
> > have the very same data generation.
> > we promote one of them. so the other needs to be outdated.
> 
> "No problem" ;-)
> 
> Pacemaker will deliver a "I am about to promote your peer" notification,
> promote the peer, and then a "I just promoted the peer" notification.
> So, it can use that notification to update its knowledge that the peer
> is now ahead of it.

ok.
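
in RA terms that would be something like the following in the notify
action (a sketch only; I'm going from memory on the notification
environment variables, and "r0" is just an example resource name):

  drbd_notify() {
      n_type="$OCF_RESKEY_CRM_meta_notify_type"        # "pre" or "post"
      n_op="$OCF_RESKEY_CRM_meta_notify_operation"     # start/stop/promote/...

      if [ "$n_type" = "post" ] && [ "$n_op" = "promote" ]; then
          # someone has just been promoted; if it was not us, the peer
          # is now ahead of us, so record that by outdating our own data
          case " $OCF_RESKEY_CRM_meta_notify_promote_uname " in
              *" $(uname -n) "*) : ;;      # we were promoted ourselves
              *) drbdadm outdate r0 ;;
          esac
      fi
      return 0    # OCF_SUCCESS
  }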

> >  "you said hack first" ;)
> 
> Whom are you calling a hack!?!?!?! ;-)
> 
> > > My goal is to make the entire setup less complex, which means cutting
> > > out as much as possible with the intent of making it less fragile and
> > > easier to setup.
> > I fully agree to that goal.
> 
> That's good, so at least we can figure it out from here ... And yes,
> dopd is simple.
> 
> > > To be frank, the _real_ problem we're solving here is that drbd and the
> > > cluster layer are (or at least can be) disconnected, and trying to
> > > figure out how much of that is needed and how to achieve it.
> > that is good. that's why I'm here, writing back,
> > so you can sharpen your proposal on my rejection
> > until it's sharp enough.
> 
> It provides a great excuse from more boring work too. ;-)

you know, I have a paper to write...
and I have been avoiding it for weeks now.

> > > If all meta-data communication were taken out of drbd's data channel
> > > but instead routed through user-space and the cluster layer, none of
> > > this could happen, and the in-kernel implementation probably quite
> > > simplified. But that strikes me as quite a stretch goal for the time
> > > being.
> > you can have that. don't use drbd then. use md.
> > but there was a reason that you did not.
> > that's not really suitable for that purpose.
> > right.
> 
> Right. But I also know you're looking at the future with the new
> replication framework etc, and maybe we might want to reconsider this.
> After all, we now _do_ have a "standard" for reliable cluster comms,
> called openAIS, works between Oracle/RHT/GFS2/OCFS2/Novell/etc, we have
> more powerful network block devices (iSCSI ...), so it might make sense
> to leverage it, combine it with the knowledge of drbd's state machine
> for replication, and make drbd-9 ;-) But yes, that's clearly longer
> reach than what we're trying to discuss here. It always helps to look
> towards the future though.

absolutely.  I'm going to extend drbd UUIDs to something I call
"monotonic storage time", lacking a better term. as long as that can be
communicated somehow, each node knows exactly whether it lags behind, and
by how much.  It's all in that unwritten paper ;)

> > please, don't give up yet.
> 
> I've been with this project for almost 9 years. I'm too stubborn to give
> up ;-)

I thought so :)

> > for starting out with three dots,
> > your proposal is amazingly good already.
> > it just needs to simplify some more,
> 
> More simple is always good.
> 
> > and maybe get rid of required spurious restarts.
> 
> The restart of the secondary is not just "spurious" though. It might
> actually help "fix" (or at least "reset") things. Restarts are amazingly
> simple and effective.

hmm.

> For example, if the link broke due to some OS/module issue, the stop
> might fail, and the node would actually get fenced, and reconnect
> "fine". Or the stop might succeed, and the reinitialization on 'start'
> is sufficient to clear things up. 
> 
> This might seem like "voodoo" and hand waving, but Gray&Reuter quote a
> ratio between soft/transient to hard errors for software of about 100:1
> - that is, restarting solves a _lot_ of problems. (Hence why STONITH
> happens to be so effective in practice, while it is crude in theory.)
> 
> It also moves the policy decision to the, well, Policy Engine, where a
> number of other recovery actions could be triggered - including those
> "rare cases".

ok, you modify "your" ocf drbd RA as a proof of concept?

according to your proposal,
on the drbd side,
we'd only need to replace the outdate-peer handler:
instead of "drbd-peer-outdater", use "some other program that calls
crm fail appropriately and blocks until confirmed".

that's just an entry in the config file
(and someone needs to write that script).
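
i.e. something along these lines in drbd.conf (the script name is
made up, of course):

  resource r0 {
    handlers {
      # instead of the dopd client:
      #   outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
      # call a script that asks the CRM to fail the peer's instance
      # and blocks until that is confirmed
      outdate-peer "/usr/local/sbin/crm-fence-drbd-peer";
    }
  }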

later we may make the script's job easier by
extending the logic in the drbd module,
to better support asynchronous confirmation.
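
a first, completely untested sketch of that script (the handler
environment variables, the crm_resource options, and the exit code
convention all need double checking; the m/s resource name is made up):

  #!/bin/sh
  # crm-fence-drbd-peer -- sketch of an outdate-peer handler that asks
  # the CRM to fail the peer's drbd instance, then waits.

  RES=${DRBD_RESOURCE:-r0}             # drbd exports DRBD_RESOURCE to handlers
  PEER=${1:?usage: $0 peer-uname}      # peer passed on the command line here
  MS_RES="ms-drbd-${RES}"              # name of the m/s resource in the CIB

  # tell the CRM that the peer's instance has failed,
  # so the policy engine can restart (or fence) it
  crm_resource -F -r "$MS_RES" -H "$PEER" || exit 1

  # "block until confirmed": crudely poll until the peer's instance is
  # no longer reported as running.  this is exactly the part where better
  # support for asynchronous confirmation in the drbd module would help.
  i=0
  while [ $i -lt 60 ]; do
      if ! crm_resource -W -r "$MS_RES" 2>/dev/null | grep -q "$PEER"; then
          # 4 is what drbd-peer-outdater returns for "peer outdated";
          # whether that is the right answer here remains to be seen
          exit 4
      fi
      sleep 1
      i=$((i + 1))
  done
  exit 5    # could not confirm in time; give up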

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH


