[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Mon Sep 15 00:56:22 CEST 2008

On 2008-09-14T21:02:20, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> <snipped some parts where we most likely agree,
>  or where it is "only opinion"/>

Yes, thanks, good idea, time to shrink the discussion down some.

> > Yes, of course I know. I'm not sure it would stay as rare as that,
> > though. But it would enable drbd to be used in rather interesting and
> > more ... lucrative deployments, too.
> while drbd is able to enhance SANs,
> it is actually out there to replace them ;)

I know ;-) But using it to replicate between SANs (or having standby
systems for the replica with DAS) brings us closer to the point where we
can deliver "split-site" clusters.

> > Yes, it's there for heartbeat, but it is not there at all for openAIS,
> I have been told someone is working on this, though.

Oh?

> and it's not too difficult.

Right, as dopd mostly uses only simple message exchange, it wouldn't be.
Good time though to reassess.

> what we currently do?
> 
> - Primary N1 crashes
> - Secondary N2 gets promoted
>          * at which point it knows it is ahead of N1,
>            and stores that even in meta data *
> - Cluster crash
> - Replication link down
> - Both nodes N1+N2 up
> 
> - N1 does know it comes up after primary crash,
>   so if asked to be promoted,
>   it first tries (via dopd) to outdate N2
> - N2 does know it comes up after primary crash,
>   and that it has newer data than N1
> 
> in addition, because of how wfc-timeout and degr-wfc-timeout work,
> the drbd initscript would use wfc-timeout on N1 (which is "forever" by
> default), but degr-wfc-timeout on N2 (which is a finite time by default)
> 
> so yes, N2 would be promoted, because
>  * it knows that it is ahead of N1,
>  * N1 would wait for the connection (much longer) before even continuing
>    the boot process.
> 
> in case the wfc-timeouts are both set to "very short",
> to correctly deal with this situation,
> a node (N2) that knows it is ahead of the other must refuse to be "outdated"
> by the known outdated node (N1), even if that node (N1) does not know it yet,
> and even if not currently primary (N2, before being promoted).
> we should correctly implement that, but I need to double check.

Not exactly trivial either, but then, it is not exactly a trivial
failure sequence. Thanks for the explanation.
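
To make the refusal rule concrete, here is a minimal Python sketch. It is
only an illustration of the logic as I understand it from the description
above, not how drbd or dopd implement it: the Node class, the
knows_it_is_ahead flag, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    # Hypothetical flag: set when the node was promoted after its peer
    # crashed, i.e. it knows its data is ahead of the peer's.
    knows_it_is_ahead: bool = False
    outdated: bool = False


def request_outdate(requester: Node, target: Node) -> bool:
    """Requester asks target to mark itself outdated; True if accepted."""
    if target.knows_it_is_ahead:
        # Target has newer data: refuse, whatever the requester believes.
        return False
    target.outdated = True
    return True


def try_promote(node: Node, peer: Node) -> bool:
    """Promote only if not outdated and the peer accepted being outdated."""
    if node.outdated:
        return False
    return request_outdate(node, peer)


n1 = Node("N1")                          # crashed primary, stale data
n2 = Node("N2", knows_it_is_ahead=True)  # was promoted after N1 crashed

n1_promoted = try_promote(n1, n2)  # N2 refuses, so N1 stays secondary
n2_promoted = try_promote(n2, n1)  # N1 accepts, so N2 may be promoted
```

The point of the sketch is that the refusal depends only on what the
target knows, so it holds even when N1 asks first and N2 is not yet
primary.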

> > To be honest, simply using the same links would be simpler. 
> 
> then we are back to "true" split brain scenarios.
> and discussing quorum in a two-node cluster.
> 
> sure that would be simpler.
> but it would cause either no-availability
> or data divergence every time that link breaks.

Right; note how my proposal works for "true" split-brain too, of course.

> > It provides a great excuse from more boring work too. ;-)
> you know, I have a paper to write...
> and have kept avoiding that for weeks now.

I only have a few more paragraphs to write for my part-time studies
today.  Anything else then suddenly becomes so much more attractive. I
wonder how many open source projects harness the power of
procrastination.

> absolutely.  I'm going to extend drbd UUIDs to something I call
> "monotonic storage time" lacking a better term. as long as that can be
> communicated somehow, each node knows exactly whether it lags behind, and
> how much.  It's all in that unwritten paper ;)

Sounds interesting; is that the Linux Kongress paper?

> > The restart of the secondary is not just "spurious" though. It might
> > actually help "fix" (or at least "reset") things. Restarts are amazingly
> > simple and effective.
> hmm.

You've got to admit that it's a valid point ;-)

> ok, you modify "your" ocf drbd RA as a proof of concept?

Yes, I can do that.

> according to your proposal,
> on the drbd part,
> we'd only need to replace the outdate-peer-handler
> from "drbd-peer-outdater" to "some other program calling crm fail
> appropriately and block until confirmed".

Does drbd on the primary side indeed freeze IO until that script
returns?

And I think the need for the secondary to not allow itself to be
promoted as I described might need to be implemented in drbd. Hrm. I
think I could work around this by setting the "outdated" flag if
stopped while disconnected ...
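
Roughly what I have in mind for that work-around, as a Python sketch. This
is an assumption-laden illustration, not RA code: the Resource class and
its fields are hypothetical stand-ins for whatever state the RA can see.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Hypothetical model of the state visible to the resource agent.
    role: str = "Secondary"
    connected: bool = True
    outdated: bool = False
    running: bool = True


def stop(res: Resource) -> None:
    """Stop the resource; a disconnected secondary is flagged outdated."""
    if res.role == "Secondary" and not res.connected:
        # It cannot know what the peer wrote since the disconnect.
        res.outdated = True
    res.running = False


def promote(res: Resource) -> bool:
    """Refuse promotion while the outdated flag is set."""
    if res.outdated:
        return False
    res.role = "Primary"
    return True


stopped_while_cut_off = Resource(connected=False)
stop(stopped_while_cut_off)
cut_off_promoted = promote(stopped_while_cut_off)

stopped_while_connected = Resource(connected=True)
stop(stopped_while_connected)
connected_promoted = promote(stopped_while_connected)
```

Clearing the flag again after a successful resync is left out here.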

> that's just an entry in the config file
> (and someone needs to write that script).

That script should be easy too; not pretty, but easy ...

> later we may make it easier for the script by
> extending the logic in the drbd module,
> to make it easier for asynchronous confirmation.

I'd probably make the script block and then have the notification signal
it to continue.
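
As a sketch of that block-then-signal pattern in Python, assuming the
failure request and the notification can be modelled as a callback and a
method call. BlockingHandler, request_fail, and the two result strings are
all hypothetical; in particular the results are placeholders and not the
exit codes drbd expects from its handler.

```python
import threading

# Placeholder results, NOT drbd's real handler exit codes.
CONFIRMED = "confirmed"
TIMED_OUT = "timed-out"


class BlockingHandler:
    """Ask the cluster to fail the peer, then block until notified."""

    def __init__(self, request_fail):
        # request_fail: hypothetical callable that asks the cluster
        # manager to fail the peer's resource.
        self._request_fail = request_fail
        self._confirmed = threading.Event()

    def notify(self):
        """Called from the notification path once the peer is handled."""
        self._confirmed.set()

    def run(self, timeout=None):
        """Send the request and block until notify() or the timeout."""
        self._request_fail()
        if self._confirmed.wait(timeout):
            return CONFIRMED
        return TIMED_OUT


def demo():
    handler = BlockingHandler(request_fail=lambda: None)
    # Simulate the notification arriving shortly after the request.
    threading.Timer(0.05, handler.notify).start()
    return handler.run(timeout=2.0)
```

Whether a timeout makes sense at all is debatable; with timeout=None the
handler simply blocks until the notification arrives.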

Ok. I'll try to get to this this week, but I might not make it until
Wednesday or so. (I'm doing a half-week and thus need to cram a bit.) If
someone else wants to give it a shot before that, be my guest ;-)

Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde