[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Sat Sep 13 03:01:11 CEST 2008

On 2008-09-12T23:55:53, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> > When drbd loses the peer internally, but w/o us providing the
> > notification, it's either the replication link crashed, or fencing
> > failing or loss of quorum; anyway, you'd "outdate" yourself (and freeze
> > io) until this notification was provided (which of course needs to be
> > persistent across reboots).
> > 
> > Wouldn't that work?
> 
> that would prevent normal failover, no?

No. Normal fail-over will only occur after 'we' have demoted/stopped the
peer. The cluster manager is quite good at enforcing dependencies ;-)

> what we need is,
>  * on the "Secondary", "slave",
>    or whatever you want to call it,
>  * the signal of the peer, that says:
>    hey, I'm still alive, I'm still Primary,
>    and continue to modify the data set,
>    so you better keep out of the way.
> then we mark us as outdated.

Pacemaker/CRM doesn't send signals when nothing changed, so this would
be a weird thing for it to deliver. However, it _will_ tell you when
something changed, i.e., the logic simply needs to be turned around.

> I don't think that this can be mapped into
> multiple negation plus timeout logic effectively.

I don't think this needs a timeout.

> do you suggest that,
>  * on the Secondary
>  * we get no signal that the peer is not dead in no time,
>    and therefore don't mark ourself as not uptodate?
> uh?

On the secondary, until you get a signal that the peer is dead
(stopped/demoted), consider yourself "not eligible" to be promoted
(i.e., outdated).

More generally: on a primary, if the connection to the peer goes away,
set said flag & freeze IO until this signal/notification is delivered.

I believe that covers all of the cases. I may be wrong. We need a
whiteboard. I will make sure we have one in Prague! ;-)
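
In lieu of the whiteboard, the rule above can be sketched roughly as
follows. This is only a toy model to check the reasoning, not DRBD or
Pacemaker code; the class and all names in it (DrbdNode,
peer_confirmed_down, and so on) are made up for illustration, and the
persistence across reboots mentioned earlier is left out.

```python
from dataclasses import dataclass


@dataclass
class DrbdNode:
    """Hypothetical model of one node's view; not DRBD's real state."""

    role: str  # "primary" or "secondary"
    connected: bool = True  # replication link to the peer is up
    peer_confirmed_down: bool = False  # cluster said: peer stopped/demoted

    def on_connection_lost(self) -> None:
        # DRBD notices internally that the peer is gone.
        self.connected = False

    def on_peer_down_notification(self) -> None:
        # Delivered by the cluster manager once the peer has been
        # stopped, demoted, or shot.
        self.peer_confirmed_down = True

    @property
    def eligible_for_promotion(self) -> bool:
        # A secondary counts as outdated until told the peer is down.
        return self.role == "secondary" and self.peer_confirmed_down

    @property
    def io_frozen(self) -> bool:
        # A primary that lost its peer freezes IO until notified.
        return (
            self.role == "primary"
            and not self.connected
            and not self.peer_confirmed_down
        )


# Primary crashes: the secondary may only be promoted after the cluster
# has confirmed that the old primary is down.
secondary = DrbdNode(role="secondary")
secondary.on_connection_lost()
before = secondary.eligible_for_promotion
secondary.on_peer_down_notification()
after = secondary.eligible_for_promotion
```

Here `before` is False and `after` is True, which is the ordering the
cluster manager would enforce before asking for a promote.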

> situation 1:
> 
> 	primary crash.
> 	secondary has to take over,
> 	so it better not mark itself outdated.

No problem; we'll deliver a "peer is stopped" notification to the
secondary so it won't be outdated by the time we ask it to promote.

> situation 2:
> 
> 	replication link breaks
> 	primary wants to continue serving data.
> 	so secondary must mark itself outdated.
> 	otherwise on a later primary crash heartbeat would try to make
> 	it primary and succeed in going online with stale data.

Right. The logic above would protect the data, but if only the
replication link fails, nothing gets stopped or demoted, so no
notification is ever delivered: the primary stays frozen and the
secondary stays outdated. Not good, obviously. Indeed that requires
some additional logic.

One possible way is to not freeze IO on the primary; the secondary would
still outdate itself implicitly, and then fail its monitor, and be
stopped (and moved elsewhere, if we could ;-). That seems correct, and
not worse than anything dopd does today; freeze-io probably is an
additional "panic guard".
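
A rough sketch of that variant, seen from the secondary's monitor. The
function name and its two boolean inputs are hypothetical; only the
return codes are the standard OCF values. How a real RA would learn
these two facts from drbd is exactly the open question, so treat this
as an illustration of the decision, nothing more.

```python
OCF_SUCCESS = 0      # standard OCF return code: resource is fine
OCF_ERR_GENERIC = 1  # standard OCF return code: generic failure


def secondary_monitor(connected: bool, peer_confirmed_down: bool) -> int:
    """Hypothetical monitor result for a secondary in this variant.

    The secondary outdates itself implicitly when it has lost the peer
    without the cluster having confirmed that the peer is down. In that
    state the monitor fails, so the cluster stops the instance.
    """
    implicitly_outdated = not connected and not peer_confirmed_down
    return OCF_ERR_GENERIC if implicitly_outdated else OCF_SUCCESS


# Replication link breaks while both nodes stay up: the monitor fails.
link_broken = secondary_monitor(connected=False, peer_confirmed_down=False)

# Primary crashed and the cluster confirmed it: the monitor succeeds,
# so the secondary can still be promoted.
peer_crashed = secondary_monitor(connected=False, peer_confirmed_down=True)
```

The primary is not part of this sketch on purpose: in this variant it
simply keeps serving IO.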

BTW, when the secondary fails its "monitor", we'll stop it. The
resulting stop notification could for example un-freeze the primary. An
alternative is to use crm_resource -F as a call-out when drbd notices
the master is gone, which would provide Pacemaker with an async failure
notification and prevent the timeouts
...

> 	that is why dopd was invented in the first place.

Yes, I know.

> variation:
> 	as it may be a cluster partition.
> 	with stonith, (at least) one of the nodes gets shot.
> 	primary must freeze until peer is confirmed outdated (or shot)
> 	and must unfreeze again as soon as peer is confirmed outdated (or shot)

We can't confirm it's outdated, but we can tell you when the peer is
shot/stopped.

> where and when do what notifications come in,

That's explained here:
http://wiki.linux-ha.org/v2/Concepts/Clones#head-f9fa0f9ab22e08d82c8f00e15d9724eba47f7576

> and how is drbd (the RA) to react on those?

See above. How to actually provide the signals to drbd (the module ;-)
is of course open to discussion, and I look to you to tell me what
works best.

> I recently discussed with our Andreas Kurz, that
> what _could_ possibly work is a "monitor" action,
> (and optionally some daemon)
> that periodically gets the "data generation uuids" from drbd
> and feed that into the cib (reuse attrd?)

I think that is way too complicated and not needed; I think the
notifications are sufficient, as they provide the peer up/down
promote/demote events. But I may be wrong.
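
To make that concrete, here is a minimal sketch of how a notify action
could decide "the peer is down" from the notification data alone. The
OCF_RESKEY_CRM_meta_notify_* names are the environment variables as I
know them from Pacemaker; check them against the version you run. The
function itself and the node names are invented for the example.

```python
from typing import Mapping


def peer_went_down(env: Mapping[str, str], peer: str) -> bool:
    """Return True if this notification confirms the peer was stopped
    or demoted. Hypothetical helper, not part of any existing RA."""
    # Only trust "post" notifications: the operation has completed.
    if env.get("OCF_RESKEY_CRM_meta_notify_type") != "post":
        return False
    op = env.get("OCF_RESKEY_CRM_meta_notify_operation")
    if op not in ("stop", "demote"):
        return False
    # Space-separated list of nodes the operation was performed on.
    affected = env.get(f"OCF_RESKEY_CRM_meta_notify_{op}_uname", "")
    return peer in affected.split()


# Example: the cluster has finished demoting the instance on node-a.
example_env = {
    "OCF_RESKEY_CRM_meta_notify_type": "post",
    "OCF_RESKEY_CRM_meta_notify_operation": "demote",
    "OCF_RESKEY_CRM_meta_notify_demote_uname": "node-a",
}
confirmed = peer_went_down(example_env, peer="node-a")
```

No uuids in the cib and no extra daemon are involved; whether that is
really enough is what I would like to verify with you.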

> so I'll try again after I got some sleep.

Good point ;-) I will do the same. And, as I mentioned, bring a
whiteboard to Prague.

If I can explain this so that it works, can I have my floating peers
supported in exchange? ;-)

Regards & good night,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde