[Linux-ha-dev] Re: [DRBD-user] drbd peer outdater: higher level implementation?

Lars Marowsky-Bree lmb at suse.de
Sat Sep 13 18:55:05 CEST 2008



On 2008-09-13T14:52:53, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> so.
> what you are suggesting is
> 
> when drbd loses replication link
>   primary
>      freezes and calls out to userland,
>        telling heartbeat that the peer has "failed",
>        in which case heartbeat would stop drbd on the secondary.
>      either receives "secondary was stopped",
>         maybe stores to meta data "_I_ am ahead of peer",
>             (useful for cluster wide crash/reboot later)
> 	and unfreezes
>      or is being stopped itself
>        (which would result in the node being self-fenced, as the fs on
>         top of drbd cannot be unmounted while drbd is frozen,...)
>      or is even being shot as result of a cluster partition.
> 
>      so either primary continues to write,
>      or it will soon look like a crashed primary.
> 
>   secondary
>     sets a flag "primary may be ahead of me",
>     then waits for
>     either being stopped, in which case
>       it would save to meta data "primary _IS_ ahead of me"
>     or being told that the Primary was stopped
>       when it would clear that flag again,
>         maybe store to meta data "_I_ am ahead of peer"
>       and then most likely soon after be promoted.

Okay, I think I have found a good name for the flag I mean, which
should allow me to rephrase more clearly, and possibly simplify
further. Retry:

An unconnected secondary starts up with "peer state dirty", which is
basically identical to "(locally) unsynced/inconsistent", but I think
it's easier to explain when I call it "peer dirty".

After it connects (and possibly resynchronizes), it clears the
"peer dirty" flag locally. (Assume that both sides are secondary at
this point; that'd be the default during a cluster start-up.)

When one side gets promoted, the other side sets the "peer dirty" flag
locally.  When it demotes, both sides clear it. Basically, each side
gets to clear it when it notices that the peer is demoted.  So far, so
good.
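The flag rules above can be sketched as a tiny state model (this is an
illustrative sketch in Python, not DRBD code; the names are invented):

```python
class DrbdNode:
    """Models one node's local "peer dirty" flag, per the rules above."""

    def __init__(self):
        # An unconnected secondary starts up assuming the peer may be ahead.
        self.peer_dirty = True

    def on_connect_and_resync(self):
        # After connecting (and possibly resynchronizing), clear the flag.
        self.peer_dirty = False

    def on_peer_promote(self):
        # The peer is primary now; it may write data we do not have.
        self.peer_dirty = True

    def on_peer_demote(self):
        # We noticed the peer is demoted; it can no longer get ahead of us.
        self.peer_dirty = False

    def may_promote(self):
        # A node that considers its peer dirty must never announce
        # its ability to become master.
        return not self.peer_dirty


n = DrbdNode()
assert not n.may_promote()      # fresh start: "peer dirty" by default
n.on_connect_and_resync()
assert n.may_promote()          # in sync: promotable
n.on_peer_promote()
assert not n.may_promote()      # peer is primary: not promotable
n.on_peer_demote()
assert n.may_promote()          # peer demoted again: promotable
```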

Scenario A - the replication link goes down:

- Primary:
  - Freezes IO.
  - Calls out to user-space to "fail" the peer.
  - Gets confirmation that peer is stopped (via RA notification).
  - Resumes IO.

- Secondary:
  - Simply gets stopped.
  - It'll assume "peer dirty" anyway, until it reconnects and
    resyncs.


Scenario B - primary fails:

- Primary:
  - Is dead. ;-)

- Secondary:
  - Gets confirmation that peer is stopped.
  - Clears inconsistent flag (capable to resume IO).

Scenario C - secondary fails:

- Primary:
  - Same as A, actually, from the point of view of the primary.

- Secondary:
  - Either gets fenced, or stopped.


(Note that A/B/C could actually work for active/active too, as long as
there's a way to ensure that only one side calls out to fail its peer,
and the other one - for the sake of this scenario - behaves like a
secondary.)
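The primary's side of scenario A can be sketched as a simple sequence
(hypothetical Python, with invented names; the callout stands in for the
user-space "fail the peer" request):

```python
def primary_on_link_loss(fail_peer):
    """Sketch of the primary's behaviour when the replication link drops.

    fail_peer: callable that asks the cluster manager to stop the peer,
    returning True once the peer is confirmed stopped (RA notification).
    Returns whether IO is still frozen afterwards.
    """
    io_frozen = True                # 1. freeze IO
    peer_stopped = fail_peer()      # 2. call out to user-space to "fail" the peer
    if peer_stopped:                # 3. confirmation that the peer is stopped
        io_frozen = False           # 4. resume IO
    return io_frozen


# If the callout confirms the peer was stopped, IO resumes:
assert primary_on_link_loss(lambda: True) is False
# If no confirmation ever arrives, the primary stays frozen
# (and will soon look like a crashed primary):
assert primary_on_link_loss(lambda: False) is True
```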

> some questions:
>   wouldn't that "peer has failed" first trigger a monitor?

No; it'd translate to a direct stop.

>   wouldn't that mean that on monitor, a not connected secondary would
>   have to report "failed", as otherwise it would not get stopped?
>   wouldn't that prevent normal failover?

Monitoring definitions are a slightly different matter. The result of a
monitor is not the same as the ability/preference to become master.
A failed resource will indeed never get promoted, but a healthy resource
need not call crm_master, and thus would not become promotable.

I think "monitor" would refer exclusively to local health - local
storage readable/writable, drbd running, etc.

>   if not,
>   wouldn't heartbeat try to restart the "failed" secondary?
>   what would happen?

It might try to restart. But if a secondary gets restarted, it'll know
from the environment variables that a peer exists; if it can't connect
to that peer, it should fail the start. Alternatively, it'd be up and
running, but have "outdated/peer is dirty" set anyway, and so never
announce its ability to "promote".
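The two start-time variants can be sketched like this (illustrative
Python with invented names, not the actual RA):

```python
def start_secondary(peer_known, can_connect, strict=True):
    """Sketch of a (re)started secondary's start decision.

    peer_known:  the environment variables say a peer exists
    can_connect: the replication link to that peer works
    strict:      variant 1 (fail the start) vs variant 2 (start dirty)
    Returns (started, peer_dirty).
    """
    if peer_known and not can_connect:
        if strict:
            return (False, True)    # variant 1: fail the start outright
        return (True, True)         # variant 2: up, but "outdated/peer dirty",
                                    # so it never offers to promote
    return (True, False)            # connected: resync clears the flag


assert start_secondary(peer_known=True, can_connect=False, strict=True) == (False, True)
assert start_secondary(peer_known=True, can_connect=False, strict=False) == (True, True)
assert start_secondary(peer_known=True, can_connect=True) == (True, False)
```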

>   what does a secondary do when started, and it finds the
>     "primary IS ahead of me" flag in meta data?
>     refuse to start even as slave?
>       (would prevent it from ever being resync'ed!)
>     start as slave, but refuse to be promoted?

The latter.

> problem: secondary crash.
>    secondary reboots,
>    heartbeat rejoins the cluster.
>    
>    replication link is still broken.
> 
>    secondary does not have "primary IS ahead of me" flag in meta data
>    as because of the crash there was no way to store that.

>    would heartbeat try to start drbd (slave) here?
>    what would trigger the "IS ahead of me" flag get stored on disk?

See above; it would _always_ come up with the assumption that "peer is
dirty", and thus refuse to promote. No need to store anything on disk;
it is the default assumption.

>    if for some reason policy engine now figures the master should rather
>    run on the just rejoined node, how can that migration be prevented?

That's a different discussion, but: the ability (and preference) to
become primary is explicitly set by the RA through a call to
"crm_master".

If it is unable to become master, it would call "crm_master -D"; it'll
then _never_ be promoted.
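In effect the RA maintains a master score for its node; without one, the
policy engine never promotes that node. A sketch of that mapping,
modelled in Python (the real RA would shell out to `crm_master -v <score>`
and `crm_master -D`; the dict here merely stands in for the stored
attribute):

```python
master_score = {}                   # node -> score, as set via crm_master


def crm_master_set(node, score):
    """~ crm_master -v <score>: announce ability/preference to be master."""
    master_score[node] = score


def crm_master_delete(node):
    """~ crm_master -D: withdraw the score; the node is never promoted."""
    master_score.pop(node, None)


def promotable(node):
    # A node with no master score is never chosen for promotion.
    return node in master_score


crm_master_set("node-a", 100)
assert promotable("node-a")
crm_master_delete("node-a")
assert not promotable("node-a")
```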

> I'm still not convinced that this method
> covers as many cases as well as dopd does.

I think so. At least my proposal is becoming more concise, which is good
for review ;-)


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



