Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 2008-09-13T14:52:53, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> so.
> what you are suggesting is
>
> when drbd loses replication link
> primary
>   freezes and calls out to userland,
>   telling heartbeat that the peer has "failed",
>   in which case heartbeat would stop drbd on the secondary.
>   either receives "secondary was stopped",
>     maybe stores to meta data "_I_ am ahead of peer",
>     (useful for cluster wide crash/reboot later)
>     and unfreezes
>   or is being stopped itself
>     (which would result in the node being self fenced, as the fs on
>     top of drbd cannot be unmounted as drbd is frozen, ...)
>   or is even being shot as result of a cluster partition.
>
> so either primary continues to write,
> or it will soon look like a crashed primary.
>
> secondary
>   sets a flag "primary may be ahead of me",
>   then waits for
>   either being stopped, in which case
>     it would save to meta data "primary _IS_ ahead of me"
>   or being told that the Primary was stopped
>     when it would clear that flag again,
>     maybe store to meta data "_I_ am ahead of peer"
>     and then most likely soon after be promoted.

Okay, I think I have found a good name for the flag I mean, which should
allow me to rephrase more clearly, and possibly simplify further.

Retry:

An unconnected secondary starts up with "peer state dirty", which is
basically identical to "(locally) unsynced/inconsistent", but I think it
is easier to explain if I call it "peer dirty". After it connects (and
possibly resynchronizes), it clears the "peer dirty" flag locally.
(Assume that both sides are secondary at this point; that'd be the
default during a cluster start-up.)

When one side gets promoted, the other side sets the "peer dirty" flag
locally. When it demotes, both sides clear it. Basically, each side gets
to clear it when it notices that the peer is demoted.

So far, so good.

Scenario A - the replication link goes down:

- Primary:
  - Freezes IO.
  - Calls out to user-space to "fail" the peer.
  - Gets confirmation that the peer is stopped (via RA notification).
  - Resumes IO.
- Secondary:
  - Simply gets stopped.
  - It'll assume "peer dirty" anyway, until it reconnects and resyncs.

Scenario B - primary fails:

- Primary:
  - Is dead. ;-)
- Secondary:
  - Gets confirmation that the peer is stopped.
  - Clears the inconsistent ("peer dirty") flag (and is thus able to
    resume IO).

Scenario C - secondary fails:

- Primary:
  - Same as A, actually, from the point of view of the primary.
- Secondary:
  - Either gets fenced, or stopped.

(Note that A/B/C could actually work for active/active too, as long as
there's a way to ensure that only one side calls out to fail its peer,
and the other one - for the sake of this scenario - behaves like a
secondary.)
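To spell out the flag handling above, here is a toy model in Python.
This is only an illustrative sketch I'm adding for readability, not
drbd or RA code, and every name in it (PeerDirtyModel and its methods)
is invented:

# Toy model of the "peer dirty" bookkeeping described above -- not drbd
# code, just an executable restatement of the rules; all names are
# invented for illustration.

class PeerDirtyModel:
    def __init__(self):
        # An unconnected secondary always starts with the flag set.
        self.peer_dirty = True

    def on_connect_and_sync(self):
        # After connecting (and possibly resyncing) the flag is cleared.
        self.peer_dirty = False

    def on_peer_promoted(self):
        # The peer became primary, so it may now get ahead of us.
        self.peer_dirty = True

    def on_peer_demoted_or_stopped(self):
        # We were told the peer is demoted/stopped; it can no longer
        # get ahead of us.
        self.peer_dirty = False

    def may_promote(self):
        # Never announce promotability while the peer may be dirty.
        return not self.peer_dirty


if __name__ == "__main__":
    node = PeerDirtyModel()
    assert not node.may_promote()       # unconnected start: refuse
    node.on_connect_and_sync()
    assert node.may_promote()           # both secondaries, in sync
    node.on_peer_promoted()
    assert not node.may_promote()       # peer is primary now
    node.on_peer_demoted_or_stopped()   # scenario B: primary confirmed stopped
    assert node.may_promote()           # now this side may be promoted
    print("peer-dirty model behaves as described")

The point is simply that promotability is derived from the locally kept
flag, and the flag defaults to "set" whenever we cannot know better.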
> some questions:
> wouldn't that "peer has failed" first trigger a monitor?

No; it'd translate to a direct stop.

> wouldn't that mean that on monitor, a not connected secondary would
> have to report "failed", as otherwise it would not get stopped?
> wouldn't that prevent normal failover?

Monitoring definitions are a slightly different matter. The result of a
monitor is not the same as the ability/preference to become master.
Indeed, a failed resource will never get promoted, but a happy resource
needn't call crm_master, and thus would not become promotable.

I think "monitor" would refer exclusively to local health - local
storage readable/writable, drbd running, etc.

> if not,
> wouldn't heartbeat try to restart the "failed" secondary?
> what would happen?

It might try to restart. But if a secondary gets restarted, it'll know
from the environment variables that a peer exists; if it can't connect
to that peer, it should fail the start - alternatively, it'd be up and
running, but have "outdated/peer is dirty" set anyway, and so never
announce its ability to "promote".

> what does a secondary do when started, and it finds the
> "primary IS ahead of me" flag in meta data?
> refuse to start even as slave?
> (would prevent it from ever being resync'ed!)
> start as slave, but refuse to be promoted?

The latter.

> problem: secondary crash.
> secondary reboots,
> heartbeat rejoins the cluster.
>
> replication link is still broken.
>
> secondary does not have "primary IS ahead of me" flag in meta data
> as because of the crash there was no way to store that.
> would heartbeat try to start drbd (slave) here?
> what would trigger the "IS ahead of me" flag get stored on disk?

See above; it would _always_ come up with the assumption that "peer is
dirty", and thus refuse to promote. No need to store anything on disk;
it is the default assumption.

> if for some reason policy engine now figures the master should rather
> run on the just rejoined node, how can that migration be prevented?

That's a different discussion, but: the ability (and preference) to
become primary is explicitly set by the RA through the call to
"crm_master". If it is unable to become master, it would call
"crm_master -D"; it'll then _never_ be promoted.

> I'm still not convinced that this method
> covers as many cases as dopd, as well as dopd does.

I think it does. At least my proposal is becoming more concise, which is
good for review ;-)


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
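P.S.: And a similarly hypothetical sketch of the start/promote gating
discussed above - again not the real drbd RA. It assumes a
Pacemaker-style environment where crm_master is available, takes the
"start anyway, but never announce promotability" alternative, and
peer_configured / peer_reachable are placeholders for whatever the
agent would actually check:

# Hypothetical sketch of the start/promote gating -- not the real drbd
# RA.  Assumes crm_master is available on the node; the peer_* flags
# stand in for whatever the agent would actually check.
import subprocess

def announce_promotability(promotable, score=100):
    # The RA explicitly tells the CRM whether this node may be promoted.
    if promotable:
        subprocess.call(["crm_master", "-v", str(score)])
    else:
        subprocess.call(["crm_master", "-D"])   # never promote this node

def start_as_secondary(peer_configured, peer_reachable):
    # The start succeeds either way, but an unconnected secondary
    # assumes "peer dirty" and therefore never announces its ability
    # to be promoted.
    peer_dirty = peer_configured and not peer_reachable
    announce_promotability(promotable=not peer_dirty)
    return 0   # success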