Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 2008-09-14T16:31:36, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:

> > I meant "eventually", ie sometime the admin is going to fix it and then
> > it'll be able to reconnect and resync, and clear the flag.
> admin intervention required for a network hiccup.
> not an option.

Depends on how the cluster is configured. A restart can happen
automatically, too. And, with floating peers implemented right, possibly
on another node - which some customers might like.

> > Well, of course I'm describing the target scenario, not the current one.
> > I entirely agree that that is possible right now.
> sure. but we have dopd. and it covers this.
> master/slave notifications alone, as was your original proposal,
> certainly cannot, as you meanwhile noticed.
> your current, combined proposal involving the notifications for some
> part, and calling out to "fail" a node, i.e. stop the secondary because
> of a network hiccup, is worse than dopd.

I'm not sure it is. It achieves what dopd does w/o needing dopd, but by
interacting with the cluster layer only, and by, I think, simplifying
drbd's logic - I think that's a win.

And yes, for the cluster to notice that something needs to be done about
the replication link going down _requires_ some call-out of some form.
That's internal state of drbd which the cluster naturally doesn't have
access to, so it must be communicated somehow. However, dopd still
remains internal and unknown to the cluster, so the cluster's policy
system can't help with the recovery. Exposing this might have some
charm, described further below.

The notifications provide each clone (or m/s) instance with the
cluster's state about the peers and intended/completed state changes, so
I think those are useful.

> you try to convince me to stay with dopd ;)

Well, yes; that is one possible result of exploring the other
alternatives - at least then we'll all agree and understand why that is
the case. Or we might even identify that we have scenarios where dopd is
needed, and others where a different approach is recommended ...

> > Second, even if Pacemaker would restart the secondary (which was stopped
> > due to the failure), the secondary would be unable to promote as "the
> > flag" would be set by default on start-up.
> >
> > I really believe that the approach I described covers this.
> and needs admin intervention to start the secondary again,
> just because some switch reset.

How so? In fact, the default response by Pacemaker to a failed resource
is to _restart_ it. No admin intervention required. But that's tunable;
it could be set to restart it only N times within M seconds, or
fail-over to a different node etc - that strikes me as quite powerful
enough.

> > First, the theoretical response to this is that replication link down
> > plus crash of two nodes actually constitutes a triple failure, and thus
> > not one we claim the cluster protects against. ;-) For some customers,
> > manual intervention here would be acceptable.
> dopd handles it, your proposal does not.
> dopd is already there.
> dopd is simpler.
> dopd wins.

I disagree. I was merely pointing out the fact that we can always
construct failure sequences which are not satisfactorily solved. For
example, in your scenario, if after the cluster crash only the old
secondary comes back, I am _sure_ there is someone out there who'd
rather continue serving data with possibly a few transactions missing
than not serve at all - which would require an admin to step in and
clear the outdated flag.

(And no, I have no answer to this case ;-)
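For concreteness, the kind of manual intervention I mean would be
roughly the following - the resource name "r0" is just an example, and
the exact drbdadm invocation depends on the drbd 8 release at hand:

    # On the surviving, outdated ex-secondary, an admin who decides that
    # serving slightly stale data beats serving nothing could force it
    # into service despite the outdated flag:
    drbdadm -- --overwrite-data-of-peer primary r0
    # ... and then start the services on top of it; if/when the old
    # primary ever returns, it should become the sync target.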
> > But second, a possible solution is to write a persistent "I was primary"
> > flag to the meta-data.
> already there.

Perfect ;-) Then nothing further is needed.

> > On start, this would then set crm_master's
> > preference to non-zero value (say, 1), which would allow the node to be
> > promoted. This might be a tunable operation.
> absolutely not. for a real primary crash, the secondary was promoted.
> cluster crash. both have their "I was primary" flag set.
> it's not that simple.

I know that. As I've pointed out later, describing _exactly this
scenario_, they could instead use the "number of primary transitions
seen" (acknowledging the shortcomings of generation counters, but which
would be quite reasonable here), which would make Pacemaker prefer the
more "recent" node.

But actually, in this case, neither side would have the "outdated" flag
set either, so we're actually discussing something not quite related to
dopd anyway, aren't we?

> > > but that must not happen. how could it then possibly resync, ever?
> > Pacemaker can be configured to restart it too, which would attempt a
> > reconnect (or even attempt the reconnect periodically, if the RA would
> > fail to start if unable to connect to the peer, but that might not even
> > be needed - restarting it once and keeping it running is sufficient).
> drbd attempts to reconnect on its own.

Exactly. Hence it would do that after a restart.

> dopd does not need to restart it.
> dopd wins.

"Restarting it" is merely a way of achieving the dopd functionality w/o
needing dopd. _Of course_ dopd can already do that. If you're going to
critique my proposal on the basis that it doesn't do more than dopd, of
course you're going to be right. ;-)

Anyway, I can actually point out a few cases: the restart might happen
on another node, which dopd can't achieve. The (possibly forced) restart
might clear up some internal state within drbd (or the whole system)
which might allow it to reconnect. The former is my pet project of
floating peers, but the latter is not that unlikely, either. Many errors
are transient and are solved by a restart (if not even a reboot).

> The typical deployment with DRBD is still
> two nodes with direct attached storage each.

Yes, I understand that. I think it likely accounts for 95% of all our
customer deployments, if not 98%. This is partly because drbd is hard to
extend to other scenarios right now, though, not because there would not
be a need for it.

> if you are arguing floating peers around SANs,
> then that is a rare special case.

Yes, of course I know. I'm not sure it would stay as rare as that,
though. But it would enable drbd to be used in rather interesting and
more ... lucrative deployments, too.

> if you are arguing "cold spare" drbd nodes with DAS,
> (which is an even more rare special case) you really think that getting
> rid of dopd was worth a full sync every time we have a network hiccup?

Well, first, this is not that rare in demand, though of course not
easily possible today. Some customers with bladecenters would quite like
this.

Second, no, of course not every time. Pacemaker could be configured to
try up to, say, 3 local restarts within 24 hours before doing a
fail-over to another node - and suddenly the deployment sounds a bit
more attractive ... (a sketch of such a configuration is below).

For these last two "rare" scenarios, the fail-over might not just be
caused by the replication link going down, but also by the node losing
its storage or the connection to it, in which case a fail-over is quite
desirable.
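Purely to illustrate the kind of policy I mean - the RA name and
resource ids are placeholders, and the exact syntax depends on the
Pacemaker version - in crm shell terms this would look roughly like:

    # illustrative only: assumes a master/slave drbd RA, here called
    # ocf:linbit:drbd, managing an example resource "r0"
    primitive drbd-r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        meta migration-threshold=3 failure-timeout=86400
    ms ms-drbd-r0 drbd-r0 \
        meta master-max=1 clone-max=2 notify=true
    # migration-threshold=3: after three local failures, move elsewhere;
    # failure-timeout=86400: forget failures older than a day.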
> > See above for one possible solution.
> I'm just pointing out that you started out claiming
> "I actually think that dopd is the real hack, and drbd instead should
> listen to the notifications we provide, and infer the peer state by that
> means ..."
> I accused you of handwaving, and you said no, it is all perfectly clear.

Well, I admit to having been wrong on the "perfectly clear". I thought
it was clear, and the elaborate discussion is quite helpful. And calling
dopd the real hack might have been offensive, for which I apologize. But
I'd still like to understand whether we could do without it, and
possibly even achieve more.

> now, when it comes to fill in those dots,
> you need to reconsider again and again.

Right. That tends to happen during a dialogue - it would make me look
rather silly if I ignored new insights, wouldn't it? ;-)

> while dopd is already there.

Yes, it's there for heartbeat, but it is not there at all for openAIS,
and I don't think it works well with the m/s resources (which I'm also
trying to improve here). So I'm looking at how we could achieve this for
the future.

(Personally, I consider heartbeat as the cluster comm layer as dead as
you think drbd-0.7 to be; it won't be around on SLE11, for example. So
we really need to find out how we could merge the two well.)

This would, as it relies "only" on Pacemaker, of course continue working
on top of the heartbeat comm layer too.

> and even if it is a "hack", it does a pretty good job.

True.

> good.
> I see you start to look at the whole picture ;)

I'm always happy to learn ;-)

> btw.
> there are no generation counters anymore.
> there is a reason for that: they are not reliable.
> drbd8 does compare a history of generation UUIDs. while still not
> perfect, it is much more reliable than generation counters.

Good to know. But, even if this is somewhat unrelated to the "outdated"
discussion, how would you translate this to the "N1+N2" (ie, two former
primaries) recovery scenario?

Compared to heartbeat v1, at least Pacemaker would allow you to declare
your preference for becoming primary, but that needs to be numeric
(higher integer wins). Maybe worth a separate thread, but I could see
them pushing their "UUID" into the CIB on start, and then (in the
post-start notification) the side which finds the other side's UUID not
in its own history would declare itself unable to become master. (A
rough sketch of what I mean is below.) (Similar to what you discussed
with Andreas Kurz, but applied just to "start" and not to every
monitor.)

> > (Of course I can construct a sequence of failures which would break even
> > that, to which I'd reply that they really should simply use the same
> > bonded interfaces for both their cluster traffic _and_ the replication,
> > to completely avoid this problem ;-)
> it does not need to be perfect.
> it just needs to be as good as "the hack" dopd.
> and preferably as simple.

To be honest, simply using the same links would be simpler. (Tangent: we
have a similar issue for example with OCFS2 and the DLM, and trying to
tell the cluster that A can no longer talk to C is an icky problem.
There's no m/s state as with drbd, but the topology complexity of N>2
makes up for this :-/)

On the other hand, that wouldn't clear up the cases where the
replication link is down because of some internal state hiccup, which
the approach outlined might help with.
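To make that UUID idea a bit more concrete, a very rough sketch - the
attribute name, the resource name "r0", the parsing of the generation
identifiers and "$peer" (the peer's node name, e.g. taken from the
notification environment) are all made up for illustration, and the
exact option spellings of the Pacemaker tools have varied between
versions:

    # on "start": publish our current data generation (first field of
    # "drbdadm get-gi") as a node attribute, and allow promotion
    my_uuid=$(drbdadm get-gi r0 | cut -d: -f1)
    crm_attribute --type nodes --node "$(uname -n)" \
        --name drbd-r0-uuid --update "$my_uuid"
    crm_master -l reboot -v 1

    # in the post-start notification: if the peer's published UUID is
    # nowhere in our own GI history, we are presumably behind - withdraw
    # our promotion preference so the cluster won't pick us as master
    peer_uuid=$(crm_attribute --type nodes --node "$peer" \
        --name drbd-r0-uuid --query --quiet)
    if ! drbdadm get-gi r0 | grep -q "$peer_uuid"; then
        crm_master -l reboot -D
    fi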
> > I don't see the need for this second requirement. First, a not-connected
> > secondary in my example would never promote (unless it was primary
> > before, with the extension); second, if a primary exists, the cluster
> > would never promote a second one (that's protected against).
> we have two unconnected secondaries.
> for this example, let's even assume they are equivalent,
> have the very same data generation.
> we promote one of them. so the other needs to be outdated.

"No problem" ;-) Pacemaker will deliver an "I am about to promote your
peer" notification, promote the peer, and then an "I just promoted the
peer" notification. So the other secondary can use that notification to
update its knowledge that the peer is now ahead of it. (A rough sketch
of this is at the end of this mail.)

> "you said hack first" ;)

Whom are you calling a hack!?!?!?! ;-)

> > My goal is to make the entire setup less complex, which means cutting
> > out as much as possible with the intent of making it less fragile and
> > easier to set up.
> I fully agree to that goal.

That's good, so at least we can figure it out from here ... And yes,
dopd is simple.

> > To be frank, the _real_ problem we're solving here is that drbd and the
> > cluster layer are (or at least can be) disconnected, and trying to
> > figure out how much of that is needed and how to achieve it.
> that is good. that's why I'm here, writing back,
> so you can sharpen your proposal on my rejection
> until it's sharp enough.

It also provides a great excuse to put off more boring work. ;-)

> > If all meta-data communication were taken out of drbd's data channel
> > but instead routed through user-space and the cluster layer, none of
> > this could happen, and the in-kernel implementation probably quite
> > simplified. But that strikes me as quite a stretch goal for the time
> > being.
> you can have that. don't use drbd then. use md.
> but there was a reason that you did not.
> that's not really suitable for that purpose.
> right.

Right. But I also know you're looking at the future with the new
replication framework etc, and maybe we might want to reconsider this.
After all, we now _do_ have a "standard" for reliable cluster comms,
called openAIS, which works across Oracle/RHT/GFS2/OCFS2/Novell/etc, and
we have more powerful network block devices (iSCSI ...), so it might
make sense to leverage it, combine it with the knowledge of drbd's state
machine for replication, and make drbd-9 ;-)

But yes, that's clearly a longer reach than what we're trying to discuss
here. It always helps to look towards the future though.

> please, don't give up yet.

I've been with this project for almost 9 years. I'm too stubborn to give
up ;-)

> for starting out with three dots,
> your proposal is amazingly good already.
> it just needs to simplify some more,

Simpler is always good.

> and maybe get rid of required spurious restarts.

The restart of the secondary is not just "spurious", though. It might
actually help "fix" (or at least "reset") things. Restarts are amazingly
simple and effective.

For example, if the link broke due to some OS/module issue, the stop
might fail, and the node would actually get fenced, and reconnect
"fine". Or the stop might succeed, and the reinitialization on "start"
is sufficient to clear things up.

This might seem like "voodoo" and hand waving, but Gray & Reuter quote a
ratio of soft/transient to hard errors for software of about 100:1 -
that is, restarting solves a _lot_ of problems. (Hence why STONITH
happens to be so effective in practice, while it is crude in theory.)

It also moves the policy decision to the, well, Policy Engine, where a
number of other recovery actions could be triggered - including those
"rare cases".
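And, as promised above, a minimal sketch of how the RA's notify action
might react to the peer's promotion - the resource name "r0" is just an
example, and whether outdating itself is the right reaction is exactly
the kind of policy question we're debating:

    drbd_notify() {
        # standard clone/master notification environment from Pacemaker
        local ntype="$OCF_RESKEY_CRM_meta_notify_type"      # "pre" or "post"
        local nop="$OCF_RESKEY_CRM_meta_notify_operation"   # start/stop/promote/demote
        local promoted="$OCF_RESKEY_CRM_meta_notify_promote_uname"

        if [ "$ntype.$nop" = "post.promote" ] &&
           [ "$promoted" != "$(uname -n)" ] &&
           [ "$(drbdadm cstate r0)" != "Connected" ]; then
            # our peer was just promoted while we cannot reach it:
            # mark our own data as outdated, much as dopd would have
            drbdadm outdate r0
        fi
        return 0
    }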
Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde