Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Felix,

Andreas (andreask) and I have been doing some thinking here, and our conclusion is essentially that you are performing an administrative task, aka human intervention (namely, putting a node in standby), and thus it can be expected that you check the DRBD state before doing so.

We've been unable to come up with a scenario in which you would end up in the state you described _automatically_ (i.e. without human intervention), but if you can think of one, please do share. The one you did mention doesn't fulfill this criterion -- see below.

> Apart from implementing my own "crm node standby" wrapper script, can I
> somehow configure pacemaker to give me a hint to the effect of "you're
> trying to put node X on standby, but trust me, it's not a good idea"?

This is something that has been discussed here previously. A resource agent could report to Pacemaker during monitor (via an exit code named, say, OCF_ERR_DEGRADED) that a resource, or a resource instance in the case of a clone or master/slave set, is in a degraded state. Pacemaker could then disregard that state for PE recalculations altogether (unlike a failure), but issue an event, record the information in the CIB status section, and display a degradation warning in crm_mon.

Then the shell could issue warnings if you attempt to put a node in standby, or you could just take a look at crm_mon before you put a node in standby. This is something that I always do, and I presume others do as well. That would make it a lot harder for you to shoot yourself in the foot.

Which leads us back to the discussion about an updated OCF RA spec. Lars, I may be getting on your nerves about this, but an update regarding your status would be much appreciated.

> In the end, this may probably turn out to be a question of policy, as in
> "the documentation to my sys-ops staff should include a mandatory check
> of the DRBD status before initiating any failovers", but I like
> additional safety nets.

And that's entirely understandable.

> Plus, if pacemaker decides on its own that a failover is necessary (ping
> went away etc.) and for one reason or another a quick sync had been
> triggered (activity log resync after maintenance-cron task or similar -
> granted, this shouldn't really happen), it may shoot itself in the foot
> unnecessarily.

No, it won't. Remember, even while a resync is in progress, it's perfectly fine to promote the DRBD SyncTarget to the Primary role, and that is what Pacemaker would do. In that case, the SyncSource does not go away, and DRBD still has a node to fetch good data from. The problem you described exists only if the SyncSource is taken down.

And if your failover is due to the Master going down hard, beyond repair, while being a SyncSource, then of course you've got a bigger problem. You can still go back to an automatically generated snapshot if you have one, though.

>> That would be attempting to put out a fire with gasoline. A failed stop
>> leads to fencing, and then you've got an inconsistent node and a dead node.
>
> Does this hold true when stonith-enabled=false?

(For those reading this discussion from the archives: this goes back to the original "put the Master, which is a DRBD SyncSource, into standby" scenario, not the "Pacemaker-initiated failover" one.)

Doesn't change much. With stonith-enabled=false, ocf:linbit:drbd looping during stop until the sync completes, and the stop operation hitting its timeout, failover would still occur. Except that it doesn't magically make your data UpToDate on the node you fail over to.
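
Coming back to your wrapper-script idea from further up: just to illustrate, a minimal sketch of such a check could look like the following. This is not anything we ship; it assumes a single DRBD 8 resource named "r0", that drbdadm is available, and that you run it on the node you are about to put in standby (drbdadm reports that node's local/peer view). Adapt to taste.

#!/bin/bash
# safe-standby: refuse to put a node in standby while DRBD is degraded.
# Illustrative sketch only -- the resource name "r0" and the node-name
# argument are assumptions, not anything Pacemaker provides.

NODE="${1:?usage: safe-standby <node>}"
RES="r0"

# "drbdadm dstate <res>" prints the local/peer disk states,
# e.g. "UpToDate/UpToDate" or "UpToDate/Inconsistent".
DSTATE="$(drbdadm dstate "$RES")"

if [ "$DSTATE" != "UpToDate/UpToDate" ]; then
    echo "Refusing standby of $NODE: DRBD resource $RES is $DSTATE" >&2
    echo "Wait for the resync to finish (watch /proc/drbd)," \
         "or run 'crm node standby' directly if you really mean it." >&2
    exit 1
fi

exec crm node standby "$NODE"

Obviously that only covers one resource and one node's view of it; glancing at crm_mon before a standby, as above, remains the simpler habit.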
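
And purely to illustrate the OCF_ERR_DEGRADED idea: none of this exists in the current OCF spec or in Pacemaker today; the exit code name, its value 190, and the drbd_status helper below are made up for the sketch. A monitor action could then report degradation roughly like this (assuming the usual ocf-shellfuncs has been sourced, which defines OCF_SUCCESS and OCF_NOT_RUNNING):

# Hypothetical only: OCF_ERR_DEGRADED is not defined anywhere yet;
# 190 is an arbitrary placeholder value.
: "${OCF_ERR_DEGRADED:=190}"

drbd_monitor() {
    # drbd_status is a stand-in for whatever the agent really uses;
    # assume it sets $role (Primary/Secondary/...) and $dstate
    # (e.g. "UpToDate/UpToDate").
    drbd_status
    case "$role" in
        Primary|Secondary)
            # The resource is running, but an Inconsistent local or peer
            # disk means redundancy is gone: report "degraded" instead of
            # plain success, so the CRM can warn rather than recover.
            if [ "$dstate" != "UpToDate/UpToDate" ]; then
                return "$OCF_ERR_DEGRADED"
            fi
            return "$OCF_SUCCESS"
            ;;
        *)
            return "$OCF_NOT_RUNNING"
            ;;
    esac
}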
Cheers,
Florian