[DRBD-user] make a script for my needs

Fri Jul 5 16:19:47 CEST 2013

On Thu, Jul 04, 2013 at 10:25:01AM -0700, brain at click.com.py wrote:
> Hi Lars
> 
> According to your words:
> "With special purpose built fencing handlers, we may be able to fix
> your setup so it will freeze IO during the disconnected period,
> reconnect, and replay pending buffers, without any reset.", and ...
> 
> "Back to the sentence that you chose to quote, you would have to write
> (program) those "special purpose built fencing handlers" first, and
> they would only make any sense *after* you implemented fencing (and
> upgraded your DRBD).
> 
> And even then, it would be a hack, a bandaid, only to paper over the
> real problem.  So again: this is NOT what I recommend you to do."
> 
> I reply:
> Many thanks for your recommendations, but i want the script.

[--- snipped christmas wishlist ---]

No you don't.

You want to understand your "problem" and its implications.
You want to re-read the DRBD User's Guide, including the appendices.
You want to stop using Dual-Primary where it is misplaced.
You want to stop using "data integrity alg" where it hurts.

And you want *real* fencing.

Stop reading now, and follow the advice above.

data-integrity-alg is a *diagnostic* feature.  It cannot prevent data
corruption, much less so when it is *potential* data corruption caused
by misbehaving upper layers.  Because they will keep misbehaving.
But this has been discussed before, and you are not interested.

That said:
DRBD fencing policy "resource-and-stonith" will freeze IO,
and if the fence-peer handler does not return "successfully fenced",
or does not return at all,
 IO will stay frozen,
 until either
  * the replication link is re-established *and* both nodes agree that it is
    safe to resume replication, or
  * the admin explicitly resumes IO
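
For illustration only, a minimal sketch of what that policy looks like
in drbd.conf, assuming DRBD 8.x style syntax. The resource name "r0"
and the handler path are assumptions, not taken from your setup; check
the drbd.conf man page of the version you actually run.

```
resource r0 {                      # "r0" is a placeholder resource name
  disk {
    # On loss of the replication link: freeze IO on this node
    # and call the fence-peer handler.
    fencing resource-and-stonith;
  }
  handlers {
    # Assumed install path; adjust to wherever your packages
    # put the handler scripts.
    fence-peer "/usr/lib/drbd/rhcs_fence";
  }
  # ... rest of the resource definition unchanged ...
}
```

If IO stays frozen and you have verified by hand that it is safe to
continue, "drbdadm resume-io r0" is the explicit admin resume meant above.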

As proxmox "HA" apparently is built around rgmanager, there will be
fence agents available, so you should be able to use the existing
"obliterate-peer.sh" or "rhcs_fence" handlers and/or tailor them to your needs.

But if you insist that you think you want "that" script,
(even though I tell you that you really truly do NOT want that,
not without understanding the full picture and implications,
but you don't care, because you are immune to feedback),
use a fence-peer handler of "exit 1", or "sleep 864000".
That may "appear" to "solve" your "problem",
but it will break every *real* failure scenario (because IO will be frozen).

And it will NOT help you at all if you get those checksum mismatches
that often, because now every time that happens, you will freeze IO,
or hard-reset at least one of your nodes. That does not solve anything.
Don't calculate and compare checksums if you know they won't match anyway.

Please *FIX YOUR SETUP*.
Don't add more bandages and sticky tape.

Ah well. Worth a try.

	Lars