[DRBD-user] DRBD crash on two nodes cluster. Some help please?

Thu Oct 29 16:53:24 CET 2009

On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> Hello all again.
> 
> In continuation of the issue described below, with integrity check
> enabled, I used to get a crash at least once per 24 hours.

No.
You don't get "crashes".

You configured it to fence its peer on connection loss,
and that is what it does.

> Now I have integrity check disabled and the cluster is running without
> crashes for the last 9 days.
> 
> Could someone kindly provide some hints about the possible reasons for
> this observed behavior?
> 
> Off-loading is disabled on both dedicated gigabit NICs.

Either something modifies in-flight buffers,
which may or may not be intentional,
and may or may not be "safe" wrt file system data integrity.

Or you actually _do_ have data corruption.

If drbd detects a checksum mismatch (== data corruption,
or more generally: the data received is not the same as
it was when the checksum was calculated before it was
sent), rather than knowingly writing diverging data,
drbd disconnects and tries to reconnect,
hoping for the bitmap-based resync to send
"better" data this time.
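
To illustrate the first case (a buffer modified after the checksum
was taken), here is a minimal Python sketch. It is a toy model, not
DRBD code: the function names are made up for illustration, and
SHA-256 is an arbitrary stand-in for whatever digest is configured.

```python
import hashlib

def send_block(buf: bytearray, modify_in_flight: bool = False):
    """Toy sender: take the checksum first, then 'transmit' the buffer."""
    digest = hashlib.sha256(buf).digest()  # checksum calculated before sending
    if modify_in_flight:
        buf[0] ^= 0xFF  # something changes the buffer after the checksum was taken
    return bytes(buf), digest

def receive_block(payload: bytes, digest: bytes) -> bool:
    """Toy receiver: True only if the payload still matches its checksum."""
    return hashlib.sha256(payload).digest() == digest

clean = receive_block(*send_block(bytearray(b"block data")))
dirty = receive_block(*send_block(bytearray(b"block data"), modify_in_flight=True))
print(clean, dirty)  # True False
```

The receiver cannot tell whether the change was a harmless late
modification or real corruption. It only sees that the data no longer
matches the checksum.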

On disconnect, if so configured, a primary will call its
fence-peer handler.

You configured "obliterate" as the fence-peer handler.

So it "obliterates" its peer.
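
The chain of events can be sketched as a toy model in Python. Again
this is not DRBD code: the node names, the function names and the
events list are hypothetical, and the handler here only records what
a real fencing script would do to the peer.

```python
events = []

def obliterate(peer: str) -> None:
    # Stand-in for a fencing script; a real one would take the peer down.
    events.append(f"fence {peer}")

def on_disconnect(role: str, peer: str, fence_peer=None) -> bool:
    """Toy model: a primary with a configured handler fences its peer."""
    if role == "primary" and fence_peer is not None:
        fence_peer(peer)
        return True
    return False

# A checksum mismatch leads to a disconnect; each node then reacts to it.
called = on_disconnect("primary", "node-b", obliterate)
skipped = on_disconnect("secondary", "node-a", obliterate)
no_handler = on_disconnect("primary", "node-b")
print(called, skipped, no_handler, events)
```

In this model only the first call fences anything: the node is
primary and has a handler configured. The mismatch itself never
"crashes" a node; the configured reaction to the disconnect does.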

> Also is integrity-check really needed (I have read the
> documentation :) ) if it keeps on breaking the cluster?

Not if you would rather have silent data corruption :-)

==> Find the cause of the checksum mismatch.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed