[DRBD-user] DRBD crash on two nodes cluster. Some help please?

Theophanis Kontogiannis theophanis_kontogiannis at yahoo.com
Sun Nov 1 14:05:42 CET 2009


Hello Lars and All,

Please see below.

On Thu, 2009-10-29 at 16:53 +0100, Lars Ellenberg wrote:

> On Thu, Oct 29, 2009 at 04:40:01PM +0200, Theophanis Kontogiannis wrote:
> > Hello all again.
> > 
> > Following up on the issue described below: with the integrity check
> > enabled, I used to get a crash at least once per 24 hours.
> 
> No.
> You don't get "crashes".
> 
> You configured it to fence its peer on connection loss,
> and that is what it does.
> 

Correct, in strict terminology. What I had in mind was that both nodes get
fenced, so I get a "crash" in the sense of having no service.
But yes, what actually happens is fencing, not a crash.


> > Now I have integrity check disabled and the cluster is running without
> > crashes for the last 9 days.
> > 
> > Could someone kindly provide some hints about the possible reasons for
> > this observed behavior?
> > 
> > Off-loading is disabled on both dedicated gigabit NICs.
> 
> Either something modifies in-flight buffers,
> which may or may not be intentional,
> and may or may not be "safe" wrt file system data integrity.
> 
> Or you actually _do_ have data corruption.
> 
> If drbd detects a checksum mismatch (== data corruption,
> or more generally: the data received is not the same as
> it was when the checksum was calculated before it was
> sent), rather than knowingly writing diverging data,
> drbd disconnects, and tries to reconnect,
> hoping for the bitmap based resync to send
> "better" data this time.
> 
> On disconnect, if so configured, a primary will call its
> fence-peer handler.
> 
> You configured "obliterate" as fence peer handler.
> 
> So it "obliterates" its peer.
> 
> > Also is integrity-check really needed (I have read the
> > documentation :) ) if it keeps on breaking the cluster?
> 
> If you rather have silent data corruption :-)
> 
> ==> Find the cause of the checksum mismatch.
> 
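
To make the mechanism quoted above concrete for myself, here is a minimal
sketch in Python. It is only an illustration under assumptions: zlib.crc32
stands in for whatever integrity algorithm is configured, and the names
send() and receive() are hypothetical, not DRBD functions. It shows how a
buffer modified after the checksum was calculated, but before the data is
sent, is reported as a mismatch by the receiver.

```python
import zlib


def send(buf: bytearray, mutate_in_flight: bool = False):
    """Hypothetical sender: checksum first, then hand the buffer to the wire."""
    digest = zlib.crc32(bytes(buf))
    if mutate_in_flight:
        # Assumption for illustration: something changes the buffer
        # after the checksum was calculated but before it is sent.
        buf[0] ^= 0xFF
    return digest, bytes(buf)


def receive(digest: int, payload: bytes) -> bool:
    """Hypothetical receiver: recompute the checksum and compare.

    False corresponds to the case where the receiver would refuse to
    write the block and drop the connection instead.
    """
    return zlib.crc32(payload) == digest


ok = receive(*send(bytearray(b"block data")))
bad = receive(*send(bytearray(b"block data"), mutate_in_flight=True))
print(ok, bad)
```

The receiver cannot tell from the mismatch alone whether the data was
damaged on the wire or changed on the sender after the checksum was taken,
which is why the cause has to be tracked down separately.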

Is there any way to trace the CRC error at a really low level? Can I turn on
extremely verbose debugging in drbd, or use something else?
I cannot think of a good way to dig into this at a low level.

Thank you all for your time.
T.K.
