[DRBD-user] drbdadm verify all seems to produce false positives on ext3 and crash the server

Lars Ellenberg lars.ellenberg at linbit.com
Mon Jun 30 15:38:45 CEST 2008


On Mon, Jun 30, 2008 at 03:27:50PM +0200, Stefan Löfgren wrote:
> Hi again...
> I feel the need to jump in here and share what I've found out so far...
> 
> I have experienced the same kind of problem as you have. Crashes, keyboard
> lockout, blank screens, kernel panics and stuff like that. For some strange
> reason on Sundays (or Monday mornings)!

just randomly pointing fingers here...

but maybe you are running on top of some raid, and maybe that raid
defaults to doing a "check run" every so often, and when drbd tries to
write the whole bitmap, quasi-synchronously, that just takes ages,
because the raid array is so busy "resilvering" itself?
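
One way to test that guess, as a minimal sketch only: it assumes the raid
in question is Linux software raid (md), which the thread does not confirm;
a hardware raid controller would not show up here. On md, each array
reports its current activity in /sys/block/mdX/md/sync_action ("idle",
"check", "resync", ...). The helper names below are made up for
illustration.

```python
from pathlib import Path


def md_sync_actions(sys_block="/sys/block"):
    """Return {md device name: current sync_action} for every md array found.

    Assumes Linux software raid (md); sys_block is a parameter so the
    function can be pointed at a different directory for testing.
    """
    actions = {}
    for path in sorted(Path(sys_block).glob("md*/md/sync_action")):
        try:
            # path is <sys_block>/<mdX>/md/sync_action, so parts[-3] is mdX
            actions[path.parts[-3]] = path.read_text().strip()
        except OSError:
            # array disappeared or file not readable: skip it
            continue
    return actions


def busy_arrays(actions):
    """Keep only the arrays that are doing something other than 'idle'."""
    return {dev: act for dev, act in actions.items() if act != "idle"}


if __name__ == "__main__":
    for dev, act in md_sync_actions().items():
        print(f"{dev}: {act}")
```

If an array shows "check" at about the time the freezes happen, the raid
check and the verify run are competing for the same disks, and it would be
worth moving one of them to a different time slot.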

> Well, I tried to explain in an earlier thread what was going on and failed
> completely. Anyway, I said that I would leave the system as it is to prove
> that it's the verify procedure that causes the system to go completely mad.
> 
> Now, 4-5 weeks later (maybe more) there has been nothing fishy going on with
> the system. Instead of having a crash every Monday morning (3 or 4 weeks in a
> row) I have not seen anything since I stopped playing around with verify.
> 
> So, my conclusion: Don't use verify. I want to, but it will crash the system.
> 
> I was not able to get anything out of the logs and the kernel panic was
> completely rubbish. Different panics, if there even was a panic. Mostly just a
> freeze...
> 
> I don't have the logs anymore, but the last message in the log for at least
> one of the crashes was a message like:
> "Re-transfer bitmap due to failed kmalloc" or something like that...

that message would be "Writing the whole bitmap, due to failed kmalloc".
and though the "failed kmalloc" may sound scary,
it actually is not at all, and only bad wording.
so this is no hint in any direction, whatsoever.

> The message before that was telling me that the online verify was completed, I
> believe...
> 
> So, my suggestion is stop looking for hardware errors. Instead stop using
> "verify". That's what I did when I was unable to find a hardware error...
> 
> The sad thing is that I cannot reproduce this error except on my machines
> that are in a production environment right now...

too bad.
not reproducible is almost equivalent to not fixable.
and it appears to work fine here.

 :(

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please don't Cc me, but send to list -- I'm subscribed


