[DRBD-user] drbdadm verify all seems to produce false positives on ext3 and crash the server

Mon Jun 30 15:27:50 CEST 2008

Hi again...
I feel the need to jump in here and share what I've found out so far...

I have experienced the same kind of problem as you have: crashes, keyboard
lockouts, blank screens, kernel panics and stuff like that. For some strange
reason on Sundays (or Monday mornings)!

Well, I tried to explain in an earlier thread what was going on and failed
completely. Anyway, I said that I would leave the system as it is to prove
that it's the verify procedure that causes the system to go completely mad.

Now, 4-5 weeks later (maybe more), nothing fishy has happened on the
system. Instead of a crash every Monday morning (3 or 4 weeks in a row),
I have not seen anything since I stopped playing around with verify.

So, my conclusion: Don't use verify. I want to, but it will crash the system.

I was not able to get anything out of the logs, and the kernel panic output
was complete rubbish: different panics each time, if there even was a panic.
Mostly just a freeze...

I don't have the logs anymore, but the last message in the log for at least
one of the crashes was a message like:
"Re-transfer bitmap due to failed kmalloc" or something like that...

The message before that said that the online verify had completed, I
believe...

So, my suggestion is: stop looking for hardware errors. Instead, stop using
"verify". That's what I did when I was unable to find a hardware error...

The sad thing is that I cannot reproduce this error except on my machines
that are in a production environment right now...

Cheers!

/Stefan

---------- Original Message -----------
From: Eric Marin <eric.marin at utc.fr>
To: drbd-user at linbit.com
Sent: Wed, 25 Jun 2008 09:57:52 +0200
Subject: Re: [DRBD-user] drbdadm verify all seems to produce false positives on ext3 and crash the server

> Yeah, I think the cable is not the culprit.
> 
> RAM seems OK, Memtest didn't detect anything (tested for 20h).
> The server uses Fully Buffered memory, which I'm pretty sure corrects
> errors. System Memory Testing is enabled in the BIOS.
> 
> I think the problem lies with the RAID controller.
> Debian Etch (and Ubuntu) doesn't provide a recent enough driver to 
> work reliably with this recent controller, according to this post :
> http://ubuntuforums.org/showthread.php?s=eee2a8d3d4447c3e355014c18770e89b&t=719556
> I dismissed the warning about the obsolete driver (minimal required 
> driver version = 00.00.03.13 ; I use 00.00.03.01); I shouldn't have.
> 
> I suppose this could explain the crashes under heavy I/O load (with 
> drbdadm verify all), the data corruption, and something else I 
> experienced yesterday :
> 1) a kernel panic right at the very beginning of the boot process ! :
> "(...)
>   <0> Kernel Panic - not syncing : Attempted to kill the idle task !"
> 
> 2) after forcibly rebooting the server, it remained stuck in a loop :
> "Starting Systems Management Device Drivers
>   Starting ipmi driver :
>   Starting Systems Management Device Drivers
>   Starting ipmi driver :
>   Starting Systems Management Device Drivers
>   Starting ipmi driver :
>   Starting Systems Management Device Drivers
>   Starting ipmi driver :
>   Starting Systems Management Device Drivers
>   Starting ipmi driver :
>   Starting Systems Management Device Drivers
>   Starting ipmi driver :
>   (...)"
> 
> 3) the third boot (this time, I chose single user mode and pressed 
> Ctrl+D to "continue") remained stuck for about ten seconds on :
> "INIT : Entering runlevel : 2
>   Starting system log daemon : syslogd
>   Starting kernel log daemon : klogd"
>   then continued normally.
> 
> Besides, the firmware has just been updated on DELL's site with 
> criticality = urgent.
> 
> Eric
> 
> Brian Candler wrote :
> > On Tue, Jun 24, 2008 at 10:42:37AM +0200, Eric Marin wrote:
> >> Maybe the crossover ethernet cable is simply bad (!)
> > 
> > Aside: I'd say this is unlikely. A packet corrupted on the wire would have
> > to pass both the ethernet CRC check and the TCP checksum. That is, there
> > would have to be a problem at layer 1 so severe that it could occasionally
> > bypass the protections at layers 2 and 4 - and there would also be lots of
> > packet loss and very poor TCP performance.
> > 
> > To be more sure you can look for errors using netstat -i. If those counters
> > are zero then you can be pretty sure that the cabling is not the problem.
> > 
> > Regarding RAM: the type which detects and/or corrects errors is called
> > "ECC". As well as having the right type of RAM, your motherboard needs to
> > support ECC, and have it enabled, to get this protection.
> > 
> > Regards,
> > 
> > Brian.
> > 
------- End of Original Message -------