[DRBD-user] drbdadm verify all seems to produce false positives on ext3 and crash the server

Wed Jun 25 09:57:52 CEST 2008

Yeah, I think the cable is not the culprit.

RAM seems OK: Memtest didn't detect anything (tested for 20 hours).
The server uses Fully Buffered DIMMs, which I'm pretty sure correct errors.
System Memory Testing is enabled in the BIOS.

I think the problem lies with the RAID controller.
Debian Etch (and Ubuntu) don't provide a recent enough driver to work 
reliably with this recent controller, according to this post:
http://ubuntuforums.org/showthread.php?s=eee2a8d3d4447c3e355014c18770e89b&t=719556
I dismissed the warning about the obsolete driver (minimum required 
driver version = 00.00.03.13; I use 00.00.03.01); I shouldn't have.
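
For what it's worth, the version check itself is easy to sketch. This is 
only an illustration: it assumes the driver version is a dotted string of 
numbers like the two above, and the function names are made up for the 
example. The installed version is typed in by hand here rather than read 
from the system.

```python
def parse_version(version):
    """Split a dotted version string such as '00.00.03.13' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_is_recent_enough(installed, required):
    """Return True when the installed version is at least the required one."""
    # Tuples compare element by element, so (0, 0, 3, 1) < (0, 0, 3, 13).
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    installed = "00.00.03.01"   # version in use on this server
    required = "00.00.03.13"    # minimum version quoted in the forum post
    if driver_is_recent_enough(installed, required):
        print("driver OK")
    else:
        print("driver too old: %s < %s" % (installed, required))
```

Comparing the parts as integers rather than as plain strings avoids 
surprises if a component ever differs in width (for example "3" versus 
"03").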

I suppose this could explain the crashes under heavy I/O load (with 
drbdadm verify all), the data corruption, and something else I 
experienced yesterday:
1) a kernel panic at the very beginning of the boot process:
"(...)
  <0> Kernel Panic - not syncing : Attempted to kill the idle task !"

2) after I forcibly rebooted the server, it got stuck in a loop:
"Starting Systems Management Device Drivers
  Starting ipmi driver :
  Starting Systems Management Device Drivers
  Starting ipmi driver :
  Starting Systems Management Device Drivers
  Starting ipmi driver :
  Starting Systems Management Device Drivers
  Starting ipmi driver :
  Starting Systems Management Device Drivers
  Starting ipmi driver :
  Starting Systems Management Device Drivers
  Starting ipmi driver :
  (...)"

3) the third boot (this time, I chose single user mode and pressed 
Ctrl+D to "continue") hung for about ten seconds at:
"INIT : Entering runlevel : 2
  Starting system log daemon : syslogd
  Starting kernel log daemon : klogd"
  then continued normally.

Besides, a firmware update has just been posted on Dell's site, with 
criticality = urgent.

Eric

Brian Candler wrote:
> On Tue, Jun 24, 2008 at 10:42:37AM +0200, Eric Marin wrote:
>> Maybe the crossover ethernet cable is simply bad (!)
> 
> Aside: I'd say this is unlikely. A packet corrupted on the wire would have
> to pass both the ethernet CRC check and the TCP checksum. That is, there
> would have to be a very severe problem at layer 1 for it to occasionally
> bypass the protections at layers 2 and 4 - there would also be lots of
> packet loss and very poor TCP performance.
> 
> To be more sure you can look for errors using netstat -i. If those counters
> are zero then you can be pretty sure that the cabling is not the problem.
> 
> Regarding RAM: the type which detects and/or corrects errors is called
> "ECC". As well as having the right type of RAM, your motherboard needs to
> support ECC, and have it enabled, to get this protection.
> 
> Regards,
> 
> Brian.
>
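
The check Brian suggests (looking at the interface error counters) can 
also be scripted. The sketch below is only an illustration and makes one 
assumption: that the counters are read from Linux's /proc/net/dev, which 
exposes the same kind of per-interface error counts that netstat -i 
reports. The function name is made up for the example.

```python
def interface_errors(proc_net_dev_text):
    """Return {interface: (rx_errs, tx_errs)} parsed from /proc/net/dev content."""
    errors = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 11:
            continue  # not a counter line in the expected layout
        # Receive columns come first (bytes, packets, errs, ...), then the
        # transmit columns start at index 8 (bytes, packets, errs, ...).
        errors[name.strip()] = (int(fields[2]), int(fields[10]))
    return errors
```

To use it, pass in the contents of /proc/net/dev, for example 
interface_errors(open("/proc/net/dev").read()). If both numbers stay at 
zero for the interface carrying the DRBD link, the cabling is very 
unlikely to be the problem.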