Yeah, I think the cable is not the culprit.

RAM seems OK: Memtest didn't detect anything (tested for 20 hours). The server uses Fully Buffered memory, which I'm pretty sure corrects errors. System Memory Testing is enabled in the BIOS.

I think the problem lies with the RAID controller. Debian Etch (and Ubuntu) doesn't provide a recent enough driver to work reliably with this recent controller, according to this post:
http://ubuntuforums.org/showthread.php?s=eee2a8d3d4447c3e355014c18770e89b&t=719556

I dismissed the warning about the obsolete driver (minimum required driver version = 00.00.03.13; I use 00.00.03.01); I shouldn't have.

I suppose this could explain the crashes under heavy I/O load (with drbdadm verify all), the data corruption, and something else I experienced yesterday:

1) a kernel panic right at the very beginning of the boot process:
"(...)
<0> Kernel Panic - not syncing : Attempted to kill the idle task !"

2) after forcibly rebooting the server, it remained stuck in a loop:
"Starting Systems Management Device Drivers
Starting ipmi driver :
Starting Systems Management Device Drivers
Starting ipmi driver :
Starting Systems Management Device Drivers
Starting ipmi driver :
Starting Systems Management Device Drivers
Starting ipmi driver :
Starting Systems Management Device Drivers
Starting ipmi driver :
Starting Systems Management Device Drivers
Starting ipmi driver :
(...)"

3) the third boot (this time, I chose single user mode and pressed Ctrl+D to "continue") remained stuck for about ten seconds on:
"INIT : Entering runlevel : 2
Starting system log daemon : syslogd
Starting kernel log daemon : klogd"
then continued normally.

Besides, the firmware has just been updated on DELL's site with criticality = urgent.

Eric

Brian Candler wrote:
> On Tue, Jun 24, 2008 at 10:42:37AM +0200, Eric Marin wrote:
>> Maybe the crossover ethernet cable is simply bad (!)
>
> Aside: I'd say this is unlikely. A packet corrupted on the wire would have
> to pass both the ethernet CRC check and the TCP checksum. That is, there
> would have to be a very severe problem at layer 1 for a packet to
> occasionally bypass the protections at layers 2 and 4 - there would also
> be lots of packet loss and very poor TCP performance.
>
> To be more sure you can look for errors using netstat -i. If those counters
> are zero then you can be pretty sure that the cabling is not the problem.
>
> Regarding RAM: the type which detects and/or corrects errors is called
> "ECC". As well as having the right type of RAM, your motherboard needs to
> support ECC, and have it enabled, to get this protection.
>
> Regards,
>
> Brian.
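
For anyone wanting to script the interface error check suggested above, here is a minimal sketch in Python. It does not call netstat itself; it assumes a Linux host and reads the per-interface counters from /proc/net/dev, which exposes receive and transmit error counts for each interface. The function names are made up for illustration and the layout of /proc/net/dev is taken as the usual two header lines followed by one line per interface, so treat this as a starting point rather than a definitive tool.

```python
from typing import Dict, List, Tuple


def parse_error_counters(proc_net_dev: str) -> Dict[str, Tuple[int, int]]:
    """Return {interface: (rx_errs, tx_errs)} from /proc/net/dev text.

    Assumes the standard Linux layout: two header lines, then one line
    per interface of the form "name: <8 receive fields> <8 transmit fields>".
    """
    counters: Dict[str, Tuple[int, int]] = {}
    for line in proc_net_dev.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        rx_errs = int(fields[2])   # third receive column: errs
        tx_errs = int(fields[10])  # third transmit column: errs
        counters[name.strip()] = (rx_errs, tx_errs)
    return counters


def interfaces_with_errors(counters: Dict[str, Tuple[int, int]]) -> List[str]:
    """List the interfaces whose receive or transmit error count is non-zero."""
    return sorted(name for name, (rx, tx) in counters.items() if rx or tx)


# Hypothetical usage on a Linux host:
#   with open("/proc/net/dev") as f:
#       print(interfaces_with_errors(parse_error_counters(f.read())))
```

An empty list for the interface carrying the crossover link would be consistent with the cabling not being the problem, in line with the reasoning quoted above.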