Setup:
Two Dell PE2650's with Direct Attached SCSI PV220's/split backplane/PERC4 dual channel RAID
1GB RAM
Dedicated Gig-E crossover interface for DRBD replication
Serial connection for heartbeat
Both running RedHat AS 2.1, DRBD 0.6.10 (to begin with)
7 drbd volumes, each approximately 140GB; most are under 20% full, one is about 94% full
Been running since late 2003 with much success.

What happened:
Under heavy write load the primary box will halt in a most ugly way. Nothing is logged other than the usual "tl messed up, transferlog too small!!, epoch" messages (tl set at 5000). After a handful of those, nothing, then the machine totally freezes. We have to cycle power. Fortunately, the standby machine was taking over fairly nicely.

We generated this load using bonnie++ running on a separate client that had the drbd box NFS mounted. We tried several different NFS block sizes on the client, so we saw this happen on both boxes after waiting for DRBD to complete the full resync.

What we did next:
Taking the standard approach to resolving this type of strange problem, we updated the kernel to the latest from RedHat, and updated DRBD to 0.6.13 on both boxes (the upgrade to 0.7 was a little too complicated for now). That much went well.

We tried the bonnie++ tests and again (much more quickly) the primary box failed and we switched to the standby. THIS TIME, ext3 filesystem corruption occurred. I e2fsck'd the disks, and restored the filesystems from backup. At some point during the restore, one of the fileservers died (same as before).

This morning, I restarted the box that died, and when it started the full resync, it caused the primary to fail again (same as before). So now, without any measurable NFS load, the box is dying, with the same errors in the logs.

So, for now, I've disabled drbd and am running on just a single fileserver, standalone. Seems OK for now.
Research:
The only thread I found on the mailing lists was from a couple of years back, where it was suspected that a memory problem would explain these errors. Both systems have ECC RAM, but I'm checking it anyway using Dell's MpMemory diag utility. I've run it 3 times so far without error.

I'm at a loss. I am curious about any possible explanation, and recommendations on going to 0.7...

Thanks in advance.

Matt Smith