[DRBD-user] Primary 0.6.13 box crashing under load

Mon Sep 19 19:14:13 CEST 2005

Setup:

Two Dell PE2650s with direct-attached SCSI PV220s / split
backplane / PERC4 dual-channel RAID.
1GB RAM
Dedicated Gig-E Crossover interface for DRBD replication
Serial connection for heartbeat
Both running RedHat AS 2.1, DRBD 0.6.10 (to begin with)
7 DRBD volumes, each approximately 140GB.  Most are under 20% full;
one is about 94% full.
Been running since late 2003 with much success.

What happened:

Under heavy write load the primary box halts in a most ugly way.
Nothing is logged other than the usual "tl messed up, transferlog too
small!!, epoch" messages (tl set at 5000).  After a handful of those,
nothing more is logged, then the machine freezes completely and we have
to cycle power.  Fortunately, the standby machine was taking over fairly
nicely.  We generated this load using bonnie++ running on a separate
client that had the DRBD box NFS-mounted.  We tried several different
NFS block sizes on the client, waiting each time for DRBD to complete
the full resync, and we saw this happen on both boxes.

What we did next:

Taking the standard approach to resolving this type of strange problem,
we updated the kernel to the latest from RedHat, and updated DRBD to
0.6.13 on both boxes (the upgrade to 0.7 was a little too complicated
for now).  That much went well.

We tried the bonnie++ tests and again (much more quickly) the primary
box failed and we switched to the standby.  THIS TIME, ext3 filesystem
corruption occurred.  I e2fsck'd the disks, and restored the filesystems
from backup.  At some point during the restore, one of the fileservers
died (same as before).

This morning, I restarted the box that died, and when it started the
full resync, it caused the primary to fail again (same as before).  So,
now without any measurable NFS load, the box is dying... same errors in
the logs.

So, for now, I've disabled drbd and am running on just a single
fileserver, standalone.  Seems OK for now.

Research:

The only thread I found on the mailing lists was from a couple of years
back, where a memory problem was suspected as the cause of these errors.
Both systems have ECC ram, but I'm checking it anyway using Dell's
MpMemory diag utility.  I've run it 3 times so far without error.

I'm at a loss.  I am curious about any possible explanation, and would
welcome recommendations on going to 0.7.

Thanks in advance.

Matt Smith