[DRBD-user] Primary 0.6.13 box crashing under load

Matt Smith msmith at risklabs.com
Thu Sep 22 15:57:45 CEST 2005


Anybody have ANY comments/suggestions on this??  

--Matt


-----Original Message-----
From: drbd-user-bounces at lists.linbit.com
[mailto:drbd-user-bounces at lists.linbit.com] 
Sent: Monday, September 19, 2005 1:14 PM
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] Primary 0.6.13 box crashing under load


Setup:

Two Dell PE2650's with Direct Attached SCSI PV220's/split backplane/PERC4 dual channel RAID
1GB RAM
Dedicated Gig-E crossover interface for DRBD replication
Serial connection for heartbeat
Both running RedHat AS 2.1, DRBD 0.6.10 (to begin with)
7 drbd volumes - each approximately 140GB - most are under 20% full, one is about 94% full
Been running since late 2003 with much success.

What happened:

Under heavy write load the primary box will halt in a most ugly way.
Nothing logged other than the usual "tl messed up, transferlog too
small!!, epoch" messages (tl set at 5000).  After a handful of those,
nothing, then the machine totally freezes.  Have to cycle power.
Fortunately, the standby machine was taking over fairly nicely.  We
generated this load using bonnie++ running on a separate client that had
the drbd box NFS-mounted.  We repeated the test with several different NFS
block sizes on the client, waiting for DRBD to complete the full resync
between runs, so we saw this happen on both boxes.

What we did next:

Taking the standard approach to resolving this type of strange problem,
we updated the kernel to the latest from RedHat, and updated DRBD to
0.6.13 on both boxes (the upgrade to 0.7 was a little too complicated
for now).  That much went well.

We tried the bonnie++ tests and again (much more quickly) the primary
box failed and we switched to the standby.  THIS TIME, ext3 filesystem
corruption occurred.  I e2fsck'd the disks, and restored the filesystems
from backup.  At some point during the restore, one of the fileservers
died (same as before).

This morning, I restarted the box that died, and when it started the
full resync, it caused the primary to fail again (same as before).  So,
now without any measurable NFS load, the box is dying... same errors in
the logs.

So, for now, I've disabled drbd and am running on just a single
fileserver, standalone.  Seems OK for now.

Research:

The only thread I found on the mailing lists was from a couple of years
back, where a memory problem was suspected as the cause of these errors.
Both systems have ECC RAM, but I'm checking it anyway using Dell's
MpMemory diag utility.  I've run it 3 times so far without error.

I'm at a loss.  I am curious about any possible explanation, and
recommendations on going to 0.7...  

Thanks in advance.

Matt Smith



