Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
Matt Smith wrote:
>
> Anybody have ANY comments/suggestions on this??
>
Just some suggestions:

1) Use memtest86, included on the Fedora install/rescue disks, as well as
   the Dell diagnostics (I know memtest86 finds things).

2) If you are running the servers in runlevel 5, change that to runlevel 3.
   Reason: you might actually be able to see the oops/panic on the screen,
   which matters because on a panic the data usually never reaches the log
   on disk (the system is in a bad state and Linux tries not to make the
   disk any worse).

3) If you can, either log in to the monitor (serial) port of the RAID or
   use its monitoring software, and check whether the RAID is seeing a
   physical problem with one or more disks. (I have seen situations where
   weird problems at the Linux level were explained once I looked at what
   the RAID's own OS was seeing.)

4) Have you pulled the SCSI cables loose and reattached them lately?
   I.e., clean the connections.

BTW, what RAID level? 0, 1, 5, 0+1?

Good luck.

> --Matt
>
> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com
> [mailto:drbd-user-bounces at lists.linbit.com]
> Sent: Monday, September 19, 2005 1:14 PM
> To: drbd-user at lists.linbit.com
> Subject: [DRBD-user] Primary 0.6.13 box crashing under load
>
> Setup:
>
> Two Dell PE2650's with direct-attached SCSI PV220's / split backplane /
> PERC4 dual-channel RAID. 1 GB RAM. Dedicated Gig-E crossover interface
> for DRBD replication. Serial connection for heartbeat. Both running
> Red Hat AS 2.1, DRBD 0.6.10 (to begin with). 7 drbd volumes, each
> approximately 140 GB; most are under 20% full, one is about 94% full.
> Been running since late 2003 with much success.
>
> What happened:
>
> Under heavy write load the primary box will halt in a most ugly way.
> Nothing logged other than the usual "tl messed up, transferlog too
> small!!, epoch" messages (tl set at 5000). After a handful of those,
> nothing, then the machine totally freezes. Have to cycle power.
> Fortunately, the standby machine was taking over fairly nicely. We
> generated this load using bonnie++ running on a separate client that had
> the drbd box NFS-mounted. We tried several different NFS block sizes on
> the client, and we saw this happen on both boxes after waiting for DRBD
> to complete the full resync.
>
> What we did next:
>
> Taking the standard approach to resolving this type of strange problem,
> we updated the kernel to the latest from Red Hat, and updated DRBD to
> 0.6.13 on both boxes (the upgrade to 0.7 was a little too complicated
> for now). That much went well.
>
> We tried the bonnie++ tests again and (much more quickly) the primary
> box failed and we switched to the standby. THIS TIME, ext3 filesystem
> corruption occurred. I e2fsck'd the disks and restored the filesystems
> from backup. At some point during the restore, one of the fileservers
> died (same as before).
>
> This morning, I restarted the box that died, and when it started the
> full resync, it caused the primary to fail again (same as before). So,
> now without any measurable NFS load, the box is dying... same errors in
> the logs.
>
> So, for now, I've disabled drbd and am running on just a single
> fileserver, standalone. Seems OK for now.
>
> Research:
>
> The only thread I found on the mailing lists was from a couple of years
> back, where it was suspected that a memory problem would explain these
> errors. Both systems have ECC RAM, but I'm checking it anyway using
> Dell's MpMemory diag utility. I've run it 3 times so far without error.
>
> I'm at a loss. I am curious about any possible explanation, and
> recommendations on going to 0.7...
>
> Thanks in advance.
>
> Matt Smith

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter
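[Editor's sketch] Suggestion 2 above (running in runlevel 3 so a panic stays visible on the console) comes down to one line in /etc/inittab on a SysV-init system like Red Hat AS 2.1. The snippet below demonstrates the edit on a sample line rather than the real file; the `id:5:initdefault:` format is the stock SysV form, but verify it against your own /etc/inittab before changing anything.

```shell
# Demonstrate flipping the default runlevel from 5 (graphical) to 3
# (multi-user text console) on a sample copy of the initdefault line.
# On a real box you would edit /etc/inittab itself.
printf 'id:5:initdefault:\n' > /tmp/inittab.sample
sed 's/^id:5:initdefault:/id:3:initdefault:/' /tmp/inittab.sample
```

On a running system, `telinit 3` switches to runlevel 3 immediately without a reboot, which is enough to test whether the console now shows the oops.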
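[Editor's sketch] The reproduction described above drove heavy sequential writes through an NFS mount of the drbd volume using bonnie++ (invoked with its standard `-d` directory, `-s` size, and `-u` user flags; the poster's exact command line is not given). As a rough, self-contained stand-in for that write phase, a plain `dd` stream against the mount point under test exercises the same path; `/tmp/drbdtest` and the 64 MB size here are placeholders, not the original setup.

```shell
# Stand-in for the bonnie++ sequential-write phase: stream zeros to a
# file on the filesystem under test. Point the path at the NFS mount of
# the drbd volume to reproduce the load described in the thread.
mkdir -p /tmp/drbdtest
dd if=/dev/zero of=/tmp/drbdtest/bigfile bs=1M count=64 2>/dev/null
ls -l /tmp/drbdtest/bigfile
```

For the real test, bonnie++ with a file size well above RAM (1 GB here, so e.g. `-s 2048`) avoids measuring the page cache instead of the disk and replication path.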