[DRBD-user] Primary 0.6.13 box crashing under load

Thu Sep 22 20:38:01 CEST 2005

Thanks for the reply!

1.  I'll get started on the memtest86 testing.  I ran the Dell tests
about a dozen times with no failures.

2.  Already at runlevel 3... the console is VERY blank.  Nothing on the
screen at all.  Probably the console screen blanker kicked in and can't
be woken up once the box is frozen.  Maybe blanking could be disabled,
and then the screen might show something...

3.  There's no RAID monitor port or anything like that that I'm aware of.

4.  Not since this started, but given it's happening on two separate
boxes, I kind of discounted stuff like that - it's worth a try.

RAID level is 0, in hardware.  7 pairs of disks, two disks per RAID 0 volume.
That's it.
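
On point 2, here is a rough sketch of how the console blanking might be
switched off so that an oops or panic stays visible.  This is an
assumption on my part and not something tested on these boxes: it relies
on the Linux console escape sequences described in console_codes(4)
(ESC [ 9 ; n ] sets the blank timeout in minutes, ESC [ 14 ; n ] sets the
VESA powerdown interval), and the function names are made up for
illustration.  It has to be run on the text console itself, not over an
ssh session.

```python
import sys

ESC = "\x1b"


def blank_timeout_sequence(minutes):
    """Linux console sequence ESC [ 9 ; n ]: blank after n minutes (0 = never)."""
    return "%s[9;%d]" % (ESC, minutes)


def powerdown_sequence(minutes):
    """Linux console sequence ESC [ 14 ; n ]: VESA powerdown interval (0 = never)."""
    return "%s[14;%d]" % (ESC, minutes)


def disable_blanking(stream=sys.stdout):
    # Hypothetical helper: only has an effect when `stream` is a Linux
    # virtual console (for example tty1); elsewhere the bytes are ignored.
    stream.write(blank_timeout_sequence(0))
    stream.write(powerdown_sequence(0))
    stream.flush()


if __name__ == "__main__":
    disable_blanking()
```

As far as I know, "setterm -blank 0" sends the same blanking sequence,
so that may be the simpler route where setterm is installed.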

--Matt

-----Original Message-----
From: tdennist at ssa.crane.navy.mil [mailto:tdennist at ssa.crane.navy.mil]
On Behalf Of Todd Denniston
Sent: Thursday, September 22, 2005 12:40 PM
To: Matt Smith
Cc: drbd-user at linbit.com
Subject: Re: [DRBD-user] Primary 0.6.13 box crashing under load

Matt Smith wrote:
> 
> Anybody have ANY comments/suggestions on this??
> 

Just some suggestions:
1) Use memtest86, included on the Fedora install/rescue disks, as well as
the Dell stuff (I know memtest86 finds things).

2) If you are running the servers in run level 5, change that to run
level 3.
	Reason: you might actually be able to see the oops/panic on the
screen. That matters because with a panic the data usually is not
written to the log on disk (the system is messed up and Linux does not
want to make the disk any worse).

3) If you can either log in to the monitor (serial) port of the RAID or
use their monitor software, check whether the RAID is seeing a
physical problem with one or more disks. (I have seen situations where I
was getting weird problems at the Linux level that were explained when
I looked at what the RAID's OS was seeing.)

4) Have you pulled the SCSI cables loose and reattached them lately?
I.e., reseat them to clean the connections.

BTW what RAID level? 0, 1, 5, 0+1?

Good luck.

> --Matt
> 
> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com
> [mailto:drbd-user-bounces at lists.linbit.com]
> Sent: Monday, September 19, 2005 1:14 PM
> To: drbd-user at lists.linbit.com
> Subject: [DRBD-user] Primary 0.6.13 box crashing under load
> 
> Setup:
> 
> Two Dell PE2650's with Direct Attached SCSI PV220's/split
> backplane/PERC4 dual channel Raid.
> 1GB Ram
> Dedicated Gig-E Crossover interface for DRBD replication
> Serial connection for heartbeat
> Both running RedHat AS 2.1, DRBD 0.6.10 (to begin with)
> 7 drbd volumes - each approximately 140GB - most are under 20% full..
> one is about 94% full.
> Been running since late 2003 with much success.
> 
> What happened:
> 
> Under heavy write load the primary box will halt in a most ugly way. 
> Nothing logged other than the usual "tl messed up, transferlog too 
> small!!, epoch" messages (tl set at 5000).  After a handful of those, 
> nothing, then the machine totally freezes.  Have to cycle power. 
> Fortunately, the standby machine was taking over fairly nicely.  We 
> generated this load using bonnie++ running on a separate client that 
> had the drbd box nfs mounted.  We tried several different nfs block 
> sizes on the client, so we saw this happen on both boxes after waiting
> for DRBD to complete the full resync.
> 
> What we did next:
> 
> Taking the standard approach to resolving this type of strange 
> problem, we updated the kernel to the latest from RedHat, and updated 
> DRBD to 0.6.13 on both boxes (the upgrade to 0.7 was a little too 
> complicated for now).  That much went well.
> 
> We tried the bonnie++ tests and again (much more quickly) the primary 
> box failed and we switched to the standby.  THIS TIME, ext3 filesystem
> corruption occurred.  I e2fsck'd the disks, and restored the
> filesystems from backup.  At some point during the restore, one of the
> fileservers died (same as before).
> 
> This morning, I restarted the box that died, and when it started the 
> full resync, it caused the primary to fail again (same as before).  
> So, now without any measurable NFS load, the box is dying... same 
> errors in the logs.
> 
> So, for now, I've disabled drbd and am running on just a single 
> fileserver, standalone.  Seems OK for now.
> 
> Research:
> 
> The only thread I found on the mailing lists was a couple years back 
> where it was suspected that a memory problem would explain these 
> errors. Both systems have ECC ram, but I'm checking it anyway using 
> Dell's MpMemory diag utility.  I've run it 3 times so far without 
> error.
> 
> I'm at a loss.  I am curious about any possible explanation, and 
> recommendations on going to 0.7...
> 
> Thanks in advance.
> 
> Matt Smith

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter