[DRBD-user] Sudden high CPU load [SOLVED]

Richard Stockton drbd at richardleestockton.org
Wed May 18 22:02:31 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Felix,

Thanks for responding, answers (and resolution) below...

At 12:20 AM 5/18/2011, Felix Frank wrote:
>On 05/18/2011 04:09 AM, Richard Stockton wrote:
> > Hi drbd folks,
> >
> > We have been using DRBD with heartbeat for several years now.
> > The load averages on our RedHat servers have almost always stayed
> > below 1.0.  Suddenly last week the loads jumped up to 12-14 on
> > both servers.  They go down at night, when the usage goes down,
> > but by mid-day every business day they are back to 14.
> >
> > We don't see anything out of the ordinary in logs, no drive
> > warning lights, no degraded RAID, nothing especially out of
> > bounds in iostat, NFS is running smoothly, tcpdump doesn't
> > show any nastiness (that we can see)....
> > We have run out of ideas.
> >
> > Our setup is 2 disk arrays, with each server the fail-over for
> > the other.  These are POP/IMAP accounts with half the alphabet
> > on one server and half on the other.  If one server fails, the
> > other one takes over all of the alphabet.  Each letter is a
> > separate resource, with a separate IP, etc.  The size of the
> > letter partitions range from 5G to 120G.  The pop/imap servers
> > access the letters via NFS3.
> >
> > RedHat 5.3 (2.6.18-128.1.10.el5)
> > DRBD 8.3.1
> > Heartbeat 2.1.4
> >
> > Has anyone seen this sort of behavior before, and if so, what
> > the heck is happening?  We would appreciate any suggestions.
>
>Is NFS mounted sync or async?

NFS is mounted "sync" (NFS3 default, I believe).
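(For anyone who wants to check their own boxes, the mount options show
up in /proc/mounts.  A rough, untested Python sketch of that check,
just reading the standard proc interface:)

    #!/usr/bin/env python
    # Sketch: list NFS mounts and report whether "sync" is among
    # their mount options (otherwise the client default, async).
    with open("/proc/mounts") as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype.startswith("nfs"):
                mode = "sync" if "sync" in options.split(",") else "async"
                print("%s on %s: %s" % (device, mountpoint, mode))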

>Which I/O scheduler is in use for your DRBDs (i.e., for their backing
>devices)?

cfq (also the default, I believe)
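(For reference, the active elevator is the bracketed name in
/sys/block/<dev>/queue/scheduler.  A quick Python sketch to print it
for every block device, assuming the usual sysfs layout:)

    #!/usr/bin/env python
    # Sketch: print the scheduler list for each block device; the
    # name in square brackets is the elevator currently in use.
    import glob
    for path in glob.glob("/sys/block/*/queue/scheduler"):
        dev = path.split("/")[3]
        with open(path) as f:
            print("%s: %s" % (dev, f.read().strip()))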


Now for the interesting part.  This problem had been occurring every
day for about a week, so I joined this list yesterday and posted my
issue.  When I came to work this morning the problem had magically
subsided!  Loads are now back to 2.0 or less, even during the busy
times.  This is still slightly higher than normal, but certainly not
enough to adversely affect performance.

At this point we are assuming that some rebuild was happening in the
background, and because of the somewhat large disk arrays (1.25 TB
each), it just took a very long time.  Either that or just the act
of joining the list frightened the system into submission. [grin]
In any case, all appears to be good now.
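(For next time: a background resync is easy to confirm from
/proc/drbd, the status interface DRBD 8.3 exposes.  A rough Python
sketch that flags anything not in the normal Connected/UpToDate
state, plus any sync progress lines:)

    #!/usr/bin/env python
    # Sketch: scan /proc/drbd and flag resources that are not
    # Connected with both disks UpToDate, plus resync progress lines.
    with open("/proc/drbd") as f:
        for raw in f:
            line = raw.strip()
            if "cs:" in line:
                if ("cs:Connected" not in line
                        or "ds:UpToDate/UpToDate" not in line):
                    print("needs attention: " + line)
            elif "sync'ed" in line:
                print("resync progress: " + line)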

However, we would still love to know exactly what was happening,
so we can be prepared to deal with it if it ever happens again.
If anyone has a clue, please share.

Thanks again.
   - Richard



