Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 05/18/2011 01:02 PM, Richard Stockton wrote:
> Hi Felix,
>
> Thanks for responding, answers (and resolution) below...
>
> At 12:20 AM 5/18/2011, Felix Frank wrote:
>> On 05/18/2011 04:09 AM, Richard Stockton wrote:
>> > Hi drbd folks,
>> >
>> > We have been using DRBD with heartbeat for several years now.
>> > The load averages on our RedHat servers have almost always stayed
>> > below 1.0. Suddenly last week the loads jumped up to 12-14 on
>> > both servers. They go down at night, when the usage goes down,
>> > but by mid-day every business day they are back to 14.
>> >
>> > We don't see anything out of the ordinary in the logs: no drive
>> > warning lights, no degraded RAID, nothing especially out of
>> > bounds in iostat, NFS is running smoothly, and tcpdump doesn't
>> > show any nastiness (that we can see).
>> > We have run out of ideas.
>> >
>> > Our setup is 2 disk arrays, with each server the fail-over for
>> > the other. These are POP/IMAP accounts with half the alphabet
>> > on one server and half on the other. If one server fails, the
>> > other one takes over all of the alphabet. Each letter is a
>> > separate resource, with a separate IP, etc. The letter
>> > partitions range in size from 5G to 120G. The pop/imap servers
>> > access the letters via NFSv3.
>> >
>> > RedHat 5.3 (2.6.18-128.1.10.el5)
>> > DRBD 8.3.1
>> > Heartbeat 2.1.4
>> >
>> > Has anyone seen this sort of behavior before, and if so, what
>> > the heck is happening? We would appreciate any suggestions.
>>
>> Is NFS mounted sync or async?
>
> NFS is mounted "sync" (the NFSv3 default, I believe).
>
>> Which I/O scheduler is in use for your DRBDs (i.e., for their backing
>> devices)?
>
> cfq (also the default, I believe)
>
> Now for the interesting part. This problem had been occurring every
> day for about a week, so I joined this list yesterday and posted my
> issue. When I came to work this morning the problem had magically
> subsided! Loads are now back to 2.0 or less, even during the busy
> times. This is still slightly higher than normal, but certainly not
> enough to adversely affect performance.
>
> At this point we are assuming that some rebuild was happening in the
> background, and because of the somewhat large disk arrays (1.25 TB
> each), it just took a very long time. Either that, or the mere act
> of joining the list frightened the system into submission. [grin]
> In any case, all appears to be good now.
>
> However, we would still love to know exactly what was happening,
> so we can be prepared to deal with it if it ever happens again.
> If anyone has a clue, please share.

I recall someone else having similar issues, and it ended up being
related to a NIC that was on its way out and was sending bad packets
periodically, especially under heavy load. Maybe check your network
cards/cables?

Brian

> Thanks again.
>  - Richard
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
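
For anyone trying to rule out the same suspects, the checks raised in
this thread come down to a few standard commands. A minimal sketch,
assuming /dev/sdb as the DRBD backing device and eth1 as the
replication NIC (placeholder names, not Richard's actual setup):

    # Is a DRBD or md resync still running in the background?
    cat /proc/drbd
    cat /proc/mdstat      # only meaningful if the arrays are Linux md RAID

    # NFS mount options on the pop/imap clients (sync vs. async)...
    grep nfs /proc/mounts
    # ...and the export options on the NFS server side
    exportfs -v

    # Which I/O scheduler is active on the DRBD backing device?
    # The bracketed entry is the one in use, e.g. [cfq].
    cat /sys/block/sdb/queue/scheduler

    # Error and drop counters on the replication NIC; counters that keep
    # climbing point at a failing card, cable, or switch port.
    ip -s link show eth1
    ethtool -S eth1 | grep -iE 'err|drop|crc'

None of this explains a spike after the fact, but captured while the
load is high (alongside iostat -x and top) it usually narrows the
culprit down to disk, scheduler, or network.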