[DRBD-user] Sudden system load

Richard Stockton drbd at richardleestockton.org
Thu Jun 9 03:42:28 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Pascal,

Sorry it took a while to respond this time.

At 11:37 PM 5/26/2011, Pascal BERTON wrote:
>Hi, Richard!
>
>Ok, good news for the NICs, but bad news for you unfortunately :)
>Well, since /dev/sdb is used by this /work directory, we might try to
>investigate along that track. Some info is still missing though: when the
>platform calms down, do you still see a 10:1 ratio between sdb and the
>other volumes' write activity?

Yes.  It pretty much stays the same regardless of the load.

>When the platform experiences this short peak level, does this 10:1 ratio
>hit a peak too, and maybe rise to 15:1 or 20:1?

Nope.

>What I'm suspecting, now that you've checked a good number of things, is:
>what if DRBD were doing nothing but what it is asked to, and the problem
>were coming from your mail application? I mean, if your software starts
>doing odd things, DRBD will react too, and what you see might only be the
>visible part of the iceberg...
>Didn't you install some patches on May 10th? Or did you reconfigure
>something in your mail subsystem that would explain this increase in /work
>usage?

No.  I _WISH_ we had changed something, but we didn't.

>More generally, what is causing these writes to /dev/sdb? What
>type of activity? (Please don't answer "work"! ;) )

I have no idea.  Note that /work is only allotted 50G, less than 10% of the
total available on that disk; the rest is allotted to letter partitions.  I
haven't found a good way to see which letters (or /work) are writing to
/dev/sdb as opposed to /dev/sdc.  The only thing writing to /work with any
regularity is squirrelmail's pref files, and that certainly can't account
for this much activity (especially since the write activity doesn't drop
when we disable the webmail servers).
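
The closest I have come to breaking that down further is at the LVM layer.
If I understand the tools right, something like the commands below should
map each logical volume to the physical volume it sits on and then report
per-LV write rates.  Treat this as a sketch; I haven't verified these
options against the old LVM2/sysstat versions on RHEL 5.3, and the -N flag
in particular may not exist in our iostat:

mda01:root:> lvs -o +devices
    # adds a "Devices" column showing which PV (/dev/sdb or /dev/sdc1)
    # backs each logical volume
mda01:root:> lvdisplay -m
    # the same mapping, segment by segment, for every LV
mda01:root:> iostat -N 5 3
    # per device-mapper name (i.e. per LV) tps and Blk_wrtn/s,
    # sampled every 5 seconds, 3 samples

If iostat -N isn't supported, the dm-* lines in /proc/diskstats carry the
same counters, and "dmsetup ls" shows which minor number belongs to which
LV, so the heavy writer should still be identifiable that way.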

Still at a loss here, any help appreciated.
Thanks.
  - Richard


>Best regards,
>
>Pascal.
>
>-----Original Message-----
>From: drbd-user-bounces at lists.linbit.com
>[mailto:drbd-user-bounces at lists.linbit.com] On behalf of Richard Stockton
>Sent: Friday, May 27, 2011 03:43
>To: drbd-user at lists.linbit.com
>Subject: [DRBD-user] Sudden system load
>
>Thanks to those who helped on the first go-around of this issue.
>We have a better handle on the indicators, but still no solution.
>
>We use DRBD to handle our 8000 (or so) POP/IMAP accounts.  These exist
>on 2 separate servers, each with half the alphabet, and each set to
>fail over to the other if one goes down.  So MDA01 handles accounts
>beginning with "A" through "L", while MDA02 handles letters "M" through
>"Z", plus a "/work/" partition.  Each server replicates its letters (via
>DRBD) to the other server, and heartbeat can force all the partitions to
>be handled on a single server if the other goes down.  The various mail
>servers connect to the DRBD machines via NFSv3 (rw,udp,intr,noatime,nolock).
>
>We had been using this system for several years without serious
>issues until suddenly, at about noon on May 10, 2011, the server
>load went from its normal level of less than 1.0 to 14+.  It does not
>remain high all day; it seems to run normally for about 10.5 hours and
>then higher than normal for the next 13.5 hours, although on Tuesday it
>never hit the high loads even though activity was at a normal level.
>It also builds to a peak (anywhere from 12 to 28) somewhere in the
>middle of that 13.5 hours, holds it for only a few minutes (10-15), and
>then trails off again.  The pattern repeats on weekends too, but the
>load is much lower (3-4).
>
>Each server has almost identical drives allocated for this purpose.
>
>mda01:root:> /usr/sbin/pvdisplay
>    --- Physical volume ---
>    PV Name               /dev/sdb
>    VG Name               lvm
>    PV Size               544.50 GB / not usable 4.00 MB
>    Allocatable           yes (but full)
>    PE Size (KByte)       4096
>    Total PE              139391
>    Free PE               0
>    Allocated PE          139391
>    PV UUID               vLYWPe-TsJk-L6cv-Dycp-GBNp-XyfV-KkxzET
>
>    --- Physical volume ---
>    PV Name               /dev/sdc1
>    VG Name               lvm
>    PV Size               1.23 TB / not usable 3.77 MB
>    Allocatable           yes
>    PE Size (KByte)       4096
>    Total PE              321503
>    Free PE               89695
>    Allocated PE          231808
>    PV UUID               QVpf66-euRN-je7I-L0oq-Cahk-ezFr-HAa3vx
>
>mda02:root:> /usr/sbin/pvdisplay
>    --- Physical volume ---
>    PV Name               /dev/sdb
>    VG Name               lvm
>    PV Size               544.50 GB / not usable 4.00 MB
>    Allocatable           yes (but full)
>    PE Size (KByte)       4096
>    Total PE              139391
>    Free PE               0
>    Allocated PE          139391
>    PV UUID               QY8V1P-uCni-Va3b-2Ypl-7wP9-lEtl-ofD05G
>
>    --- Physical volume ---
>    PV Name               /dev/sdc1
>    VG Name               lvm
>    PV Size               1.23 TB / not usable 3.77 MB
>    Allocatable           yes
>    PE Size (KByte)       4096
>    Total PE              321503
>    Free PE               87135
>    Allocated PE          234368
>    PV UUID               E09ufG-Qkep-I4hB-3Zda-n7Vy-7zXZ-Nn0Lvi
>
>So, about 1.75 TB of LVM physical volume space on each server, 81% allocated.
>
>The part that really confuses me is that the two 500 GB drives (sdb)
>always seem to have 10 times as many writes going on as the 1.23 TB
>drives (sdc).  (Parts of the iostat output removed for readability.)
>
>mda01:root:> iostat | head -13
>Linux 2.6.18-128.1.10.el5 (mda01.adhost.com)    05/26/2011
>
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.10    0.04    2.09   19.41    0.00   78.37
>
>Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>sda               3.21        25.87        73.52   36074352  102536956
>sda1              0.00         0.00         0.00       2566         60
>sda2              0.00         0.00         0.00       2664        448
>sda3              3.21        25.86        73.52   36066666  102536448
>sdb             336.60       205.27      5675.40  286280151 7915188839
>sdc              27.56       148.08       622.15  206517470  867684260
>sdc1             27.56       148.08       622.15  206514758  867684260
>
>
>mda02:root:> iostat | head -13
>Linux 2.6.18-128.1.10.el5 (mda02.adhost.com)    05/26/2011
>
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.10    0.05    1.87   12.33    0.00   85.65
>
>Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
>sda               2.84        18.23        68.26   25339938   94863896
>sda1              0.00         0.00         0.00       2920         56
>sda2              0.00         0.00         0.00       2848       1568
>sda3              2.84        18.23        68.26   25331994   94862272
>sdb             333.45       109.90      5679.15  152727845 7892497866
>sdc              29.93       124.20       588.41  172601220  817732660
>sdc1             29.93       124.20       588.41  172598796  817732660
>
>
>We have checked the network I/O and the NICs.  There are no errors, no
>dropped packets, no overruns, etc.  The NICs look perfect.
>
>We have run rkhunter and chkrootkit on both machines and they found nothing.
>
>RedHat 5.3 (2.6.18-128.1.10.el5)
>DRBD 8.3.1
>Heartbeat 2.1.4
>
>Again, any ideas about what is happening, and/or additional diagnostics
>we might run would be much appreciated.
>Thank you.
>   - Richard
>
>_______________________________________________
>drbd-user mailing list
>drbd-user at lists.linbit.com
>http://lists.linbit.com/mailman/listinfo/drbd-user



