Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Thanks to those who helped on the first go-around of this issue.
We have a better handle on the indicators, but still no solution.
We use DRBD to handle our roughly 8,000 POP/IMAP accounts. These live on
two separate servers, each holding half the alphabet, and each set to
fail over to the other if one goes down. MDA01 handles accounts
beginning with "A" through "L", while MDA02 handles "M" through "Z"
plus a "/work/" partition. Each server replicates its letters (via
DRBD) to the other, and Heartbeat can force all the partitions onto a
single server if the other goes down. The various mail servers connect
to the DRBD machines via NFSv3 (rw,udp,intr,noatime,nolock).
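For context, the mail-server mounts look roughly like the following
(hostnames match the boxes above, but the export and mount paths here
are placeholders, not the real ones):

  # /etc/fstab on a mail server -- illustrative entries only
  mda01:/export/a-l    /var/mail/a-l   nfs   rw,udp,intr,noatime,nolock,vers=3  0 0
  mda02:/export/m-z    /var/mail/m-z   nfs   rw,udp,intr,noatime,nolock,vers=3  0 0
  mda02:/export/work   /work           nfs   rw,udp,intr,noatime,nolock,vers=3  0 0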
We have been using this system for several years without serious
issues. Then, on or about noon on May 10, 2011, the server load
suddenly went from its normal level of less than 1.0 to 14+. The load
does not stay high all day: the machine seems to run normally for about
10.5 hours and then higher than normal for the next 13.5 hours,
although on Tuesday it never hit the high loads even though activity
was at a normal level. Within that 13.5-hour window the load builds to
a peak (anywhere from 12 to 28) somewhere in the middle, holds it for
only 10-15 minutes, and then trails off again. The pattern repeats on
weekends too, but the peak load is much lower (3-4).
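If anyone wants to see the shape of the curve, the load history can be
pulled out of sysstat (assuming its sa1 cron job is collecting on the
box; the day-of-month in the filename below is just an example):

  # Load averages and run-queue size for a given day; sa10 = the 10th
  # of the month -- adjust the filename for the day you want
  sar -q -f /var/log/sa/sa10
  # Or sample live: every 60 seconds, 30 samples
  sar -q 60 30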
Each server has almost identical drives allocated for this purpose.
mda01:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               vLYWPe-TsJk-L6cv-Dycp-GBNp-XyfV-KkxzET

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               89695
  Allocated PE          231808
  PV UUID               QVpf66-euRN-je7I-L0oq-Cahk-ezFr-HAa3vx
mda02:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               QY8V1P-uCni-Va3b-2Ypl-7wP9-lEtl-ofD05G

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               87135
  Allocated PE          234368
  PV UUID               E09ufG-Qkep-I4hB-3Zda-n7Vy-7zXZ-Nn0Lvi
So each server has about 1.75 TB of LVM space in the "lvm" volume
group, roughly 81% of it allocated.
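The 81% comes straight from the PE counts above; vgdisplay summarizes
the same thing in one place (VG name "lvm" as in the pvdisplay output):

  # mda01: (139391 + 231808) allocated PE of (139391 + 321503) total PE ~ 81%
  /usr/sbin/vgdisplay lvm | egrep 'VG Size|PE / Size'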
The part that really confuses me is that the two 500 GB drives (sdb on
each server) always seem to have about ten times as many writes going
on as the 1.23 TB drives (sdc). (Parts of the iostat output have been
trimmed for readability.)
mda01:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda01.adhost.com)    05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.04    2.09   19.41    0.00   78.37

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               3.21        25.87        73.52    36074352   102536956
sda1              0.00         0.00         0.00        2566          60
sda2              0.00         0.00         0.00        2664         448
sda3              3.21        25.86        73.52    36066666   102536448
sdb             336.60       205.27      5675.40   286280151  7915188839
sdc              27.56       148.08       622.15   206517470   867684260
sdc1             27.56       148.08       622.15   206514758   867684260
mda02:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda02.adhost.com)    05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.05    1.87   12.33    0.00   85.65

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               2.84        18.23        68.26    25339938    94863896
sda1              0.00         0.00         0.00        2920          56
sda2              0.00         0.00         0.00        2848        1568
sda3              2.84        18.23        68.26    25331994    94862272
sdb             333.45       109.90      5679.15   152727845  7892497866
sdc              29.93       124.20       588.41   172601220   817732660
sdc1             29.93       124.20       588.41   172598796   817732660
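We can easily capture more detail if it would help; the obvious next
steps on our end would look something like this (interval and count
values are arbitrary):

  # Extended per-device stats: average wait, service time and %util,
  # sampled every 5 seconds so sdb and sdc can be compared directly
  iostat -x 5 3
  # Server-side NFS op counts, to see what the mail servers are asking for
  nfsstat -s
  # DRBD's own counters (dw = disk writes, dr = disk reads,
  # al = activity-log updates) for each resource
  cat /proc/drbd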
We have checked the network I/O and the NICs. There are no errors, no
dropped packets, no overruns, etc. The NICs look perfect.
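(By "checked" I mean the usual interface counters, along the lines of:)

  # Per-interface error/drop/overrun counters (interface name will vary)
  /sbin/ifconfig eth0
  # Driver-level statistics for the same interface
  ethtool -S eth0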
We have run rkhunter and chkrootkit on both machines and they found nothing.
RedHat 5.3 (2.6.18-128.1.10.el5)
DRBD 8.3.1
Heartbeat 2.1.4
Again, any ideas about what is happening, and/or additional diagnostics
we might run would be much appreciated.
Thank you.
- Richard