Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Thanks to those who helped on the first go-around of this issue.
We have a better handle on the indicators, but still no solution.
We use DRBD to handle our roughly 8,000 POP/IMAP accounts. These live on
two separate servers, each holding half the alphabet, and each set to
fail over to the other if one goes down. MDA01 handles accounts
beginning with "A" through "L", while MDA02 handles "M" through "Z"
plus a "/work/" partition. Each server replicates its letters (via
DRBD) to the other, and Heartbeat can force all the partitions onto a
single server if the other goes down. The various mail servers connect
to the DRBD machines via NFSv3 (rw,udp,intr,noatime,nolock).
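For context, the mail-server mounts look roughly like the following
(hostnames match the boxes above, but the export and mount paths here
are placeholders, not the real ones):

  # /etc/fstab on a mail server -- illustrative entries only
  mda01:/export/a-l    /var/mail/a-l   nfs   rw,udp,intr,noatime,nolock,vers=3  0 0
  mda02:/export/m-z    /var/mail/m-z   nfs   rw,udp,intr,noatime,nolock,vers=3  0 0
  mda02:/export/work   /work           nfs   rw,udp,intr,noatime,nolock,vers=3  0 0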
We have been using this system for several years without serious
issues. Then, on or about noon on May 10, 2011, the server load
suddenly went from its normal level of less than 1.0 to 14+. The load
does not stay high all day: the machine seems to run normally for about
10.5 hours and then higher than normal for the next 13.5 hours,
although on Tuesday it never hit the high loads even though activity
was at a normal level. Within that 13.5-hour window the load builds to
a peak (anywhere from 12 to 28) somewhere in the middle, holds it for
only 10-15 minutes, and then trails off again. The pattern repeats on
weekends too, but the peak load is much lower (3-4).
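If anyone wants to see the shape of the curve, the load history can be
pulled out of sysstat (assuming its sa1 cron job is collecting on the
box; the day-of-month in the filename below is just an example):

  # Load averages and run-queue size for a given day; sa10 = the 10th
  # of the month -- adjust the filename for the day you want
  sar -q -f /var/log/sa/sa10
  # Or sample live: every 60 seconds, 30 samples
  sar -q 60 30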
Each server has almost identical drives allocated for this purpose.
mda01:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               vLYWPe-TsJk-L6cv-Dycp-GBNp-XyfV-KkxzET

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               89695
  Allocated PE          231808
  PV UUID               QVpf66-euRN-je7I-L0oq-Cahk-ezFr-HAa3vx
mda02:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               QY8V1P-uCni-Va3b-2Ypl-7wP9-lEtl-ofD05G

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               87135
  Allocated PE          234368
  PV UUID               E09ufG-Qkep-I4hB-3Zda-n7Vy-7zXZ-Nn0Lvi
So each server has about 1.75 TB of LVM space in the "lvm" volume
group, roughly 81% of it allocated.
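The 81% comes straight from the PE counts above; vgdisplay summarizes
the same thing in one place (VG name "lvm" as in the pvdisplay output):

  # mda01: (139391 + 231808) allocated PE of (139391 + 321503) total PE ~ 81%
  /usr/sbin/vgdisplay lvm | egrep 'VG Size|PE / Size'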
The part that really confuses me is that the two 500 GB drives (sdb on
each server) always seem to have about ten times as many writes going
on as the 1.23 TB drives (sdc). (Parts of the iostat output have been
trimmed for readability.)
mda01:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda01.adhost.com)    05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.04    2.09   19.41    0.00   78.37

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               3.21        25.87        73.52    36074352   102536956
sda1              0.00         0.00         0.00        2566          60
sda2              0.00         0.00         0.00        2664         448
sda3              3.21        25.86        73.52    36066666   102536448
sdb             336.60       205.27      5675.40   286280151  7915188839
sdc              27.56       148.08       622.15   206517470   867684260
sdc1             27.56       148.08       622.15   206514758   867684260
mda02:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda02.adhost.com)    05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.05    1.87   12.33    0.00   85.65

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               2.84        18.23        68.26    25339938    94863896
sda1              0.00         0.00         0.00        2920          56
sda2              0.00         0.00         0.00        2848        1568
sda3              2.84        18.23        68.26    25331994    94862272
sdb             333.45       109.90      5679.15   152727845  7892497866
sdc              29.93       124.20       588.41   172601220   817732660
sdc1             29.93       124.20       588.41   172598796   817732660
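We can easily capture more detail if it would help; the obvious next
steps on our end would look something like this (interval and count
values are arbitrary):

  # Extended per-device stats: average wait, service time and %util,
  # sampled every 5 seconds so sdb and sdc can be compared directly
  iostat -x 5 3
  # Server-side NFS op counts, to see what the mail servers are asking for
  nfsstat -s
  # DRBD's own counters (dw = disk writes, dr = disk reads,
  # al = activity-log updates) for each resource
  cat /proc/drbd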
We have checked the network I/O and the NICs. There are no errors, no
dropped packets, no overruns, etc. The NICs look perfect.
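(By "checked" I mean the usual interface counters, along the lines of:)

  # Per-interface error/drop/overrun counters (interface name will vary)
  /sbin/ifconfig eth0
  # Driver-level statistics for the same interface
  ethtool -S eth0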
We have run rkhunter and chkrootkit on both machines and they found nothing.
RedHat 5.3 (2.6.18-128.1.10.el5)
DRBD 8.3.1
Heartbeat 2.1.4
Again, any ideas about what is happening, and/or additional diagnostics
we might run would be much appreciated.
Thank you.
- Richard