Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Thanks to those who helped on the first go-around of this issue. We have a better handle on the indicators, but still no solution.

We use DRBD to handle our 8000 (or so) POP/IMAP accounts. These live on 2 separate servers, each with half the alphabet, and each set to fail over to the other if one goes down. MDA01 handles accounts beginning with "A" through "L", while MDA02 handles letters "M" through "Z", plus a "/work/" partition. Each server replicates its letters (via DRBD) to the other server, and heartbeat can force all the partitions to be handled on a single server if the other goes down. The various mail servers connect to the DRBD machines via NFSv3 (rw,udp,intr,noatime,nolock).

We have been using this system for several years now without serious issues, until suddenly, on or about noon on May 10, 2011, the server load went from a normal less-than-1.0 to 14+. It does not remain high all day: it seems to run normally for about 10.5 hours and then higher than normal for the next 13.5 hours, although on Tuesday it never hit the high loads even though activity was at a normal level. It also builds to a peak (anywhere from 12 to 28) somewhere in the middle of those 13.5 hours, holds it for only a few minutes (10-15), and then trails off again. The pattern repeats on weekends too, but the load is much lower (3-4).

Each server has almost identical drives allocated for this purpose.

mda01:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               vLYWPe-TsJk-L6cv-Dycp-GBNp-XyfV-KkxzET

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               89695
  Allocated PE          231808
  PV UUID               QVpf66-euRN-je7I-L0oq-Cahk-ezFr-HAa3vx

mda02:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               QY8V1P-uCni-Va3b-2Ypl-7wP9-lEtl-ofD05G

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               87135
  Allocated PE          234368
  PV UUID               E09ufG-Qkep-I4hB-3Zda-n7Vy-7zXZ-Nn0Lvi

So, a 1.75 TB virtual disk on each server, 81% allocated. The part that really confuses me is that the two 500 GB drives (sdb) seem to always have ten times as many writes going on as the 1.23 TB drives (sdc1).
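If it would help to see exactly which logical volumes sit on the busy 500 GB drive versus the larger one, we could map extents to devices with LVM's own reporting tools. A rough sketch (assuming the "lvm" volume group shown above):

  # List each logical volume with the physical device(s) its extents occupy
  /usr/sbin/lvs -o +devices lvm

  # Or the per-PV view: how each physical volume's extents are allocated to LVs
  /usr/sbin/pvdisplay -m /dev/sdb /dev/sdc1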
(Parts of the iostat output have been removed for readability.)

mda01:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda01.adhost.com)   05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.04    2.09   19.41    0.00   78.37

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               3.21        25.87        73.52    36074352   102536956
sda1              0.00         0.00         0.00        2566          60
sda2              0.00         0.00         0.00        2664         448
sda3              3.21        25.86        73.52    36066666   102536448
sdb             336.60       205.27      5675.40   286280151  7915188839
sdc              27.56       148.08       622.15   206517470   867684260
sdc1             27.56       148.08       622.15   206514758   867684260

mda02:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda02.adhost.com)   05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.05    1.87   12.33    0.00   85.65

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               2.84        18.23        68.26    25339938    94863896
sda1              0.00         0.00         0.00        2920          56
sda2              0.00         0.00         0.00        2848        1568
sda3              2.84        18.23        68.26    25331994    94862272
sdb             333.45       109.90      5679.15   152727845  7892497866
sdc              29.93       124.20       588.41   172601220   817732660
sdc1             29.93       124.20       588.41   172598796   817732660

We have checked the network I/O and the NICs. There are no errors, no dropped packets, no overruns, etc.; the NICs look perfect. We have also run rkhunter and chkrootkit on both machines, and they found nothing.

RedHat 5.3 (2.6.18-128.1.10.el5)
DRBD 8.3.1
Heartbeat 2.1.4

Again, any ideas about what is happening, and/or additional diagnostics we might run, would be much appreciated. Thank you.

- Richard
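P.S. In case it helps anyone who wants to suggest something specific, here is a rough sketch of the additional diagnostics we could run during the next high-load window (assuming the stock sysstat, nfs-utils, and DRBD userland packages on these boxes):

  # Extended per-device statistics (queue size, await, %util) every 5 seconds
  iostat -dx 5

  # Server-side NFS operation counts, to see what the mail servers are actually sending
  nfsstat -s

  # DRBD resource state, connection state, and resync activity
  cat /proc/drbd

  # Memory/swap/IO/run-queue overview alongside the load spike
  vmstat 5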