Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, Richard!

OK, good news for the NICs, but bad news for you, unfortunately :)

Well, since /dev/sdb is used by this /work directory, we might try to investigate that track. Some information is still missing, though:

- When the platform calms down, do you still see a 10:1 ratio between the write activity on sdb and on the other volumes?
- When the platform experiences this short peak, does that 10:1 ratio peak too, maybe rising to 15:1 or 20:1?

What I'm suspecting, now that you have checked a good number of things, is this: what if DRBD is doing nothing but what it is asked to, and the problem is coming from your mail application? I mean, if your software starts doing odd things, DRBD will react too, and what you see might only be the visible part of the iceberg...

Did you install some patches on May 10th? Or did you reconfigure something in your mail subsystem that would explain this increase in /work usage? More generally, what causes these writes to /dev/sdb? What type of activity? (Please don't answer "work"! ;) ) A couple of ways to check this are sketched after the quoted message below.

Best regards,
Pascal.

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On behalf of Richard Stockton
Sent: Friday, May 27, 2011 03:43
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] Sudden system load

Thanks to those who helped on the first go-around of this issue. We have a better handle on the indicators, but still no solution.

We use DRBD to handle our 8000 (or so) POP/IMAP accounts. These exist on 2 separate servers, each with half the alphabet, and each set to fail over to the other if one goes down. So MDA01 handles accounts beginning with "A" through "L", while MDA02 handles letters "M" through "Z", plus a "/work/" partition. Each server replicates its letters (via DRBD) to the other server, and Heartbeat can force all the partitions to be handled on a single server if the other goes down. The various mail servers connect to the DRBD machines via NFSv3 (rw,udp,intr,noatime,nolock).

We have been using this system for several years now without serious issues, until suddenly, on or about noon on May 10, 2011, the server load went from its normal level of less than 1.0 to 14+. It does not remain high all day; in fact, the system seems to run normally for about 10.5 hours and then higher than normal for the next 13.5 hours, although on Tuesday it never hit the high loads even though activity was at a normal level. Also, the load builds to a peak (anywhere from 12 to 28) somewhere in the middle of that 13.5 hours, holds it for only a few minutes (10-15), and then trails off again. The pattern repeats on weekends too, but the load is much lower (3-4).

Each server has almost identical drives allocated for this purpose.
mda01:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               vLYWPe-TsJk-L6cv-Dycp-GBNp-XyfV-KkxzET

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               89695
  Allocated PE          231808
  PV UUID               QVpf66-euRN-je7I-L0oq-Cahk-ezFr-HAa3vx

mda02:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               QY8V1P-uCni-Va3b-2Ypl-7wP9-lEtl-ofD05G

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               87135
  Allocated PE          234368
  PV UUID               E09ufG-Qkep-I4hB-3Zda-n7Vy-7zXZ-Nn0Lvi

So, a 1.75 TB virtual disk on each server, 81% allocated. The part that really confuses me is that the two 500 GB drives always seem to have 10 times as many writes going on as the 1.23 TB drives. (Parts of the iostat output removed for readability.)

mda01:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda01.adhost.com)   05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.04    2.09   19.41    0.00   78.37

Device:      tps    Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda         3.21         25.87        73.52   36074352   102536956
sda1        0.00          0.00         0.00       2566          60
sda2        0.00          0.00         0.00       2664         448
sda3        3.21         25.86        73.52   36066666   102536448
sdb       336.60        205.27      5675.40  286280151  7915188839
sdc        27.56        148.08       622.15  206517470   867684260
sdc1       27.56        148.08       622.15  206514758   867684260

mda02:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda02.adhost.com)   05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.05    1.87   12.33    0.00   85.65

Device:      tps    Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda         2.84         18.23        68.26   25339938    94863896
sda1        0.00          0.00         0.00       2920          56
sda2        0.00          0.00         0.00       2848        1568
sda3        2.84         18.23        68.26   25331994    94862272
sdb       333.45        109.90      5679.15  152727845  7892497866
sdc        29.93        124.20       588.41  172601220   817732660
sdc1       29.93        124.20       588.41  172598796   817732660

We have checked the network I/O and the NICs. There are no errors, no dropped packets, no overruns, etc. The NICs look perfect. We have run rkhunter and chkrootkit on both machines and they found nothing.

RedHat 5.3 (2.6.18-128.1.10.el5)
DRBD 8.3.1
Heartbeat 2.1.4

Again, any ideas about what is happening, and/or additional diagnostics we might run, would be much appreciated. Thank you.
 - Richard

_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
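
For the write-ratio questions above, here is a minimal sampling sketch, assuming the sysstat iostat shown in the quoted output; the device names, the 60-second interval, and the field positions for "iostat -d -k" are assumptions to adjust for your setup. It prints one line per minute with the sdb and sdc1 write rates and their ratio, so you can see whether the ~10:1 ratio holds when the box is calm and how far it climbs during the load peak:

  #!/bin/sh
  # Sample sdb vs. sdc1 write throughput once a minute and print the ratio.
  # With "iostat -d -k" the columns are: Device, tps, kB_read/s, kB_wrtn/s, ...
  # so field 4 is the write rate.  Only the second report (the last 60 s) is
  # used, because the first report is the average since boot.
  while true; do
      printf '%s  ' "$(date '+%F %T')"
      iostat -d -k sdb sdc1 60 2 | awk '
          /^Device/                   { report++ }
          report == 2 && $1 == "sdb"  { sdb  = $4 }
          report == 2 && $1 == "sdc1" { sdc1 = $4 }
          END {
              ratio = (sdc1 > 0) ? sdb / sdc1 : 0
              printf "sdb=%.1f kB/s  sdc1=%.1f kB/s  ratio=%.1f:1\n", sdb, sdc1, ratio
          }'
  done

Left running across a full day, the log should make it obvious whether the ratio tracks the load peak or stays flat.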
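
As for what type of activity is hitting /dev/sdb: per-process I/O accounting is not available on a stock 2.6.18 kernel, but block-level tracing can still show which programs the writes come from. This is only a sketch, assuming blktrace/blkparse are installed (they are not part of a default RHEL 5.3 install) and that debugfs can be mounted:

  # blktrace needs debugfs; mounting it again is harmless if it is already there.
  mount -t debugfs debugfs /sys/kernel/debug 2>/dev/null

  # Capture 60 seconds of block-layer events on /dev/sdb during a busy period.
  blktrace -d /dev/sdb -w 60 -o sdb_trace

  # Replay the trace; -s appends per-program I/O statistics, which shows whether
  # the writes come from nfsd, drbd worker threads, journal flushes, or
  # something unexpected.
  blkparse -i sdb_trace -s | less

If the heavy writers turn out to be the nfsd threads (likely, given that /work is served over NFS), the next step would be to chase the traffic back to whichever mail server or process is generating it, rather than tuning DRBD itself.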