Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Pascal,

Sorry it took a while to respond this time.

At 11:37 PM 5/26/2011, Pascal BERTON wrote:
>Hi, Richard!
>
>Ok, good news for the NICs, but bad news for you unfortunately :)
>Well, since /dev/sdb is used by this /work directory, we might try to
>investigate this track. Some info is missing though: when the platform
>calms down, do you still see a 10:1 ratio between sdb and the other
>volumes' write activity?

Yes. It pretty much stays the same regardless of the load.

>When the platform experiences this short peak level, does this 10:1 ratio
>hit a peak too, and maybe rise to 15:1 or 20:1?

Nope.

>What I'm suspecting, now that you've checked a good amount of things, is:
>what if DRBD was doing nothing but what it is asked to, and the problem
>was coming from your mail application? I mean, if your software starts
>doing odd things, DRBD will react too, and what you see might only be
>the visible part of the iceberg...
>Didn't you install some patches on May 10th? Or did you reconfigure
>something in your mail subsystem that would explain this /work usage
>increase?

No. I _WISH_ we had changed something, but we didn't.

>More generally, what does cause these writes to /dev/sdb? What
>type of activity? (Please don't answer "work"! ;) )

I have no idea. Note that /work is only allotted 50G, less than 10% of
the total available on that disk. The rest is allotted to letters.
I don't know of a way to see which letters (or /work) are writing to
/dev/sdb as opposed to /dev/sdc (one possible approach is sketched
below). The only thing writing to /work with any regularity is
squirrelmail's pref files, and that certainly can't account for that
much activity (especially since the write activity doesn't go down if
we disable the webmail servers).

Still at a loss here, any help appreciated.
Thanks.
 - Richard
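One way to narrow this down is sketched below: map each logical volume to
its backing physical volume, then watch per-LV write rates, so the /dev/sdb
traffic can be attributed to a specific letter range or to /work. Only the
VG name "lvm" is taken from the pvdisplay output further down; the /work
mount point stands in for whatever is actually mounted, and everything else
is an assumption, not something confirmed in this thread.

  # Sketch only -- VG name "lvm" comes from pvdisplay below; the /work
  # mount point is just an example of a filesystem on one of these LVs.

  # 1. Which logical volumes sit on /dev/sdb and which on /dev/sdc1?
  lvs -o lv_name,vg_name,devices lvm

  # 2. LVs show up as dm-<minor> in /proc/diskstats and in iostat output;
  #    dmsetup maps each LV name to its (major, minor) pair. The drbd*
  #    rows give the same per-resource view one layer up.
  dmsetup ls
  iostat -d -k 5        # watch the dm-* (and drbd*) rows next to sdb/sdc

  # 3. Once a busy LV stands out, check what holds files open on the
  #    filesystem mounted from it, e.g. /work:
  fuser -vm /work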
>Best regards,
>
>Pascal.
>
>-----Original Message-----
>From: drbd-user-bounces at lists.linbit.com
>[mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Richard Stockton
>Sent: Friday, May 27, 2011 03:43
>To: drbd-user at lists.linbit.com
>Subject: [DRBD-user] Sudden system load
>
>Thanks to those who helped on the first go-around of this issue.
>We have a better handle on the indicators, but still no solution.
>
>We use DRBD to handle our 8000 (or so) pop/imap accounts. These exist
>on two separate servers, each with 1/2 the alphabet, and each set to
>fail over to the other if one goes down. So MDA01 handles accounts
>beginning with "A" through "L", while MDA02 handles letters "M" through
>"Z", plus a "/work/" partition. Each server replicates its letters (via
>DRBD) to the other server, and heartbeat can force all the partitions to
>be handled on a single server if the other goes down. The various mail
>servers connect to the DRBD machines via NFSv3 (rw,udp,intr,noatime,nolock).
>
>We have been using this system for several years now without serious
>issues, until suddenly, at about noon on May 10, 2011, the server
>load went from its normal level of less than 1.0 to 14+. It does not
>remain high all day; in fact it seems to run normally for about 10.5
>hours and then run higher than normal for the next 13.5 hours, although
>on Tuesday it never hit the high loads even though activity was at a
>normal level. Also, it builds to a peak (anywhere from 12 to 28)
>somewhere in the middle of that 13.5 hours, holds it for only a few
>minutes (10-15) and then trails off again. The pattern repeats on
>weekends too, but the load is much lower (3-4).
>
>Each server has almost identical drives allocated for this purpose.
>
>mda01:root:> /usr/sbin/pvdisplay
>  --- Physical volume ---
>  PV Name               /dev/sdb
>  VG Name               lvm
>  PV Size               544.50 GB / not usable 4.00 MB
>  Allocatable           yes (but full)
>  PE Size (KByte)       4096
>  Total PE              139391
>  Free PE               0
>  Allocated PE          139391
>  PV UUID               vLYWPe-TsJk-L6cv-Dycp-GBNp-XyfV-KkxzET
>
>  --- Physical volume ---
>  PV Name               /dev/sdc1
>  VG Name               lvm
>  PV Size               1.23 TB / not usable 3.77 MB
>  Allocatable           yes
>  PE Size (KByte)       4096
>  Total PE              321503
>  Free PE               89695
>  Allocated PE          231808
>  PV UUID               QVpf66-euRN-je7I-L0oq-Cahk-ezFr-HAa3vx
>
>mda02:root:> /usr/sbin/pvdisplay
>  --- Physical volume ---
>  PV Name               /dev/sdb
>  VG Name               lvm
>  PV Size               544.50 GB / not usable 4.00 MB
>  Allocatable           yes (but full)
>  PE Size (KByte)       4096
>  Total PE              139391
>  Free PE               0
>  Allocated PE          139391
>  PV UUID               QY8V1P-uCni-Va3b-2Ypl-7wP9-lEtl-ofD05G
>
>  --- Physical volume ---
>  PV Name               /dev/sdc1
>  VG Name               lvm
>  PV Size               1.23 TB / not usable 3.77 MB
>  Allocatable           yes
>  PE Size (KByte)       4096
>  Total PE              321503
>  Free PE               87135
>  Allocated PE          234368
>  PV UUID               E09ufG-Qkep-I4hB-3Zda-n7Vy-7zXZ-Nn0Lvi
>
>So, a 1.75 TB virtual disk on each server, 81% allocated.
>
>The part that really confuses me is that the two 500 GB drives seem
>to always have 10 times as many writes going on as the 1.23 TB
>drives. (Parts of the iostat output removed for readability.)
>
>mda01:root:> iostat | head -13
>Linux 2.6.18-128.1.10.el5 (mda01.adhost.com)    05/26/2011
>
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.10    0.04    2.09   19.41    0.00   78.37
>
>Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
>sda       3.21        25.87        73.52    36074352   102536956
>sda1      0.00         0.00         0.00        2566          60
>sda2      0.00         0.00         0.00        2664         448
>sda3      3.21        25.86        73.52    36066666   102536448
>sdb     336.60       205.27      5675.40   286280151  7915188839
>sdc      27.56       148.08       622.15   206517470   867684260
>sdc1     27.56       148.08       622.15   206514758   867684260
>
>mda02:root:> iostat | head -13
>Linux 2.6.18-128.1.10.el5 (mda02.adhost.com)    05/26/2011
>
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.10    0.05    1.87   12.33    0.00   85.65
>
>Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
>sda       2.84        18.23        68.26    25339938    94863896
>sda1      0.00         0.00         0.00        2920          56
>sda2      0.00         0.00         0.00        2848        1568
>sda3      2.84        18.23        68.26    25331994    94862272
>sdb     333.45       109.90      5679.15   152727845  7892497866
>sdc      29.93       124.20       588.41   172601220   817732660
>sdc1     29.93       124.20       588.41   172598796   817732660
>
>We have checked the network I/O and the NICs. There are no errors, no
>dropped packets, no overruns, etc. The NICs look perfect.
>
>We have run rkhunter and chkrootkit on both machines and they found nothing.
>
>RedHat 5.3 (2.6.18-128.1.10.el5)
>DRBD 8.3.1
>Heartbeat 2.1.4
>
>Again, any ideas about what is happening, and/or additional diagnostics
>we might run would be much appreciated.
>Thank you.
> - Richard
>
>_______________________________________________
>drbd-user mailing list
>drbd-user at lists.linbit.com
>http://lists.linbit.com/mailman/listinfo/drbd-user
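If the per-LV numbers still don't explain the writes, another blunt option
on a 2.6.18 kernel is the vm.block_dump sysctl, which logs each submitted
write together with the responsible process name to the kernel ring buffer.
The sketch below is only a suggestion and uses nothing from this thread
beyond the device name sdb; stop syslog first so its own disk writes don't
feed back into the log, keep the sampling window short, and expect the ring
buffer to wrap under heavy load.

  # Sketch only: briefly turn on block_dump to see which processes write to sdb.
  /etc/init.d/syslog stop              # avoid a logging feedback loop
  echo 1 > /proc/sys/vm/block_dump     # kernel logs "<comm>(<pid>): WRITE block N on sdbX"

  sleep 60                             # sample during a busy period

  dmesg | grep 'WRITE' | grep 'sdb' | awk '{print $1}' | sort | uniq -c | sort -rn

  echo 0 > /proc/sys/vm/block_dump     # turn logging back off
  /etc/init.d/syslog start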