Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Felix,

Well, I spoke too soon; the load went up again late mid-day (1:30 PM PST), although not as badly as the previous week. Load is still high (9 to 11) but is not causing big problems at the moment. So we are still looking for answers and solutions to this (now) intermittent problem. If anyone has any ideas or suggestions, we would appreciate hearing them. If there is any diagnostic tool that might help, please let us know that, too. Thanks.
 - Richard

At 12:20 AM 5/18/2011, Felix Frank wrote:
>On 05/18/2011 04:09 AM, Richard Stockton wrote:
> > Hi drbd folks,
> >
> > We have been using DRBD with heartbeat for several years now.
> > The load averages on our RedHat servers have almost always stayed
> > below 1.0. Suddenly last week the loads jumped up to 12-14 on
> > both servers. They go down at night, when the usage goes down,
> > but by mid-day every business day they are back to 14.
> >
> > We don't see anything out of the ordinary in logs, no drive
> > warning lights, no degraded RAID, nothing especially out of
> > bounds in iostat, NFS is running smoothly, tcpdump doesn't
> > show any nastiness (that we can see)....
> > We have run out of ideas.
> >
> > Our setup is 2 disk arrays, with each server the fail-over for
> > the other. These are POP/IMAP accounts with half the alphabet
> > on one server and half on the other. If one server fails, the
> > other one takes over all of the alphabet. Each letter is a
> > separate resource, with a separate IP, etc. The size of the
> > letter partitions ranges from 5G to 120G. The pop/imap servers
> > access the letters via NFS3.
> >
> > RedHat 5.3 (2.6.18-128.1.10.el5)
> > DRBD 8.3.1
> > Heartbeat 2.1.4
> >
> > Has anyone seen this sort of behavior before, and if so, what
> > the heck is happening? We would appreciate any suggestions.
>
>Is NFS mounted sync or async?

NFS is mounted "sync" (the NFS3 default, I believe).

>Which I/O scheduler is in use for your DRBDs (i.e., for their backing
>devices)?

cfq (also the default, I believe)

Now for the interesting part. This problem had been occurring every day for about a week, so I joined this list yesterday and posted my issue. When I came to work this morning the problem had magically subsided! Loads are now back to 2.0 or less, even during the busy times. This is still slightly higher than normal, but certainly not enough to adversely affect performance.

At this point we are assuming that some rebuild was happening in the background, and because of the somewhat large disk arrays (1.25 TB each), it just took a very long time. Either that, or the mere act of joining the list frightened the system into submission. [grin]

In any case, all appears to be good now. However, we would still love to know exactly what was happening, so we can be prepared to deal with it if it ever happens again. If anyone has a clue, please share.

Thanks again.
 - Richard
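For reference, the two settings Felix asked about (the I/O scheduler of the DRBD backing devices and the sync/async mode of the NFS mounts) can be read straight out of /sys and /proc on the servers. Below is a minimal diagnostic sketch, not anything from the original thread; the backing-device names "sda" and "sdb" are placeholders and would need to be replaced with the actual devices from drbd.conf, and it simply assumes the NFS exports appear in /proc/mounts on the clients.

#!/usr/bin/env python
"""Sketch: print the I/O scheduler of the assumed DRBD backing devices,
the options of any NFS mounts, and the current load average.
Device names are placeholders; adjust to match your drbd.conf."""

# Placeholder backing devices of the DRBD resources (assumption, not from the thread)
BACKING_DEVICES = ["sda", "sdb"]

def io_scheduler(dev):
    """Return the scheduler line for a block device,
    e.g. 'noop anticipatory deadline [cfq]' (brackets mark the active one)."""
    path = "/sys/block/%s/queue/scheduler" % dev
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        return "unknown (%s not readable)" % path

def nfs_mounts():
    """Yield (mount_point, options) for every NFS mount listed in /proc/mounts."""
    with open("/proc/mounts") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4 and fields[2].startswith("nfs"):
                yield fields[1], fields[3]

if __name__ == "__main__":
    for dev in BACKING_DEVICES:
        print("scheduler for %s: %s" % (dev, io_scheduler(dev)))
    for mount_point, options in nfs_mounts():
        # 'sync' or 'async' shows up in the options string
        print("NFS mount %s: %s" % (mount_point, options))
    with open("/proc/loadavg") as f:
        print("load average: %s" % f.read().strip())

Running this on both nodes during the busy period (and again at night) would at least confirm whether the scheduler and mount options are what everyone believes they are while the load is high.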