Hi Lars,

On 27/05/14 20:31, Lars Ellenberg wrote:
>> The system logs PLC-generated process data every 5 seconds, and at two
>> times of the day, at midnight and midday, it misses a sample with the
>> logging taking 6 seconds.  There's no obvious CPU spike at this time, so
>> my hunch is I/O, and so I'm looking at ways to try and improve this.
>
> Funny how if "something" happens,
> and there is DRBD anywhere near it,
> it is "obviously" DRBD's fault, naturally.

No, it's not "obviously" DRBD's fault.  It is one factor among several,
as is the CPU.  DRBD is reliant on the network and the disk, and (to a
lesser extent) on CPU time.  I'm faced with a number of symptoms, so it
is only right that I consider *all* factors, including DRBD and the I/O
subsystems that underpin it.

>> iotop didn't show any huge spikes that I'd imagine the disks would have
>> trouble with. Then again, since it's effectively polling, I could have
>> "blinked" and missed it.
>
> If your data gathering and logging thingy misses a sample
> because of the logging to disk (assuming for now that this is in fact
> what happens), you are still doing it wrong.
>
> Make the data sampling asynchronous wrt. flushing data to disk.

Sadly, how it does the logging is outside my control.  The SCADA package
is one called MacroView, which is made available for a number of
platforms under a proprietary license.  I do not have the source code;
however, it has been used successfully on quite a large number of
systems.  The product has been around since the late 80s on numerous
Unix variants.  Its methods may not be "optimal", but they seem to work
well enough in a large number of cases.

The MacroView Historian basically reads its data from shared memory
segments exported by the PLC drivers, computes whatever summary data is
needed, then writes this out to disk.  So the process is both I/O and
possibly CPU intensive.
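For what it's worth, the decoupling Lars describes (which I can't apply
to MacroView itself, being closed source) would look something like the
sketch below: a sampler thread keeps its 5-second cadence no matter how
long a flush takes, while a separate writer thread drains a queue to
disk.  The `read_sample` callable and the CSV-ish format are purely
hypothetical stand-ins.

```python
# Minimal sketch: decouple sampling from disk writes with a queue and a
# background writer thread, so a slow flush never delays the next sample.
import queue
import threading
import time

samples = queue.Queue()

def sampler(stop, read_sample, period=5.0):
    """Take a sample every `period` seconds, regardless of disk speed."""
    next_t = time.monotonic()
    while not stop.is_set():
        samples.put((time.time(), read_sample()))
        next_t += period
        # Sleep until the next scheduled sample (or until asked to stop).
        stop.wait(max(0.0, next_t - time.monotonic()))

def writer(stop, path):
    """Drain the queue and flush to disk; a stall here blocks only this
    thread, not the sampler."""
    with open(path, "a") as f:
        while not (stop.is_set() and samples.empty()):
            try:
                ts, value = samples.get(timeout=1.0)
            except queue.Empty:
                continue
            f.write(f"{ts:.1f},{value}\n")
            f.flush()
```

With this shape, a 6-second flush at midnight would simply leave one
extra sample queued in memory rather than losing it.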
I can't do much about the CPU other than fiddling with `nice`, short of
a hardware upgrade (which may yet happen; time will tell).  I don't see
the load average skyrocketing, which is why I suspected I/O: either disk
writes being bottlenecked by the gigabit network link, or perhaps by the
disk controller.

The DRBD installation there was basically configured and brought to a
working state; there was a little monkey-see-monkey-do learning in the
beginning, so it's possible that performance can be improved with a
little tweaking.  The literature suggests a number of parameters are
dependent on the hardware used, and this is what I'm looking into.  This
is one possibility I am investigating, being mindful that this is a live
production cluster I'm working on.  Thus I have to be careful what I
adjust, and how I adjust it.

>> DRBD is configured with a disk partition on a RAID array as its backing
>
> Wrong end of the system to tune in this case, imo.

Well, hardware configuration and BIOS settings are out of my reach, as
I'm in Brisbane and the servers in question are somewhere in Central
Queensland, some 1000 km away.

> This (adjusting of the "al-extents" only) is a rather boring command
> actually. It may stall IO on a very busy backend a bit,
> changes some internal "caching hash table size" (sort of),
> and continues.

Does the change of the internal "caching hash table size" do anything
destructive to the DRBD volume?

http://www.drbd.org/users-guide-8.3/re-drbdsetup.html mentions that
--create-device "In case the specified DRBD device (minor number) does
not exist yet, create it implicitly."

Unfortunately, to me "device" is ambiguous: is this the block device
file in /dev, or the actual logical DRBD device (i.e. the partition)?  I
don't want to create a new device; I just want to re-use the existing
one that's there and keep its data.

> As your server seems to be rather not-so-busy, IO wise,
> I don't think this will even be noticable.
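For reference, the al-extents change in question is just a disk-section
option in drbd.conf (DRBD 8.3 syntax); the resource name and value below
are illustrative only, not taken from the actual cluster config:

```
resource r0 {            # resource name here is hypothetical
  disk {
    # Each activity-log extent covers 4 MiB of the backing device;
    # a larger value reduces metadata updates for scattered writes.
    al-extents 3389;
  }
}
```

As I understand it, this resizes the activity log in the DRBD metadata
only; it does not touch the replicated data itself, and can be applied
to an existing resource with `drbdadm adjust r0`.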
Are there other parameters that I should be looking at?  Sync rates,
perhaps?  Once again, the literature suggests this should be higher if
the writes are small and "scattered" in nature, which, given we're
logging data from numerous sources, I'd expect to be the case.  Thus,
following the documentation's recommendations (and not being an expert
myself), I figured I'd try carefully adjusting that figure to something
more appropriate.

Regards,
-- 
Stuart Longland
Systems Engineer
T: +61 7 3535 9619   F: +61 7 3535 9699
38b Douglas Street, Milton QLD 4064
http://www.vrt.com.au
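For anyone following along, in DRBD 8.3 the sync rate lives in the
syncer section of drbd.conf.  One caveat I've since noted from the user
guide: this rate governs background *resynchronization*, not normal
replication traffic, and the guide's rule of thumb is to cap it at
roughly 30% of the available replication bandwidth.  The fragment below
is a hypothetical example for a gigabit link, not our actual config:

```
resource r0 {            # hypothetical resource name
  syncer {
    # ~30% of a gigabit link's usable bandwidth, per the 8.3 guide's
    # rule of thumb; tune to your own hardware.
    rate 33M;
  }
}
```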