[DRBD-user] Adjusting al-extents on-the-fly

Stuart Longland stuartl at vrt.com.au
Fri May 30 04:00:08 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On 30/05/14 11:13, Lars Ellenberg wrote:
> On Wed, May 28, 2014 at 01:23:55PM +1000, Stuart Longland wrote:
>>>> iotop didn't show any huge spikes that I'd imagine the disks would have
>>>> trouble with.  Then again, since it's effectively polling, I could have
>>>> "blinked" and missed it.
>>>
>>> Make the data sampling asynchronous wrt. flushing data to disk.
>>
>> Sadly how it does the logging is outside my control.  The SCADA package
>> is one called MacroView, and is made available for a number of platforms
>> under a proprietary license.  I do not have the source code, however it
>> has been used successfully on quite a large number of systems.
> 
> Well, IO subsystems may have occasional latency spikes.
> DRBD may trigger, be responsible for, or even cause
> additional latency spikes.
> 
> IF your scada would "catch one sample then synchronously log it",
> particularly high latency spikes might cause it to miss the next sample.
> 
> I find that highly unlikely.
> Both that sampling and logging would be so tightly coupled,
> and that the latency spike would take that long (if nothing else is
> going on, and the system is not completely overloaded;
> with really loaded systems, arbitrary queue length and buffer bloat,
> I can easily make the latency spike for minutes).
> 
> As this is "pro" stuff, I think it is safe to assume
> that gathering data, and logging that data, is not so tightly coupled.
> Which leads me to believe that it missing a sample
> has nothing to do with persisting the previous sample(s) to disk.
> 
> Especially if it happens so regularly twice a day noon and midnight.
> What is so "special" about those times?
> flushing logs?  log rotation?

This is the bit we haven't worked out yet.  Yesterday I finally decided
to enable logging on the historian: something I had been avoiding, as I
wasn't sure how close to the limits we were sailing and didn't want the
extra logging load to cause further missed samples.

Turns out I needn't have worried, and it highlighted a configuration
issue in the historian: it was attempting to archive a large number of
files on the DRBD volume to a non-existent directory.  Having fixed
this, I note there hasn't been a system alarm saying it missed a
sample, as there had been for the last week or so.

I'll wait a little longer before I declare that specific problem fixed.

It still raises the question of whether we're running at the limits of
the present configuration, and whether a small amount of fine-tuning
might help.  I agree that this would have been better done back in the
office before commissioning, rather than waiting for problems 2.5 years
later.

Hindsight is 20:20. :-)

> You wrote "with the logging taking 6 seconds".
> What exactly does that mean?
> "the logging"?
> "taking 6 seconds"?
> what exactly takes six seconds?
> how do you know?

The exact log message that gets reported is this:
> History processing time :6: exceeded base time :5:.

In other words, processing the log history (which I assume includes
collection and writing to disk) took 6 seconds, exceeding the base time
of 5 seconds.
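
To see how regular these events really are, I can pull the timestamps
out of the historian's log, something along these lines (the log path
below is just a placeholder, and I'm assuming a syslog-style
"Mon DD HH:MM:SS" prefix on each line):

    # list the date/time of every "exceeded base time" event
    grep 'exceeded base time' /var/log/historian.log \
        | awk '{ print $1, $2, $3 }'

    # bucket them by hour, to confirm they really cluster at 00:00 and 12:00
    grep 'exceeded base time' /var/log/historian.log \
        | awk '{ split($3, t, ":"); print t[1] }' | sort | uniq -c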

> Are some clocks slightly off
> and get adjusted twice a day?

Both nodes run ntpd, synchronised to each other and to the public NTP
server pool.
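
For reference, this is roughly how I'd check that the clocks aren't
being stepped around those times (ntpq ships with the ntp package):

    # peer offsets and jitter, in milliseconds
    ntpq -pn
    # any step/reset events that ntpd logged
    grep -i ntpd /var/log/syslog | grep -iE 'step|reset'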

>>> This (adjusting of the "al-extents" only) is a rather boring command
>>> actually.  It may stall IO on a very busy backend a bit,
>>> changes some internal "caching hash table size" (sort of),
>>> and continues.
>>
>> Does the change of the internal 'caching hash table size' do anything
>> destructive to the DRBD volume?
> 
> No.  Really.
> Why would we do something destructive to your data
> because you change some synchronisation parameter.
> And I even just wrote it was "boring, at most briefly stalls then
> continues IO".  I did not write
> it-will-reformat-and-panic-the-box-be-careful-dont-use.
> 
> But unless your typical working set size is much larger than what
> the current setting covered, this is unlikely to help.
> (257 al-extents correspond to ~ 1GByte working set)
> If it is not about the size, but the change rate, of your working set,
> you will need to upgrade to drbd 8.4.
> 
>> http://www.drbd.org/users-guide-8.3/re-drbdsetup.html mentions that
>> --create-device "In case the specified DRBD device (minor number) does
>> not exist yet, create it implicitly."
>>
>> Unfortunately, to me "device" is ambiguous: is this the block device
>> file in /dev, or the actual logical DRBD device (i.e. the partition)?
> 
> So what. "In case .* does not exist yet".
> Well, it does exist.
> So that's a no-op, right?
> 
> Anyways.  That flag is passed from drbdadm to drbdsetup *always*
> (in your drbd version).
> And it does no harm. Not even to your data.
> It's an internal convenience flag.

Ahh okay.  I apologise if I seem to be asking stupid questions or being
overly cautious.  I admit up front that I'm no expert with HA and DRBD,
and a customer's production SCADA system is not the best place to learn.

Time is also not on my side when doing technical support, so setting up
a test system and experimenting wasn't really doable.  Having gotten the
immediate problem out of the way, though, I'll look into procuring some
hardware and experimenting with DRBD.

>> Are there other parameters that I should be looking at?
> 
> If this is about DRBD tuning,
> well, yes, there are many things to consider.
> If there were just one optimal set of values,
> those would be hardcoded, and not tunables.

Indeed.  I read that many of them are workload-dependent as well as
hardware-dependent.  What I'm wondering is: are there metrics we can
gather, from /proc/drbd or elsewhere, which might give us clues as to
which of these tunables (if any) would help?

/proc/drbd currently shows:
> version: 8.3.7 (api:88/proto:86-91)
> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at node1, 2012-10-24 13:41:01
>  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>     ns:41050284 nr:751228 dw:63284196 dr:6638234 al:16176 bm:7166 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

The two things that stuck out when reading the tuning section of the
manual were al-extents and the TCP send buffer size.  I notice 'al' in
/proc/drbd is up over 16000; I'm not sure whether it is cumulative like
the network counters (ns, nr), nor when these counters get reset.
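
If I've read the docs correctly, each AL extent covers 4MiB of the
backing device, which is where the figure Lars quoted above comes from
(257 x 4MiB is roughly 1GByte of working set).  Assuming 8.3 really
does let this be changed on a live resource, I gather the procedure is
roughly the following; 'r0' and the value 3389 are just placeholders,
not recommendations:

    # in drbd.conf, inside the resource's syncer {} section:
    #     al-extents 3389;    # 3389 x 4MiB ~= 13GiB of working set
    # then push the new syncer settings to the kernel, no restart needed:
    drbdadm syncer r0
    # or call drbdsetup on the device directly:
    drbdsetup /dev/drbd0 syncer --al-extents=3389

(The TCP send buffer, by contrast, looks to be the sndbuf-size option
in the net {} section.)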

Is there any possibility of DRBD showing peak rates, calculated over,
say, the last minute or so, rather than having to poll /proc/drbd for
that information?
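
In the meantime I can approximate it from userspace; a rough sketch
(assuming a single resource on minor 0, and that the ns/dw counters
are cumulative KiB values) would be:

    #!/bin/sh
    # Print per-second deltas of DRBD's cumulative counters:
    #   ns = KiB sent to the peer, dw = KiB written to the local disk.
    # The first line of output is the running total, not a rate.
    prev_ns=0; prev_dw=0
    while sleep 1; do
        set -- $(awk -F'[: ]+' '/ ns:/ {
                     for (i = 1; i <= NF; i++) {
                         if ($i == "ns") ns = $(i + 1)
                         if ($i == "dw") dw = $(i + 1)
                     }
                     print ns, dw
                 }' /proc/drbd)
        ns=$1; dw=$2
        printf '%s  net-send: %d KiB/s  disk-write: %d KiB/s\n' \
               "$(date +%H:%M:%S)" "$((ns - prev_ns))" "$((dw - prev_dw))"
        prev_ns=$ns; prev_dw=$dw
    done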

>> Sync-rates perhaps?
> 
> Did you have resync going on during your "interesting" times?
> If not, why bother, at this time, for this issue.
> If yes, why would you always resync at noon and midnight?
> 
>> Once again, the literature suggests this should be higher if the writes
>> are small and "scattered" in nature, which given we're logging data from
>> numerous sources, I'd expect to be the case.
> 
> Sync rate is not relevant at all here.
> Those parameters control the background resynchronization
> after connection loss and re-establishment.
> As I understand, your DRBD is healthy, connected,
> and happily replicating. No resync.

Ahh okay, how often does DRBD send updates to the secondary host?  I
figured since we're using the synchronous protocol (C) that it'd be
pretty much on every write.

It was on this basis that I had a hunch about I/O and latency, since I
knew there was a shared gigabit link involved.

What I did not know at the time was the nature of the I/O: whether it
was reads or writes.  Evidently, in this case it was mostly reads
(reading back data logs), which in all probability wouldn't be affected
by DRBD.

I'd imagine that if we did a large number of scattered writes all of a
sudden, the figures could look very different.
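
Next time I'll watch the read/write split as it happens; the dr (disk
read) and dw (disk write) counters in /proc/drbd separate the two, as
does something like:

    # per-device read/write throughput in kB/s, refreshed every second
    iostat -xk 1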

>> Thus following the documentation's recommendations (and not being an
>> expert myself) I figured I'd try carefully adjusting that figure to
>> something more appropriate.
> 
> Sure, careful is good.
> Test system is even better ;-)

Yep, as I say, this is something I'll look into.  There are two
distributed storage systems we're using now: Ceph and DRBD.  Ceph is
good for some things, but for SCADA, a pair of servers with
Heartbeat+DRBD seems to me like the better tool for the job.

> If you really want to improve on random write latency with DRBD,
> you need to upgrade to 8.4. (8.4.5 will be released within days).
> 
> I guess that upgrade is too scary for such a system?

Well, we're looking into an update.  Ubuntu 10.04 comes out of support
in a year or two, and this is one of the few sites that still runs that
OS (most are at 12.04 now).  So we'd do the DRBD upgrade at the same time.

Naturally, there have been lessons learned with this installation that
we can now apply.

> Also, you could use auditctl to find out in detail what is happening
> on your system. You likely want to play with that on a test system first
> as well, until you get the event filters right,
> or you could end up spamming your production system's logs.

Ahh, brilliant.  I'll do some research and see what can be uncovered.
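
From a first read of the man page, I gather the idea is something like
the following (the path is just an example, and I'd trial the filters
on a test box first, as you suggest):

    # watch the historian's data directory for writes and attribute
    # changes, tagging the events with a key for easy retrieval
    auditctl -w /data/historian -p wa -k historian-io
    # later, review what touched those files and when
    ausearch -k historian-io
    # and drop the watch again when done
    auditctl -W /data/historian -p wa -k historian-io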

In any case, thank you very much for your time.
-- 
Stuart Longland
Systems Engineer
     _ ___
\  /|_) |                           T: +61 7 3535 9619
 \/ | \ |     38b Douglas Street    F: +61 7 3535 9699
   SYSTEMS    Milton QLD 4064       http://www.vrt.com.au




