Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Fri, Jun 24, 2011 at 02:27:54PM +0100, Phil Stoneman wrote:
> On 23/06/11 10:49, Phil Stoneman wrote:
> > I need a bit of help understanding an issue we're seeing with our DRBD
> > setup. We have two Debian systems with DRBD 8.3.8. We have a resource
> > configured with protocol A, which backs onto a single SATA disk on each
> > side.
> >
> > I am running the latency test from the DRBD documentation, multiple times:
> >
> >     while true; do dd if=/dev/zero of=$TEST_DEVICE bs=512 count=1000 oflag=direct; sync; done
> >
> > If I stop DRBD and run it against the backing device on each side, it
> > repeats again and again very quickly. If I start DRBD and run it against
> > the DRBD device on the primary side, it runs quickly for about 4 or 5
> > repetitions, then slows right down. Investigation shows pe: climbing
> > in /proc/drbd, and iostat -mxd 1 on the secondary node shows 100% disk
> > usage for the backing device. Note that this repeats in the other
> > direction if I swap primary/secondary roles. It's only the secondary
> > role that's seeing 100% disk usage, not the primary.
> >
> > When I use no-disk-barrier and no-disk-flushes, the problem goes away
> > entirely - but I'm reluctant to enable this permanently, as they're just
> > normal SATA drives without any battery backup or anything, and there are
> > scary warnings about doing that in the documentation :-)
>
> After a bunch more testing, it looks like DRBD on the secondary side
> only is not using (or is regularly flushing) the write cache of the
> underlying storage. It's not doing this on the primary side, and I
> can actually see a reference to this in the manual: "DRBD uses
> disk flushes for write operations both to its replicated data set
> and to its meta data."
>
> I must be honest, I don't completely understand the rationale behind
> utilising the write cache on the primary side but not utilising it
> on the secondary side - it really hurts performance in some use
> cases!
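For reference, in DRBD 8.3 those options live in the disk section of drbd.conf. A minimal sketch (the resource name is made up; only disable barriers/flushes if the write cache is battery-backed or otherwise non-volatile):

    resource r0 {
        disk {
            # Only safe with a non-volatile (e.g. battery-backed) write cache:
            no-disk-barrier;
            no-disk-flushes;
        }
        # ... other resource settings ...
    }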
> Still, now that I know what's going on, I'm a little more
> comfortable using no-disk-barrier and no-disk-flushes. It means that
> I might lose data written into the drive's write cache, but that's
> no worse a situation than a normal system using the hard drives
> natively.
>
> I'm still interested to hear the reason behind why drbd works that
> way though...

A WRITE request arrives on the Primary for some area not yet marked as
"hot" in the activity log.

We write an activity log transaction, which marks this area as "hot",
synchronously, with "BARRIER" or "FLUSH/FUA" if available.

We send the original write over to the peer, and submit the original
write to the local disk.

With protocol A, once the local disk completes the write, we signal
completion to the submitter (upper layers). This completion to upper
layers also closes the current "epoch", or "reorder domain", because
the next request submitted may have been waiting for this one to
complete (write-after-write dependency).

Because we closed that epoch, we send over a "DRBD barrier", to notify
the peer of this reorder-domain boundary. If the peer receives such a
"DRBD barrier" notification, it syncs and flushes its disk, and once
that completes, acknowledges ("DRBD barrier ack") this to the Primary.
This is done for two reasons: first, to enforce consistent
write-after-write dependency behaviour; the second reason is outlined
below.

The next request, potentially again to some not yet "hot" area, again
triggers an activity log transaction. To mark one area hot, we
typically need to mark the least recently used area as "cold" again,
and so on. After recovery from a Primary crash, we resync at least all
areas that have been marked hot in the log, as well as any blocks
explicitly marked as "changed".

So much for context. If we mark things "cold" just because there are
no outstanding completions to upper layers, we may forget to sync
these areas after a Primary crash (after all, the corresponding
requests may not have reached the peer, or the peer's disk, yet).
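The write path above can be sketched as a toy model. All names here are made up for illustration; this is not DRBD's actual implementation, just the bookkeeping it describes: a synchronous activity-log flush when an extent first becomes hot, fire-and-forget replication under protocol A, and one "DRBD barrier" (hence one secondary flush) per closed epoch.

```python
# Toy model of the epoch / "DRBD barrier" bookkeeping described above.
# Illustrative only -- not DRBD's actual code.

class Secondary:
    def __init__(self):
        self.writes = 0    # replicated writes received
        self.flushes = 0   # disk flushes (one per "DRBD barrier")

    def receive_write(self):
        self.writes += 1   # lands in the backing device's (volatile) cache

    def receive_barrier(self):
        # Reorder-domain boundary: sync/flush the disk, then barrier-ack.
        self.flushes += 1


class Primary:
    def __init__(self, peer):
        self.peer = peer
        self.hot_extents = set()   # activity-log areas marked "hot"
        self.al_flushes = 0        # synchronous AL transactions (FLUSH/FUA)

    def write(self, extent):
        if extent not in self.hot_extents:
            self.al_flushes += 1      # synchronous AL transaction, flushed
            self.hot_extents.add(extent)
        self.peer.receive_write()     # protocol A: send, don't wait
        # Local disk completion -> completion to upper layers, which closes
        # the current epoch, so the peer is told about the boundary:
        self.peer.receive_barrier()


# Single-threaded dd: each request completes before the next is issued,
# so every write closes its own epoch -> one secondary flush per write.
sec = Secondary()
pri = Primary(sec)
for _ in range(1000):
    pri.write(extent=0)

print(pri.al_flushes, sec.writes, sec.flushes)   # 1 1000 1000
```

Note how the model reproduces both observations in the thread: the Primary flushes only once (the first activity-log transaction, since all writes hit the same hot extent), while the Secondary flushes once per request, which is exactly why the dd loop hammers the secondary's disk.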
Depending on what DRBD protocol was in use, where requests were at the
time of the crash (still in local buffers?), and a possibly
simultaneous crash of both nodes, on the next resync DRBD would assume
areas to be clean and identical on both nodes which potentially are
not. That may even go unnoticed for quite a while, but cause ill
effects on some subsequent switchover. So we do not mark things cold
until we have received the "DRBD barrier ack" for any corresponding
request that may still have been in flight to the remote disk.

So if you tell DRBD not to flush on such reorder-domain boundaries,
and you have volatile caches involved, you not only lose data that had
been in the volatile caches, you may cause the DRBD replicas to
diverge, silently, after any Primary crash event.

Note that we do not need to flush more often on the Primary, because
we flush on bitmap/activity log updates, there will be no writes in
flight to areas not marked hot in the activity log, and we resync all
those "hot" areas after a Primary crash anyway. Potentially, we could
do fewer flushes on the Secondary as well, but probably at the cost of
additional latency for any request that involves activity log
transactions.

Also note that a single-threaded dd benchmark is the worst possible
load: as you get single requests, and each will complete before the
next one is issued, you get each request in its own "reorder domain",
all separated by a "DRBD barrier", and thus each flushed through the
cache on the secondary individually.

Bottom line: get a good controller, avoid volatile caches. DRBD is not
there to help you get away with crappy hardware, but to help you, in
case things break, despite decent hardware.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed