[DRBD-user] drbdadm verify always report oos

Fri May 20 11:29:52 CEST 2016

On Fri, May 20, 2016 at 09:48:48AM +0800, d tbsky wrote:
> 2016-05-20 3:44 GMT+08:00 Matt Kereczman <matt at linbit.com>:
> > It could very well be a sort of race condition like you mentioned earlier.
> > If the block is written on the Primary, but the verify is checksumming the
> > backing disks before the write makes it down to disk on the Secondary (block
> > is on the wire, in a buffer, etc), the verify could be coming up with false
> > positives.
> >
> > This is described in one of LINBIT's blog posts found here:
> >   https://blogs.linbit.com/p/138/trust-but-verify/
> 
> Hi:
>     the drbd resource is idle when verify/resync. the vm running above
> it is shutdown, so no one is using it. but it still can not get
> verify/resync until no other resource/volume is doing verify.
> so the race condition here is the verify process between different
> resource/volume. is that possible in theory? I need to do more
> testing. but one thing I can sure is:
> if the resource/volume verify process report 0 oos, then it won't be
> affected, it will always report 0 oos under next verify/resync.

To clarify: to my knowledge, there is NO such race in the online verify,
nor in the resync process,
because we first lock out application requests within the ranges we are
going to do verify or resync requests on.  (The mentioned blog post had
a half-sentence that could be understood otherwise, but it has been
corrected now.)
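
As a rough illustration of "lock out application requests within the
range", here is a toy Python sketch. It is not DRBD's implementation;
the class name RangeLocks, its methods, and the sector numbers are all
made up for this example.

```python
class RangeLocks:
    """Toy model: half-open sector ranges currently held by verify/resync."""

    def __init__(self):
        self._locked = []  # list of (start, end) tuples

    def lock(self, start, end):
        # Verify/resync takes the range before it reads and checksums it.
        self._locked.append((start, end))

    def unlock(self, start, end):
        # Released once the verify/resync request for that range is done.
        self._locked.remove((start, end))

    def conflicts(self, start, end):
        # An application write overlapping a locked range has to wait.
        return any(s < end and start < e for s, e in self._locked)


locks = RangeLocks()
locks.lock(0, 128)                        # verify is working on sectors 0..127
print(locks.conflicts(64, 72))            # overlapping write: must wait
print(locks.conflicts(256, 264))          # write elsewhere: may proceed
```

Because the overlapping write is held back until the range is released,
the checksums on both nodes are taken of data that is not changing
underneath them.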

Let me try to sketch the life cycle of some block on DRBD.

Assuming we have exactly two replicas,
and they start out as identical.

Also assume that we are replicating synchronously (protocol C),
and that everything operates normally: no IO errors or anything.

To change some block,
some user (application, VM, file system, benchmark, whatever)
of the DRBD would prepare some data in some buffer,
and then submit that to be written at some offset.

DRBD now passes down that buffer and offset to its local IO subsystem,
and in parallel passes it to the network stack.

Some time later, the local IO subsystem will notify the completion.
Some time later, the peer node will send an ACK that its IO subsystem
has completed the request as well.

DRBD collects both local and remote completion notifications,
which may happen in any order, and only when we know both are
written, we complete to upper layers (notify the original submitter
that the write is complete).

We would expect that after such completion,
the affected blocks of the replicas are again identical.
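
The completion rule described above can be sketched in a few lines of
Python. This is only a model of the ordering logic, assuming protocol C
and no errors; ReplicatedWrite and its method names are hypothetical and
do not correspond to anything in the DRBD source.

```python
class ReplicatedWrite:
    """Toy model of one replicated write under protocol C."""

    def __init__(self, offset, data, on_complete):
        self.offset = offset
        self.data = bytes(data)
        self.on_complete = on_complete  # called once, towards the upper layers
        self.local_done = False
        self.peer_acked = False
        self.completed = False

    def local_completion(self):
        # The local IO subsystem reports the write as done.
        self.local_done = True
        self._maybe_complete()

    def peer_ack(self):
        # The peer reports that its IO subsystem completed the write.
        self.peer_acked = True
        self._maybe_complete()

    def _maybe_complete(self):
        # Complete to the submitter only when BOTH notifications arrived,
        # regardless of the order in which they arrived.
        if self.local_done and self.peer_acked and not self.completed:
            self.completed = True
            self.on_complete(self)


finished = []
w = ReplicatedWrite(offset=4096, data=b"x" * 512, on_complete=finished.append)
w.peer_ack()               # the remote ACK may well arrive first
print(len(finished))       # still pending, local completion is missing
w.local_completion()
print(len(finished))       # now the submitter has been notified
```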

There are a few ways for blocks to end up being NON-identical
on the two replicas.

 * data corruption

   Bit flips, byte flips, other corruption, on one of the involved
   layers, hardware or software, within the network components (drivers,
   firmware, hardware, wires, active and passive network components on
   the path...), the local IO subsystem (which may again have both
   software and hardware layers), the memory bus, the RAM itself ...

   We have seen most combinations of these in real life.

 * Someone bypassing DRBD and writing directly to the backend(s)

   We have seen that a lot, especially with virtualization,
   where people set up DRBD, but somehow have their device-mapper
   or virtualization stack not use the DRBD but its backing devices.
   If DRBD cannot see the request, it cannot replicate it.

 * Network disconnects and IO errors

   But DRBD handles these, will resynchronize at the next opportunity,
   and things should be identical again after that next resync.

 * no "stable pages"

   If the content of the buffer that was originally submitted changes
   after that has been submitted, but before we notify completion, the
   local IO subsystem and the network may DMA (or otherwise use)
   different generations of that buffer to their respective destinations.

   The assumption would be that if someone "changes" such a buffer while it
   is being written, that someone would later re-submit the changes, and
   eventually the affected block would become identical again,
   because there would be no further change during the last write.

   In that case this would only be a temporary state.

   But sometimes that re-submit does not really happen,
   which may even be "legal". For example, if that was for some
   already-deleted temp file, file system optimizations could
   decide that there is no reason to do the actual write-out,
   unless there is memory pressure. Or some application manages
   its own data buffers, and uses direct-IO to submit them,
   then does further changes, without explicitly re-submitting,
   and without waiting for the write to be complete.
   Or someone else has mmap'ed the same stuff,
   and keeps modifying it, without ever explicitly syncing.
   BTW, behaviour changes with kernel versions.

   In those cases this can become a permanent state -- until the next
   stable write to the same block.
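
The "no stable pages" case is easy to reproduce in a toy model. The
Python sketch below only mimics the timing: two consumers copy the same
buffer at different moments. The function names and the 4096-byte page
are assumptions for illustration; nothing here touches a real block
device.

```python
def submit_without_stable_pages(buf, modify_in_flight):
    """Return what the local disk and the peer would each end up with."""
    local_copy = bytes(buf)    # local IO subsystem reads the buffer now
    modify_in_flight(buf)      # submitter changes the buffer while in flight
    peer_copy = bytes(buf)     # network stack reads the buffer a bit later
    return local_copy, peer_copy


def scribble(buf):
    # The submitter modifies the page before the write has completed.
    buf[0:4] = b"NEW!"


def leave_alone(buf):
    # A "stable" write: nobody touches the buffer while it is in flight.
    pass


page = bytearray(b"old data" + bytes(4088))   # one 4096-byte page

local_block, peer_block = submit_without_stable_pages(page, scribble)
print("differ after unstable write:", local_block != peer_block)

# A later stable write of the same block makes the replicas identical again.
local_block, peer_block = submit_without_stable_pages(page, leave_alone)
print("differ after stable write:", local_block != peer_block)
```

If that second, stable write never happens, the difference from the
first write simply stays on disk.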

Differing data caused by lack of stable pages is usually harmless,
because it typically affects areas which are "unreachable".
But I know of no easy-to-automate way to decide which of these caused
the data to be different just by looking at it during online-verify.
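
To make that limitation concrete, here is a minimal Python sketch of a
block-wise checksum comparison. It is not how DRBD implements online
verify; the function name, the 4096-byte block size and the choice of
SHA-1 are assumptions for this example. All such a comparison can report
is which blocks differ, not why.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this example


def out_of_sync_blocks(local, peer, block_size=BLOCK_SIZE):
    """Return the indices of blocks whose checksums differ."""
    oos = []
    for offset in range(0, len(local), block_size):
        local_sum = hashlib.sha1(local[offset:offset + block_size]).digest()
        peer_sum = hashlib.sha1(peer[offset:offset + block_size]).digest()
        if local_sum != peer_sum:
            oos.append(offset // block_size)
    return oos


local_disk = bytearray(4 * BLOCK_SIZE)
peer_disk = bytearray(4 * BLOCK_SIZE)
peer_disk[2 * BLOCK_SIZE + 17] ^= 0x01   # a single flipped bit in block 2

print(out_of_sync_blocks(bytes(local_disk), bytes(peer_disk)))
```

The result is the same list of block numbers whether the flipped bit came
from bad RAM, a write that bypassed DRBD, or a page modified in flight.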

So there.
What next?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed