[DRBD-user] What about reading a block that has not yet been returned to the upper application in protocol C?

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 6 10:56:30 CEST 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Sep 05, 2016 at 12:09:31PM +0200, Jan Schermer wrote:
> 
> > On 05 Sep 2016, at 11:50, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> > 
> > On Mon, Sep 05, 2016 at 01:16:21AM +0800, Mia Lueng wrote:
> >> Hi All:
> >> In protocol C, a bio is completed back to the upper application
> >> (bi_endio() is executed) once the local bio has completed and the
> >> data ack packet has been received from the peer. But suppose a
> >> write request to block N has been submitted and written to the
> >> local disk, the data ack from the peer has not yet been received,
> >> and a read request to the same block N comes in. That read will
> >> return the data of block N even though the write has not yet been
> >> completed to the upper application.
> >> 
> >> Could this cause a logical error in the application (e.g. Oracle)?
> > 
> > If you have dependencies between IO requests,
> > you must not issue the second request
> > before the first has completed.
> > 
> > Think of local disk only.
> > 
> > You issue a WRITE to block X.
> > Then, before that write has completed,
> > you issue a READ to block X.
> > (Actual, direct IO requests to the backend device,
> > not caught by some intermediate caching layer.)
> > 
> > The result of the READ is undefined.
> > It may return old data, it may return new data,
> > it may even return partially updated data.
> > 
> > Undefined.
> > 
> 
> Actually I'm not sure this is true, depending of course on what you
> mean by "before that completed" - not completed or just not flushed?

I said completed.

The only thing you (the application, or other entity using the block
device) control is the point in time, when you *issue* an IO request.

You do not control when it is dispatched, or when or if it is processed.
You may eventually be asynchronously notified of its completion, which
comes with a success/error indicator (if you are still around and
listening for that completion).

If you are doing a "synchronous" write(), that would implicitly wait for
this completion for you.
Obviously read() would have to wait for its completion as well (if it
even goes through to the device, and is not satisfied from some cache;
which you could tell it to do by using "direct" IO).
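
To make that concrete, here is a minimal sketch (the /dev/drbd0 path
and the 4096-byte block size are placeholder assumptions, and error
handling is omitted): a synchronous pwrite() with direct IO returns
only once the request has completed, so the subsequent pread() is
well-defined.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/drbd0", O_RDWR | O_DIRECT);
    void *buf;

    posix_memalign(&buf, 4096, 4096); /* O_DIRECT wants aligned buffers */
    memset(buf, 0, 4096);

    /* pwrite() is synchronous: it returns only after the request has
     * *completed* (with protocol C, that includes the peer's ack). */
    pwrite(fd, buf, 4096, 0);

    /* Only now is a read of the same block well-defined. */
    pread(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return 0;
}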

The points in time of "issue" (the kernel knows about the request),
"dispatch" (the low-level device driver has sent it to the actual
backend device), "process" (by the backend device), and "completion"
may all be different.

"persisted" (on stable, persistent, storage, "flushed") may then come
even later (there may be volatile device caches), and potentially even
requires an additional, explicit "flush" request.
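
In syscall terms (a sketch, assuming an open file descriptor fd on
some block device), "completed" and "persisted" are separate events:

pwrite(fd, buf, 4096, 0); /* returns at "completion" of the request  */
fdatasync(fd);            /* explicit flush: only now is the data on
                           * stable storage, past any volatile cache  */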

Any logical block device driver (device mapper, MD raid, DRBD, ...)
sits somewhere between issue and dispatch, and may have its own
"process" stage, usually involving "issue" of one or more
"sub"-requests, sometimes only re-mapping and passing-on of the
original request.

There may be Linux IO schedulers between issue and dispatch,
there may be in-device IO schedulers somewhere after dispatch.

If you issue some IO request,
then issue some other IO request without waiting for the first,
the IO subsystem is within its rights to reorder them.

Such is the API (simplified).
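
With asynchronous IO, that responsibility is explicit. Here is a
sketch using libaio (link with -laio; the device path and sizes are
placeholder assumptions): the dependent read is submitted only after
the write's completion event has been reaped.

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/drbd0", O_RDWR | O_DIRECT);
    void *buf;
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;

    posix_memalign(&buf, 4096, 4096);
    io_setup(8, &ctx);

    /* Issue the WRITE to block 0 ... */
    io_prep_pwrite(&cb, fd, buf, 4096, 0);
    io_submit(ctx, 1, cbs);

    /* ... and wait for its completion event before issuing the READ.
     * Submitting both right away would leave the IO subsystem free
     * to reorder them, and the read result would be undefined. */
    io_getevents(ctx, 1, 1, &ev, NULL);

    io_prep_pread(&cb, fd, buf, 4096, 0);
    io_submit(ctx, 1, cbs);
    io_getevents(ctx, 1, 1, &ev, NULL);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}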

> On a local disk even a buffered write should cause subsequent reads
> to reflect the new contents; the corner case here is O_DIRECT on
> write but not on read, which is undefined. I'd expect that to be true
> with protocol C even in a multi-node setup, but I'm not sure what
> e.g. shared filesystems expect in this case.

If you want to be sure you read back stuff you have written,
you have to make sure the write has been completed before
issuing the read. That is true for local, replicated, shared,
or any other Linux block device.

So the question that should have been asked is:
how do we deal with failures in DRBD,
and how do we avoid losing transactions after a failure?

Simplest case: crash of an active single node, no replication:
after reboot, file systems do crash recovery (journal replay
and/or fsck), and applications do crash recovery (going through their
journal/WAL/whatever, transactions being rolled back or forward).

Assume that, at crash time, we had a single IO request "in flight",
which was to persist a specific transaction commit.

That means this commit was still pending;
no one has seen it as a successful transaction yet.

Depending on the exact timing, the commit record may have made it
to stable storage or not, and crash recovery will find out.

After crash recovery, we will find that transaction
either rolled back or completed,
and both outcomes are OK.
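
The invariant behind this: acknowledge a transaction only after its
commit record has completed and been flushed. A minimal sketch of
such a commit step (commit_record() and log_fd are hypothetical
names):

#include <unistd.h>

/* log_fd is assumed to be a WAL file opened with O_APPEND. */
int commit_record(int log_fd, const void *rec, size_t len)
{
    if (write(log_fd, rec, len) != (ssize_t)len)
        return -1;              /* record not written: still "pending" */
    if (fdatasync(log_fd) != 0) /* force it to stable storage */
        return -1;
    /* Only after this point may we report "committed" to anyone.
     * A crash any earlier leaves the transaction pending, and crash
     * recovery may legitimately roll it back or forward. */
    return 0;
}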

Now, if we do this on a RAID1 with two components,
we may find the commit record on none, one, or both components.
The RAID1 would be responsible for arbitrating, doing its own
recovery (rebuild, resync, whatever you call it), and handing
back consistent data.

If we think of replication with DRBD,
things are not that different: on a Primary crash, the last
("in-flight", "pending", not yet "completed") write may have made it
to none, one, or both backends.

The crash recovery procedures of the file system and the application
will still work just the same.

For more complex (multiple) failure scenarios, you'll have to properly
configure DRBD, the cluster manager, and the overall system, and
include fencing (both at the DRBD level and the cluster manager level).

Depending on the deployment scenario, we are able to make a few
tradeoffs (a config sketch follows the list below). At least:
 * if in doubt, would you rather be online, even if that risks
   data divergence and may lose transactions,
   or would you rather be offline?
 * are you willing to trade potential transaction loss for
   lower latencies (asynchronous vs. synchronous replication)?
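
For illustration only, a minimal DRBD 8.4-style resource sketch (the
resource name, hostnames, devices, and addresses are placeholders),
wiring up protocol C and DRBD-level fencing with the Pacemaker
handler scripts shipped in drbd-utils:

resource r0 {
    protocol C;          # synchronous: completion implies the peer ack;
                         # protocol A would trade possible transaction
                         # loss for lower latency
    disk {
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    on alpha {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.1:7789;
        meta-disk internal;
    }
    on bravo {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7789;
        meta-disk internal;
    }
}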

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed


