Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Mon, Sep 05, 2016 at 12:09:31PM +0200, Jan Schermer wrote:
> 
> > On 05 Sep 2016, at 11:50, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> > 
> > On Mon, Sep 05, 2016 at 01:16:21AM +0800, Mia Lueng wrote:
> >> Hi All:
> >> In protocol C, a bio will return to the upper application (execute
> >> bi_endio()) when the local bio is completed and the data ack packet
> >> from the peer is received. But if a write request to block N was
> >> submitted and written to the local disk, and the data ack from the
> >> peer has not yet been received when a read request to the same
> >> block N comes in, the read request will get the data of block N
> >> that was not yet returned to the upper application.
> >> 
> >> Will this cause a logical error in the application (e.g. Oracle)?
> > 
> > If you have dependencies between IO requests,
> > you must not issue the second request
> > before the first has completed.
> > 
> > Think of a local disk only.
> > 
> > You issue a WRITE to block X.
> > Then, before that completed,
> > you issue a READ to block X.
> > (Actual, direct IO requests to the backend device,
> > not caught by some intermediate caching layer.)
> > 
> > The result of the READ is undefined.
> > It may return old data, it may return new data,
> > it may even return partially updated data.
> > 
> > Undefined.
> > 
> 
> Actually I'm not sure this is true, depending of course on what you
> mean by "before that completed" - not completed or just not flushed?

I said completed.

The only thing you (the application, or other entity using the block
device) control is the point in time when you *issue* an IO request.
You do not control when it is dispatched, or when or if it is processed.

You may eventually be asynchronously notified of its completion, which
comes with some "success/error" indicator, if you are still around and
"listen" for that completion.

If you are doing a "synchronous" write(), that would implicitly wait
for this completion for you. Obviously read() would have to wait for
its completion as well (if it even goes through to the device and is
not satisfied from some cache; which you could tell it to do by using
"direct" IO).

The relative order in time of the "issue" (kernel knows about it),
"dispatch" (low level device driver sent it to the actual backend
device), "process" (by the backend device), and "completion" may all
be different. "Persisted" (on stable, persistent storage, "flushed")
may come even later (there may be volatile device caches), and may
even require an additional, explicit "flush" request.

Any logical block device driver (device mapper, MD raid, DRBD, ...)
sits somewhere between issue and dispatch, and may have its own
"process" stage, usually involving the "issue" of one or more
"sub"-requests, sometimes only re-mapping and passing on the original
request.

There may be Linux IO schedulers between issue and dispatch; there may
be in-device IO schedulers somewhere after dispatch.

If you issue some IO request, then issue some other IO request without
waiting for the first, the IO subsystem is within its rights to
reorder them. Such is the API (simplified).

> On a local disk even a buffered write should cause subsequent reads
> to reflect the new contents; the corner case here is DIRECT_IO on
> write but not on read, which is undefined. I'd expect that to be true
> with protocol C even in a multi-node setup, but I'm not sure what
> e.g. shared filesystems expect in this case.

If you want to be sure you read back stuff you have written,
you have to make sure the write has been completed
before issuing the read.
That is true for local, replicated, shared, or any other Linux block device.
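To make the issue/completion distinction concrete, here is a minimal,
hedged sketch using POSIX AIO. It is only one way (among several) to
make "issue" and "completion" explicit, not something taken from the
thread; the device path and offset are placeholders and error handling
is reduced to the bare minimum. The only point is that the read is not
issued until the write's completion has actually been observed.

/* Sketch only: issue an asynchronous write to "block X", wait for its
 * completion, and only then read the same block back.  Issuing the
 * read before the write has completed returns undefined data.
 * Compile roughly like: cc -Wall -o ordered_io ordered_io.c -lrt
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* placeholder device; any block device or regular file works */
    int fd = open("/dev/drbd0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    static char wbuf[4096], rbuf[4096];
    memset(wbuf, 'x', sizeof(wbuf));

    struct aiocb w = { 0 };
    w.aio_fildes = fd;
    w.aio_buf    = wbuf;
    w.aio_nbytes = sizeof(wbuf);
    w.aio_offset = 0;                         /* "block X" */

    if (aio_write(&w) < 0) { perror("aio_write"); return 1; }   /* issue */

    /* Wait for the completion notification before touching block X again. */
    const struct aiocb *list[1] = { &w };
    while (aio_error(&w) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    if (aio_return(&w) != (ssize_t)sizeof(wbuf)) {
        fprintf(stderr, "write failed or was short\n");
        return 1;
    }

    /* Only now is a read of block X well defined. */
    if (pread(fd, rbuf, sizeof(rbuf), 0) != (ssize_t)sizeof(rbuf)) {
        perror("pread");
        return 1;
    }

    close(fd);
    return 0;
}

Note that completion is still not the same as "persisted": surviving a
power loss may additionally require fsync()/fdatasync() or an explicit
flush, as described above.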
So the question that should have been asked is:
how do we deal with failures in DRBD,
how do we not lose transactions after a failure?

Simplest case: crash of the active single node, no replication.
After reboot, file systems do crash recovery (journal replay and/or
fsck), and applications do crash recovery (going through their
journal/WAL/whatever, transactions being rolled back or forward).

Assume that, at crash time, we had a single IO request "in flight",
which was to persist a specific transaction commit. That means this
commit was still pending; no one has seen it as a successful
transaction yet. Depending on the exact timing, the commit record may
or may not have made it to stable storage, and crash recovery will
find out. After crash recovery, we will find that transaction either
rolled back or completed. Both outcomes would be ok.

Now, if we do this on a RAID1 with two components, we may find the
commit record on none, one, or both components. The RAID1 would be
responsible for arbitrating, doing its own recovery (rebuild, resync,
whatever you call it), and giving back consistent data.

If we think of replication with DRBD, things are not that different:
on a Primary crash, the last ("in-flight", "pending", not yet
"completed") write may or may not have made it to none, one, or both
backends. The crash recovery procedures of file system and application
will still just work the same.

For more complex (multiple) failure scenarios, you'll have to properly
configure DRBD, the cluster manager, and the overall system, and
include fencing (both on the DRBD level and the cluster manager level).

Depending on the deployment scenario, we are able to make a few
tradeoffs. At least:

 * if in doubt, would you rather be online, even if that risks data
   divergence and may lose transactions, or would you rather be offline?
 * are you willing to trade potential transaction loss against lower
   latencies ((a)synchronous replication)?

(A hedged drbd.conf sketch of those knobs follows at the end of this mail.)

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed
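As a rough illustration of where those knobs live, here is a hedged
drbd.conf-style sketch, assuming DRBD 8.4-era syntax. The resource
name, hostnames, devices and addresses are placeholders, and the
handler scripts are the ones shipped with drbd-utils for Pacemaker
integration; it shows the shape of such a configuration, not a
recommendation for any particular deployment.

resource r0 {
    net {
        # synchronous replication: a write completes only after the
        # peer's data ack (trade higher latency for no transaction
        # loss); protocol A would be asynchronous, B memory-synchronous
        protocol C;
    }

    disk {
        # if the peer is lost, freeze IO and call the fence-peer
        # handler rather than risk data divergence ("would you rather
        # be offline?"); "resource-only" is the less strict variant
        fencing resource-and-stonith;
    }

    handlers {
        # constrain/unconstrain the peer in Pacemaker around fencing
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    on alice {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.10.1:7789;
        meta-disk internal;
    }
    on bob {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.10.2:7789;
        meta-disk internal;
    }
}

Which protocol and which fencing policy are appropriate depends on the
answers to the two questions above; and none of this replaces fencing
at the cluster manager level.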