[DRBD-user] read i/o path question

Tue Jul 6 01:01:05 CEST 2004

/ 2004-07-05 10:32:33 -0700
\ Alan Jones:
> Hi Philipp,
> 
> > 1. FS starts read.
> > 2. DRBD has to ship the read to the peer
> > 3. FS starts write to the same location. (Which does not make sense!)
> > 4. DRBD signals the completion of the two operations, note that DRBD
> >    will return the old data for the read request.
> > 
> > A) No FS issues such usage patterns.
> > B) Which data should be returned in this read anyway? The old, or the new
> >    version ?
> 
> If the read were issued after the write returned, the application can expect
> the new data.  I don't believe there is any requirement otherwise.  Either
> new or old should be fine; maybe even a combination.  For example, uncompleted
> writes after a crash are not guaranteed to be atomic.  The only practical use
> of concurrent I/O to the same block that I have seen is when a writer wants to
> continually update some data, for example, a logging transaction number.
> For this to work, the writer needs to serialize its calls to make_request();
> and perhaps you might be required to order the writes.

we [try to... :) ] guarantee strict write ordering on both nodes.
anyway... it does not happen, and if it does nevertheless,
it is not our fault: the result is "theoretically undefined",
and that's what that user gets back.
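
For illustration, the four-step sequence quoted above can be played through
in a toy model. This is only a sketch and not DRBD code: the Node class, the
race() helper and sector number 7 are hypothetical names made up for this
example. It assumes writes are applied in the same order on both nodes, and
shows that the read result depends only on whether the read reaches the peer
before or after the write does.

```python
# Toy model only; none of these names exist in DRBD.
class Node:
    """A node that stores blocks in a dict keyed by sector number."""

    def __init__(self):
        self.blocks = {}

    def write(self, sector, data):
        self.blocks[sector] = data

    def read(self, sector):
        return self.blocks.get(sector)


def race(read_arrives_first):
    """Issue a read and a write to the same sector without waiting in between.

    Returns (data seen by the read, final primary copy, final peer copy).
    """
    primary, peer = Node(), Node()
    for node in (primary, peer):
        node.write(7, "old")

    if read_arrives_first:
        result = peer.read(7)      # step 2: the read is served by the peer
        primary.write(7, "new")    # step 3: write to the same location
        peer.write(7, "new")       # the write is mirrored to the peer
    else:
        primary.write(7, "new")
        peer.write(7, "new")
        result = peer.read(7)      # the read arrives after the write
    return result, primary.read(7), peer.read(7)


if __name__ == "__main__":
    print(race(read_arrives_first=True))
    print(race(read_arrives_first=False))
```

In both orderings the two copies end up identical; only the data handed back
to the reader differs, which is the "theoretically undefined" part.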

> My other question relates to a power failure scenario.  Both primary and
> secondary have valid copies, but may not be consistent due to uncompleted
> writes.  After rebooting it is possible to provide service while 
> synchronizing *and* continue service should either system fail.  In doing
> so you need to delay returning reads until after the blocks read have been
> synchronized.  This way, an application that reads a block twice can expect
> the same data, even if the copy the data was read from fails in between.  This
> is a worthy goal if it is not already designed for, but not a quick fix.

this needs deep thought. some file systems dynamically relocate even
their metadata. so you cannot simply continue service on a sync target.
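
The proposal quoted above (hold back a read until the blocks it touches have
been synchronized) can be sketched as a toy model. This is an assumption-laden
illustration, not how DRBD works: ResyncingVolume, its out_of_sync set and the
choice of the local copy as the winner are all hypothetical, and the sketch
deliberately ignores the metadata relocation problem mentioned above.

```python
# Toy model only; none of these names exist in DRBD.
class ResyncingVolume:
    """Two copies of a volume plus a set of sectors that may differ."""

    def __init__(self, local, peer, out_of_sync):
        self.local = dict(local)
        self.peer = dict(peer)
        self.out_of_sync = set(out_of_sync)

    def sync_block(self, sector):
        # Assumption: the local copy wins and is pushed to the peer.
        self.peer[sector] = self.local[sector]
        self.out_of_sync.discard(sector)

    def read(self, sector):
        # The "delay": synchronize the block before returning its data,
        # so that both copies agree on whatever the reader has seen.
        if sector in self.out_of_sync:
            self.sync_block(sector)
        return self.local[sector]

    def fail_local(self):
        # Simulate losing the local copy; service continues from the peer.
        self.local = self.peer


if __name__ == "__main__":
    vol = ResyncingVolume({1: "a", 2: "x"}, {1: "a", 2: "y"}, out_of_sync={2})
    first = vol.read(2)
    vol.fail_local()
    print(first, vol.read(2))
```

A block that was read once returns the same data after the local copy fails,
because the read forced that block to be synchronized first. Blocks that were
never read may still differ between the copies.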

but we are already talking about a double failure here. maybe we are able
to cope with certain kinds of multiple failure scenarios --
preparing for "all of them" is ... complex.
and not our primary development target.

	lge