[Drbd-dev] Handling on-disk caches
philipp.reisner at linbit.com
Mon Nov 12 13:39:59 CET 2007
On Wednesday 07 November 2007 04:54:02 Graham, Simon wrote:
> A few months ago, we had a discussion about how to handle systems with
> on-disk caches enabled in the face of failures which can cause the cache
> to be lost after disk writes are completed back to DRBD. At the time,
> the suggestion was to rely on the Linux barrier implementation which is
> used by the file systems to ensure correct behavior in the face of disk
> failures.
>
> I've now had time to get back to this and review the Linux barrier
> implementation and it's become clear to me that the barrier
> implementation is insufficient -- imagine the case where a write is
> being done, it completes on the secondary (but is still in disk cache
> there), then we power off this node -- NO errors are reported to Linux
> on the primary (because the other half of the raid set is still there,
> the original IO completes successfully BUT we have a difference between
> the two sides).
>
> So a failure of the secondary is NOT reflected back to Linux and
> therefore we can get out of sync in a way that does not track the blocks
> that need to be resynced independent of the use of barriers.
> Consider the following sequence of writes:
>    A  B  C  [barrier]  D  E
>
> If we've processed A through C and the writes have completed on both
> primary and secondary but the data is sitting in the disk cache and then
> the secondary is powered off, the following occurs:
>
> 1. The primary doesn't return any error to Linux
> 2. The primary goes ahead and processes the [barrier] (which flushes
> A-C to disk), then performs D and E and includes the blocks covered
> by these in the DRBD bitmap.
> 3. Now the secondary comes back -- we ONLY resync D and E even though
> A-C never made it to disk (because we didn't execute the [barrier] on
> the secondary)
Right. So far I completely agree.
> I think the solution to this consists of a number of changes:
> 1. As suggested previously, DRBD should respect barriers on the
> secondary (by passing the appropriate flags to the secondary) -- this
> will handle unexpected failure of the secondary.
Right. We should do that.
I think that we already do that for the BIO_RW_BARRIER and the BIO_RW_SYNC
flags.
> 2. Meta-data updates (certainly the AL but possibly all meta-data
> updates) should be issued as barrier requests (so that we know these
> are on disk before issuing the associated writes) (I don't think they
> are currently)
Right. We currently do not use BIO_RW_BARRIER here, but we should do so.
> 3. DRBD should include the area addressed by the AL when recovering from
> an unexpected
> secondary failure. There are two approaches for this:
> a) Maintain the AL on both sides - when the secondary restarts, add
> the AL to the set of blocks needing to be resynced as is done on the
> primary
> b) Add the current AL to the bitmap on the primary when it loses
> contact with the secondary.
> The second is probably easier and is, I think, just as effective --
> even if the primary fails as well (so we lose the in-memory bitmap),
> when it comes back it WILL add the on-disk AL to the bitmap and we
> won't resync until it comes back...
For item 3 I have a different opinion.
On the primary we have a data structure called the "transfer log" or tl
in the code. Up to now this was mainly important for protocol A and B.
It is a data structure containing objects for all our self-generated
drbd-barriers on the fly, and objects for all write requests between
them.
If we lose connection in protocol A or B we need to mark everything
we find in the transfer_log as out-of-sync in the bitmap.
When we also do this for protocol C, _AND_ use BIO_RW_BARRIER for
the writes on the secondary, we have solved the issue you
described in the first part of the mail.
I took this as occasion to write down what we are currently up
to in development of DRBD-8.2.
1 Online Verify.
Release the online-verify code from drbd-plus to drbd-8.2, creating a new
protocol version by the way.
2 Hot cache.
Finish the hot cache feature. With this feature enabled DRBD updates the
correct block caches (page cache) on the secondary node, as data gets
written and read on the primary.
Rationale: On Database machines with huge amounts of RAM, the database
can only deliver reasonable performance if Linux's disk caches are hot.
With a conventional DRBD cluster for such a database, the performance
of the database after a switchover is insufficient, since the caches
on the secondary machine are cold.
3 write quorum of 2.
There are users that want to use DRBD to mirror data but do not want
it to continue in case the connection to the secondary is lost. Such
a system is not an HA-system but an always redundant system. It should
freeze IO in case the connection to the secondary is lost, or the
local disk gets detached, and thaw IO as soon as both paths
are available again.
4 Configurable write quorum weights.
For OCFS2/GFS users it even makes sense to have configurable weights
for the write quorum, so that one can set up a cluster in which node
A continues to run but node B freezes its IO when the brain splits.
I should mention that number 1 and 2 are already in the works and
will soon appear in DRBD-8.2.
Now I added to the list:
5 Use the kernel's write barriers
As the support for write barriers is now available (this holds
true for the hardware as well as for the Linux kernel) we should make
use of it:
* Use BIO_RW_BARRIER writes for updates to our meta-data-superblock.
* Use BIO_RW_BARRIER for writes to the AL.
* Implement the algorithm described in section 6 of
* Delay setting of RQ_NET_DONE in the request objects until the right
BarrierAck comes in, also for protocol C.
Does this make sense?
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :