[Drbd-dev] [PATCH 2/2] expand section on throughput tuning to highlight prime usecase of external metadata
mrten+drbd at ii.nl
Fri Jul 8 20:37:07 CEST 2011
On 08-07-2011 16:17:17, Florian Haas wrote:
> You're adding a third item to the enumeration; so it would be nice if
> you could also rephrase the next paragraph which talks about "the
> minimum between the two".
> You're talking about a battery backup of a cache that is not there.
> Does not compute. :)
So true, will fix ;)
>> DRBD metadata updates necessary to guarantee data-completeness
>> in case of failure can slow down write throughput significantly.
>> If a raw device is normally capable of 250 MB/s write throughput
>> it is not an anomaly to see writes as slow as 70 MB/s with DRBD
>> enabled (numbers are for rotational disks). This is purely
>> caused by head seeks; 4MB data updates have to be followed by
>> metadata updates and the data-writes can only continue after the
>> metadata has been reached the platters (caching and write
>> reordering does not help).
> I'm afraid you're missing some context here. DRBD performs the
> synchronous meta data updates you are referring to only when an AL
> extent goes hot or cold. It doesn't do so randomly or, as your
> paragraph seems to imply to a casual reader, every time it has
> written 4M of data.
> And it is definitely _not_ normal to see 250MB/s write bandwidth drop
> to 70 MB/s. 110 MB/s would be entirely normal if you are replicating
> over Gigabit Ethernet, but that is determined by the bandwidth of the
> replication link, it doesn't have much to do with AL updates.
I think I should explain what I was trying to convey, or rather, my mental
image of what happened while I was benchmarking (and saw that huge
slowdown).
My backing device for DRBD is a software raid-0 (two disks), with
'meta-disk internal'. Benchmarking was done by dd'ing a few gigs from
/dev/zero. All this dd-writing makes a lot of new extents hot (one for
every 4MB written?), which has to be remembered in the metadata, with
synchronous writes. Since my backing device is raid-0 and the default
chunk size for that is rather large these days, the (small) metadata
updates aren't spread over the raid-0 disks but are concentrated on one
device, which becomes the bottleneck for the benchmark because it has to
seek all the time.
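The striping argument above can be sketched with a bit of shell arithmetic. This is only an illustration: the 512 KiB chunk size, the two-member layout and the offsets are assumed values, not numbers taken from the setup described here, and `raid0_member` is a hypothetical helper.

```shell
#!/bin/sh
# Which RAID-0 member disk does a given byte offset land on?
# Assumed values for illustration only: 512 KiB chunks, 2 member disks.
CHUNK_BYTES=$((512 * 1024))
NDISKS=2

raid0_member() {
    # $1 = byte offset on the array -> index of the member disk holding it
    echo $(( ($1 / CHUNK_BYTES) % NDISKS ))
}

# Two small writes 4 KiB apart inside one chunk map to the same member,
# so small updates confined to one chunk are not spread over both disks.
raid0_member $((10 * CHUNK_BYTES))
raid0_member $((10 * CHUNK_BYTES + 4096))

# A 4 MiB linear write covers 8 consecutive chunks and therefore
# alternates between both members.
raid0_member $((11 * CHUNK_BYTES))
```

With a larger chunk size, more of the small metadata writes fall inside a single chunk, which is why they would end up concentrated on one disk.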
This is not a cause for concern when you have a hardware battery-backed
cache, as the raid-controller can then delay writing the metadata, but I
don't have that.
I've blktrace-d, blkparse-d and seekwatcher-ed the hell out of this and
the images show exactly that happening, so I dared to write it up like this
without having read the source ;). Lots of linear writes, regularly
interrupted by a seek to synchronously write the metadata.
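As a rough sanity check on those numbers, the stall implied per stretch of linear writing can be computed from the two throughput figures. The sketch below assumes exactly one synchronous metadata update per 4 MB written, which is unverified, and `stall_us_per_extent` is a hypothetical helper name.

```shell
#!/bin/sh
# Implied time lost per block of data, from raw vs. observed throughput.
# Assumption (unverified): one synchronous metadata update per 4 MB written.
stall_us_per_extent() {
    # $1 = raw MB/s, $2 = observed MB/s, $3 = MB written between updates
    raw_us=$(( $3 * 1000000 / $1 ))    # time to write the data at raw speed
    seen_us=$(( $3 * 1000000 / $2 ))   # time actually taken per block
    echo $(( seen_us - raw_us ))       # difference, in microseconds
}

# 250 MB/s raw, 70 MB/s observed, 4 MB between metadata updates
stall_us_per_extent 250 70 4
```

Under that assumption the difference comes to roughly 41 ms per 4 MB; if metadata updates are less frequent than that, the implied cost per update is correspondingly higher.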
The slowdown wasn't caused by the interconnection between primary and
secondary, the 70MB/s was measured both in StandAlone and UpToDate (I
bonded 3 GE interfaces for nice syncing bandwidth).
And it was pure benchmarking, no other things happening on the server so
I'd expect that only the benchmark made extents hot.
I of course do not know the exact criteria that mark extents hot; if
what I described above is not an accurate description of what happens,
please correct me.
But the reason I think this should be in the docs is that I reckon that
lots of people would like to do "network raid-1" with relatively cheap
hardware, run the simplest of benchmarks and get confused by the
slowdown. Googling this, I saw the subject come up on the mailing list
a couple of times.
> And what you mean by "caching and write reordering does not help" I
> don't understand at all, can you elaborate please?
The synchronous (barrier?) writes for the metadata, as far as I
understand it from a mailing list post from Lars, *must* have reached the
platters before the linear dd-writing can continue. So no enabling of
write caches, NCQ or tuning of elevators is going to help.
However, if you think that the paragraph now implies that *every* write
randomly makes extents hot then I should do some polishing ;)
> This section would be ok, but it's still missing the steps to dump
> the existing metadata and restore it onto the new metadata device.
> Can you add that and repost the patch please?
Ah, I hadn't thought of that scenario (am using a raid-1 for the
metadata). Is this along the lines of:
drbdadm down [resource]
drbdadm dump-md [resource] > savefile
drbdmeta /dev/drbdX v08 [metadevice] 0 restore-md savefile
Is the index 0 correct usage when using flexible-meta-disk?