[Drbd-dev] [PATCH 2/2] expand section on throughput tuning to highlight prime usecase of external metadata

Fri Jul 8 20:37:07 CEST 2011

On 08-07-2011 16:17:17, Florian Haas wrote:

> You're adding a third item to the enumeration; so it would be nice if
> you could also rephrase the next paragraph which talks about "the 
> minimum between the two".

Will do.

> You're talking about a battery backup of a cache that is not there. 
> Does not compute. :)

So true, will fix ;)

>> DRBD metadata updates necessary to guarantee data-completeness
>> in case of failure can slow down write throughput significantly.
>> If a raw device is normally capable of 250 MB/s write throughput
>> it is not an anomaly to see writes as slow as 70 MB/s with DRBD
>> enabled (numbers are for rotational disks). This is purely
>> caused by head seeks; 4MB data updates have to be followed by
>> metadata updates and the data-writes can only continue after the
>> metadata has reached the platters (caching and write
>> reordering does not help).
> 
> I'm afraid you're missing some context here. DRBD performs the 
> synchronous meta data updates you are referring to only when an AL 
> extent goes hot or cold. It doesn't do so randomly or, as your 
> paragraph seems to imply to a casual reader, every time it has 
> written 4M of data.
> 
> And it is definitely _not_ normal to see 250MB/s write bandwidth drop
> to 70 MB/s. 110 MB/s would be entirely normal if you are replicating
> over Gigabit Ethernet, but that is determined by the bandwidth of the
> replication link, it doesn't have much to do with AL updates.

I think I should explain what I was trying to convey, or rather, my mental
image of what happened while I was benchmarking (and saw that huge
performance drop).

My backing device for DRBD is a software raid-0 (two disks), with
'meta-disk internal'. Benchmarking was done by dd'ing a few gigs from
/dev/zero. All this dd-writing makes a lot of new extents hot (one for
every 4MB written?), which has to be remembered in the metadata, with
synchronous writes. Since my backing device is raid-0 and the default
chunk size for that is rather large these days, the (small) metadata
updates aren't spread over the raid-0 disks but are concentrated on one
device, which becomes the bottleneck for the benchmark because it has to
seek all the time.
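
As a rough sketch of the chunk argument above: assuming a 2-disk raid-0
with 512 KiB chunks and a metadata area starting at an arbitrary offset
(all illustrative values, not measurements from this setup), a run of
small 4 KiB writes maps onto a single member disk:

```shell
#!/bin/sh
# Hypothetical layout: 2-disk raid-0 with 512 KiB chunks (assumed values).
CHUNK_KIB=512
NDISKS=2

# Print the index of the member disk that holds a given array offset (KiB).
member_disk() {
    echo $(( ($1 / CHUNK_KIB) % NDISKS ))
}

# Sixteen 4 KiB writes covering a 64 KiB region at an assumed offset.
base_kib=1048576   # assumed start of the metadata area (1 GiB into the array)
i=0
while [ "$i" -lt 16 ]; do
    off=$(( base_kib + i * 4 ))
    echo "offset ${off} KiB -> disk $(member_disk "$off")"
    i=$(( i + 1 ))
done
```

Every offset in the loop falls inside the same 512 KiB chunk, so all of
the small writes hit one disk while the other sits idle for that I/O.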

This is not a cause for concern when you have a hardware battery-backed
cache, as the raid-controller can then delay writing the metadata, but I
don't have that.

I've blktrace-d, blkparse-d and seekwatcher-ed the hell out of this and
the images show exactly that happening, so I dared to write it up like this
without having read the source ;). Lots of linear writes, regularly
interrupted by a seek to synchronously write the metadata.
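
For anyone wanting to reproduce that kind of picture, a minimal sketch of
the tracing steps follows. The device name /dev/md0, the 30 second
duration and the file names are placeholders chosen for illustration, not
the values used here; run it as root while the benchmark is writing.

```shell
#!/bin/sh
# Sketch only: device, duration and file names are assumed placeholders.
trace_seeks() {
    dev=${1:-/dev/md0}     # backing device to trace (assumption)
    secs=${2:-30}          # how long to trace, in seconds (assumption)

    # Record block-layer events for the device into trace.blktrace.* files.
    blktrace -d "$dev" -o trace -w "$secs"

    # Turn the binary trace into a readable event listing.
    blkparse -i trace -o trace.txt

    # Plot offsets and seeks over time from the same trace.
    seekwatcher -t trace -o seeks.png
}
```

Calling `trace_seeks /dev/md0 30` during a dd run should give a graph in
which linear writes and periodic jumps to another offset can be told apart.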

The slowdown wasn't caused by the interconnection between primary and
secondary, the 70MB/s was measured both in StandAlone and UpToDate (I
bonded 3 GE interfaces for nice syncing bandwidth).

And it was pure benchmarking, no other things happening on the server so
I'd expect that only the benchmark made extents hot.

Of course I do not know the exact criteria that mark extents hot. If
what I described above is not an accurate description of what happens,
please correct me.

But the reason I think this should be in the docs is that I reckon that
lots of people would like to run raid-0 + "network raid-1" with relatively
cheap hardware, do the simplest of benchmarks and get confused by the
slowdown. Googling this, I saw the subject come up on the mailing list
a couple of times.

> And what you mean by "caching and write reordering does not help" I 
> don't understand at all, can you elaborate please?

The synchronous (barrier?) writes for the metadata, as far as I
understand it from a mailing list post by Lars, *must* have reached the
platters before the linear dd-writing can continue. So no enabling of
write caches, NCQ or tuning of elevators is going to help.

However, if you think that the paragraph now implies that *every* write
randomly makes extents hot then I should do some polishing ;)

>> +[[s-tune-external-metadata]]

[...]

> This section would be ok, but it's still missing the steps to dump 
> the existing metadata and restore it onto the new metadata device. 
> Can you add that and repost the patch please?

Ah, I hadn't thought of that scenario (am using a raid-1 for the
metadata). Is this along the lines of:

drbdadm down [resource]
drbdadm dump-md [resource] > savefile
[change meta-disk]
drbdmeta /dev/drbdX v08 [metadevice] 0 restore-md savefile

?

Is the index 0 correct usage when using flexible-meta-disk?

Maarten.