[DRBD-user] DRBD performance (only uses 40% of a GE link)

Lars Ellenberg lars.ellenberg at linbit.com
Mon Mar 3 13:41:54 CET 2008

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Sat, Mar 01, 2008 at 02:49:30PM +0900, Christian Balzer wrote:
> 
> Hello,
> 
> I'm trying to build a HA cluster here. Each node has 8 2.66GHz cpu cores,
> 24GB RAM and 8 1TB SATA drives behind a LSI (Fusion MPT) SAS 1068E
> controller. Interconnection is via one of 4 1GE interfaces, directly. 
> Kernel is 2.6.22.18 and DRBD is 8.0.11, the storage device in question is
> a 3TB MD RAID5 spread across all 8 drives. The native results for this
> device using ext3 and bonnie for benchmarking are:
> ---
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> borg00a      50000M           120486  36 87998  17           535665  44 390.9   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 128 74265  90 +++++ +++ 83659 100 71540  88 +++++ +++ 81619  99
> ---
> 
> The same test done on the resulting (UpToDate) drbd device:
> ---
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> borg00a      50000M           41801  13 39659  11           413367  37 397.7   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 128 78847  95 +++++ +++ 86936  99 78722  95 +++++ +++ 63054  76
> ---

the 50GB test size far exceeds the "activity log" size you
configured, which covers only 1801 * 4M = about 7G.

so you get constant meta data transactions,
which are synchronous sector writes including barriers.

you can see this in the "al:" numbers increasing,
as well as the "hit/misses/changed" ratio in the act_log line.
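
for example, to watch that while the benchmark runs
(just a sketch; the exact fields in /proc/drbd vary a bit
between drbd versions):
---
# the "al:" counter climbs with every activity log transaction;
# if it grows steadily during a streaming write, the AL is thrashing
watch -n1 cat /proc/drbd
---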

synchronous writes don't have the best latency characteristics
with md raid5.
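
if you want the activity log to cover more of your working set,
something along these lines in the config (a sketch only; the upper
limit for al-extents depends on the drbd version, so treat the number
as an example, and note that a bigger AL also means more to resync
after a primary crash):
---
common {
  syncer {
    rate 160M;
    # each extent covers 4M, so 3833 extents ~ 15G instead of ~7G.
    # covering the full 50G test would need 50G / 4M = 12800 extents,
    # which is probably more than this drbd version accepts.
    al-extents 3833;
  }
}
---
running "drbdadm adjust data-a" afterwards should pick that up
without a restart.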

> The acronym WTF surely did cross my lips at this result.
> Only 33% of the original write speed?
> Using only about 400Mbit/s of the link's capacity and most definitely not
> being CPU bound? 

drbd is perfectly capable of saturating a 1GBit link.
maybe the suggestions below will help.

> And while a 410MB/s read speed is just fine still, how does DRBD manage
> to lose about 20% speed in READs???

don't know, typically it does not.

> I probably should have seen this coming when the syncer on the initial
> build only managed to get a bit shy of 50MB/s throughput, even though it
> was permitted 160 (I was pondering bonding 2 interfaces, but that seems to
> be a wasted effort now).
> 
> The interlink is fine and completely capable of handling the full capacity
> one would expect from it. The only tuning done was to set the MTU to 9000
> since the NPtcp (netpipes-tcp) results showed a slight improvement with
> this setting (800Mbit/s @128k message size versus 840Mbit/s).
> A ftp transfer clearly and happily gobbled up all the bandwidth:
> ---
> 3020690000 bytes sent in 24.95 secs (118231.8 kB/s)
> ---
> 
> I ran ethstats in parallel to all these tests and its numbers confirmed
> the link utilization to match the test results.
> 
> According to the NPtcp results a throughput of about 400Mbit/s would
> equate to a message size of about 16KB, is DRBD really just sending such
> tiny chunks and thus artificially limiting itself?

drbd "message size" is just a few _bytes_ for most non-data
comunication, and the size of the bio (512Byte to 32KB,
depending on what file system sits on top) plus some header bytes.

though, since this is not iscsi, but drbd, I don't think what you
connect to "message size" does really apply here at all, at least not
for streaming writes.

> DRBD conf:
> ---
> common {
>   syncer { rate 160M; al-extents 1801; }
> }
> resource "data-a" {
>   protocol C;
>   startup {
>     wfc-timeout         0;  ## Infinite!
>     degr-wfc-timeout  120;  ## 2 minutes.
>   }
>   disk {
>     on-io-error detach;
>     use-bmbv;
>   }
>   net {
>     # sndbuf-size 512k;	
>     # timeout           60;
>     # connect-int       10;
>     # ping-int          10;
>     max-buffers     2048;
>     # max-epoch-size  2048;

suggestions for this setup:
        sndbuf-size 1M       # or even more, if you try with bonding.
        max-buffers 8000     # or more
        max-epoch-size 8000  # keep it equal to max-buffers

        unplug-watermark
        # try setting it equal to max-buffers,
        # or to half of it,
        # or something like that.
        #
        # also try the opposite:
        # make it small, 64, 128, 800.
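
put together as a config fragment, it would look roughly like this
(the values are starting points to benchmark against each other,
not gospel):
---
  net {
    sndbuf-size       1M;    # or more, if you try bonding
    max-buffers       8000;  # or more
    max-epoch-size    8000;  # keep equal to max-buffers
    unplug-watermark  4000;  # try max-buffers, half of it, or small (64..800)
  }
---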

try different io schedulers for your physical drives.
I still like deadline, because it is simple, and the few parameters
it has are straightforward to tune.  also try setting read-ahead
smallish on your physical devices, and largish on the md.
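
for example (a sketch only, assuming the raid members are sda..sdh
and the array is md5, as in your config; adjust device names to your
setup):
---
# deadline elevator on the component drives
for d in /sys/block/sd[a-h]; do
    echo deadline > $d/queue/scheduler
done

# read-ahead: small on the members, large on the md device
# (blockdev --setra counts in 512-byte sectors)
for d in /dev/sd[a-h]; do
    blockdev --setra 256 $d        # 128k per member
done
blockdev --setra 8192 /dev/md5     # 4M on the array
---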

>   }
>   syncer {
>   }
> 
>   on borg00a {
>     device      /dev/drbd0;
>     disk        /dev/md5;
>     address     10.0.0.1:7789;
>     meta-disk internal;
>   }
> 
>   on borg00b {
>     device     /dev/drbd0;
>     disk       /dev/md5;
>     address    10.0.0.2:7789;
>     meta-disk internal;
>   }
> }
> ---
> 
> The test machines (also with a 1GE interlink) on which I tried DRBD before
> only had an ATA MD RAID1, which was the limiting factor I/O wise, so I
> never saw this coming.
> 
> All the above tests/results were repeated several times and only one
> sample is shown, as they had no significant variations. The machines were
> totally idle otherwise. 
> 
> What am I missing here? Anything else to tune or look for? I didn't play
> with the kernel tcp buffers, since obviously neither NPtcp nor ftp were
> slowed down by those defaults.
> 
> Regards,
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer                NOC
> chibi at gol.com   	Global OnLine Japan/Fusion Network Services
> http://www.gol.com/

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.


