Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,
I'm trying to build an HA cluster here. Each node has 8 2.66GHz CPU cores,
24GB RAM and 8 1TB SATA drives behind an LSI (Fusion MPT) SAS 1068E
controller. The interconnect is one of four 1GE interfaces, connected
directly. The kernel is 2.6.22.18 and DRBD is 8.0.11; the storage device
in question is a 3TB MD RAID5 spread across all 8 drives. The native
results for this device, using ext3 and bonnie for benchmarking, are:
---
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
borg00a 50000M 120486 36 87998 17 535665 44 390.9 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
128 74265 90 +++++ +++ 83659 100 71540 88 +++++ +++ 81619 99
---
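(For reproducibility: assuming bonnie++ 1.03 here, a run with matching
parameters would look roughly like the one below; the mount point is only
a placeholder and -f is a guess based on the empty per-character columns.)
---
# illustrative invocation only -- directory and flags are assumptions
bonnie++ -d /mnt/md5test -s 50000 -n 128 -f -u root
---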
The same test done on the resulting (UpToDate) drbd device:
---
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
borg00a 50000M 41801 13 39659 11 413367 37 397.7 1
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
128 78847 95 +++++ +++ 86936 99 78722 95 +++++ +++ 63054 76
---
The acronym WTF surely did cross my lips at this result.
Only 33% of the original write speed?
Using only about 400Mbit/s of the link's capacity and most definitely not
being CPU bound?
And while a 410MB/s read speed is still just fine, how does DRBD manage
to lose about 20% speed in READs???
I probably should have seen this coming when the syncer on the initial
build only managed a bit shy of 50MB/s throughput, even though it was
permitted 160MB/s (I was pondering bonding two interfaces, but that seems
to be a wasted effort now).
The interlink is fine and completely capable of handling the full capacity
one would expect from it. The only tuning done was to set the MTU to 9000,
since the NPtcp (NetPIPE TCP) results showed a slight improvement with
this setting (from 800Mbit/s at a 128KB message size to 840Mbit/s).
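(The jumbo frame setting itself is nothing exotic; on both nodes it is just
along the lines of the following, where eth1 merely stands in for whichever
of the four interfaces carries the interlink.)
---
# interface name is only an example for the replication link
ip link set dev eth1 mtu 9000
---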
An FTP transfer clearly and happily gobbled up all the bandwidth:
---
3020690000 bytes sent in 24.95 secs (118231.8 kB/s)
---
I ran ethstats in parallel to all these tests and its numbers confirmed
that the link utilization matched the test results.
According to the NPtcp results, a throughput of about 400Mbit/s would
equate to a message size of about 16KB. Is DRBD really just sending such
tiny chunks and thus artificially limiting itself? (A net-section tweak to
test this is sketched after the config below.)
DRBD conf:
---
common {
    syncer { rate 160M; al-extents 1801; }
}

resource "data-a" {
    protocol C;
    startup {
        wfc-timeout      0;    ## Infinite!
        degr-wfc-timeout 120;  ## 2 minutes.
    }
    disk {
        on-io-error detach;
        use-bmbv;
    }
    net {
        # sndbuf-size    512k;
        # timeout        60;
        # connect-int    10;
        # ping-int       10;
        max-buffers      2048;
        # max-epoch-size 2048;
    }
    syncer {
    }
    on borg00a {
        device    /dev/drbd0;
        disk      /dev/md5;
        address   10.0.0.1:7789;
        meta-disk internal;
    }
    on borg00b {
        device    /dev/drbd0;
        disk      /dev/md5;
        address   10.0.0.2:7789;
        meta-disk internal;
    }
}
---
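(If the small-chunk theory holds, the obvious candidates to experiment with
are the options already sitting commented out in the net section, roughly as
below; the values are untested guesses of mine, not recommendations.)
---
net {
    sndbuf-size    512k;  # larger send buffer, untested guess
    max-buffers    8192;  # untested guess
    max-epoch-size 8192;  # untested guess
}
---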
The test machines (also with a 1GE interlink) on which I tried DRBD before
only had an ATA MD RAID1, which was the limiting factor I/O-wise, so I
never saw this coming.
All the above tests were repeated several times and only one sample each is
shown, as there was no significant variation. The machines were otherwise
totally idle.
What am I missing here? Anything else to tune or look for? I didn't play
with the kernel TCP buffers, since obviously neither NPtcp nor FTP was
slowed down by those defaults.
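(Should the kernel TCP buffers turn out to matter for DRBD's socket after
all, the usual knobs would be something like the following; the values are
merely a starting point I have not verified on these machines.)
---
# untested starting point for TCP buffer sizing
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
sysctl -w net.ipv4.tcp_wmem="4096 65536 8388608"
---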
Regards,
Christian
--
Christian Balzer Network/Systems Engineer NOC
chibi at gol.com Global OnLine Japan/Fusion Network Services
http://www.gol.com/