Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

I'm trying to build an HA cluster here. Each node has 8 2.66GHz CPU cores, 24GB RAM and 8 1TB SATA drives behind an LSI (Fusion MPT) SAS 1068E controller. Interconnection is via one of 4 1GE interfaces, directly. The kernel is 2.6.22.18 and DRBD is 8.0.11; the storage device in question is a 3TB MD RAID5 spread across all 8 drives.

The native results for this device using ext3 and bonnie for benchmarking are:
---
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
borg00a     50000M           120486  36 87998  17           535665  44 390.9   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                128 74265  90 +++++ +++ 83659 100 71540  88 +++++ +++ 81619  99
---

The same test done on the resulting (UpToDate) DRBD device:
---
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
borg00a     50000M            41801  13 39659  11           413367  37 397.7   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                128 78847  95 +++++ +++ 86936  99 78722  95 +++++ +++ 63054  76
---

The acronym WTF surely did cross my lips at this result. Only 33% of the original write speed? Using only about 400Mbit/s of the link's capacity and most definitely not being CPU bound? And while a 410MB/s read speed is still just fine, how does DRBD manage to lose about 20% speed in READs???

I probably should have seen this coming when the syncer on the initial build only managed a bit shy of 50MB/s throughput, even though it was permitted 160 (I was pondering bonding 2 interfaces, but that seems to be a wasted effort now).

The interlink is fine and completely capable of handling the full capacity one would expect from it. The only tuning done was to set the MTU to 9000, since the NPtcp (netpipes-tcp) results showed a slight improvement with this setting (800Mbit/s at a 128k message size versus 840Mbit/s).

An FTP transfer clearly and happily gobbled up all the bandwidth:
---
3020690000 bytes sent in 24.95 secs (118231.8 kB/s)
---

I ran ethstats in parallel to all these tests and its numbers confirmed that the link utilization matched the test results.

According to the NPtcp results, a throughput of about 400Mbit/s would equate to a message size of about 16KB. Is DRBD really just sending such tiny chunks and thus artificially limiting itself?

DRBD conf:
---
common {
  syncer {
    rate 160M;
    al-extents 1801;
  }
}

resource "data-a" {
  protocol C;

  startup {
    wfc-timeout        0;    ## Infinite!
    degr-wfc-timeout   120;  ## 2 minutes.
  }

  disk {
    on-io-error detach;
    use-bmbv;
  }

  net {
    # sndbuf-size    512k;
    # timeout        60;
    # connect-int    10;
    # ping-int       10;
    max-buffers      2048;
    # max-epoch-size 2048;
  }

  syncer {
  }

  on borg00a {
    device    /dev/drbd0;
    disk      /dev/md5;
    address   10.0.0.1:7789;
    meta-disk internal;
  }

  on borg00b {
    device    /dev/drbd0;
    disk      /dev/md5;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
---

The test machines (also with a 1GE interlink) on which I tried DRBD before only had an ATA MD RAID1, which was the limiting factor I/O-wise, so I never saw this coming.
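For completeness, the net-section knobs I'm aware of (from the drbd.conf man page) and was planning to experiment with next look roughly like this; the values are pure guesswork on my part, nothing here has been tested yet:
---
net {
    sndbuf-size       1024k;  # bigger TCP send buffer (I currently have 512k commented out)
    max-buffers       8192;   # more receive buffers/requests in flight on the peer
    max-epoch-size    8192;   # allow more write requests per barrier/epoch
    unplug-watermark  8192;   # unplug the lower-level device less often
}
---
If anyone knows which of these (if any) actually matter for streaming write throughput over a single 1GE link, I'd love to hear it.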
All of the above tests/results were repeated several times and only one sample is shown, as there were no significant variations. The machines were totally idle otherwise.

What am I missing here? Anything else to tune or look for? I didn't play with the kernel TCP buffers, since obviously neither NPtcp nor FTP were slowed down by those defaults.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer        NOC
chibi at gol.com        Global OnLine Japan/Fusion Network Services
http://www.gol.com/
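P.S.: In case someone does suggest touching the kernel TCP buffers after all, these are the sysctls I mean; the values below are just an illustrative guess and are still at their defaults here, not something I've tested:
---
# socket buffer hard limits plus the TCP autotuning ranges (min default max, bytes)
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
sysctl -w net.ipv4.tcp_wmem="4096 65536 8388608"
---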