[DRBD-user] low write throughput

Fri Jan 8 15:48:53 CET 2010

I'm preparing a classic 2-node cluster and cannot understand the results I'm getting for write throughput.

Hardware & software:
* Dell PowerEdge R300 servers, 1x Xeon X3323 (2.50GHz qc), 16GiB ram, Dell SAS/6 (LSI SAS1068E) PCIe x8 controller, 2x Seagate Barracuda ES.2 SAS 750GB disks
* Broadcom NetXtreme II BCM5709 PCIe x4 dual gigabit card (rx and tx checksum offload, scatter-gather, tcp segmentation offload), with latest bnx2 driver v1.9.20b (MSI-X interrupts)
* Linux-2.6.30.9 from 'vanilla' sources. But I got exactly the same results with XenLinux-2.6.18.8 under Xen-3.4.2 hypervisor (the cluster will eventually end up running Xen)
* DRBD-8.0.16 (I wasn't able to get any 8.3.x working with XenLinux-2.6.18.8: Gentoo Portage will compile 8.3.6, but then the kernel oopses)
* I got the same results when testing drbd-8.3.6 with linux-2.6.30.9 (with the same drbd.conf I used with 8.0.16)

Configuration:
* linux software raid1 partition for Domain-0 filesystem (6GB), swap partition (0.5G), drbd partition (~690G)
* two 'cross' connections between the NetXtremeII dual gigabit cards, jumbo frames (MTU=9000), rr-balance bonding
* drbd is in Primary/Secondary mode
* all benchmarks were made on a single ext3 filesystem made on the whole device (/dev/sda4 for raw disk, /dev/md1 for raid0, /dev/drbd0 for drbd), mounted noatime
* all benchmarks were repeated many times (16 runs for bonnie++, 8 runs for dd) and variance was always negligible, so the results are really consistent between runs
* all drbd benchmarks were made with the resource UpToDate and connected
* I did an mkfs.ext3 and a reboot prior to any bonnie++ benchmark

Baseline performance was exactly as expected:
* netperf measured 1968Mbit/s through the 2x1000Mbit link (and 990Mbit/s with one cable disconnected)
* dd and bonnie++ show 116MiB/s read and 98MiB/s write throughput on the single disk (fs on /dev/sda4)
* dd and bonnie++ show 214MiB/s read and 190MiB/s write throughput on software raid0 (fs on /dev/md1 made up of sda4 and sdb4)

Synchronization performance with drbd is very good:
* after raising the syncer rate I was able to get the two nodes sync'ing both disks (sda/srv1 -> sda/srv2, and the opposite sdb/srv2 -> sdb/srv1) at the same time at >80MiB/s (there was nearly 700Mbit/s in each direction on the network link); both disks were UpToDate in a little over 2 hours.
* that should confirm the systems don't have a bus bottleneck in either the network or the SAS cards
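
As a rough cross-check of the sync figures above, here is a small sketch of the arithmetic. It assumes the "~690G" drbd partition is about 690 GiB and counts payload only (no TCP/IP or DRBD protocol overhead); the function names are just for illustration.

```python
MIB = 1024 * 1024  # bytes per MiB

def mib_per_s_to_mbit_per_s(rate_mib_s):
    """Convert a payload rate in MiB/s to decimal Mbit/s."""
    return rate_mib_s * MIB * 8 / 1e6

def resync_hours(size_gib, rate_mib_s):
    """Hours needed to copy size_gib GiB at a steady rate_mib_s MiB/s."""
    return size_gib * 1024 / rate_mib_s / 3600

# 80 MiB/s of payload per direction (assumed steady rate)
print(round(mib_per_s_to_mbit_per_s(80)))   # 671
# assumed 690 GiB partition at 80 MiB/s
print(round(resync_hours(690, 80), 2))      # 2.45
```

About 671Mbit/s of payload is in line with the "nearly 700Mbit/s" seen on the wire once protocol overhead is added, and roughly 2.45 hours matches "a little over 2 hours".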

When using DRBD, read throughput is around 116MiB/s as expected, but write throughput is much lower. The next figures were taken using dd with bs=512M and count=1.
- with the default configuration (no relevant options in drbd.conf) I get 48MB/sec with protocol C and 49MB/sec with protocol A
- increasing max-buffers and max-epoch-size (with equal values) actually yields lower throughput: with protocol C it goes from 48MB/s (2048, default value) to 42MB/s (4000) to 39MB/s (8000)
- increasing sndbuf-size to 512k or 1024k does not change anything
- changing unplug-watermark from a minimum of 16 to a maximum equal to max-buffers produces only negligible changes in write performance (+/- 1MiB/sec)
- the above holds true for any combination: sndbuf-size does not change performance with any protocol and any buffers and any unplug-watermark tried, and an increase in max-* always degrades performance
- the good thing is, write speed on resource #1 from node1 does not slow down even while syncing resource #2 from node2 to node1 (as expected, by the way)
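
For anyone who wants to reproduce the write figures without dd, here is a minimal sketch of an equivalent measurement. It is not the exact command used above: it assumes the final flush to disk (fsync) should be counted in the timed interval, and the name timed_write and the 1 MiB block size are arbitrary choices for illustration.

```python
import os
import time

def timed_write(path, total_bytes, block_bytes=1024 * 1024):
    """Write total_bytes of zeros to path, fsync, and return the rate in MiB/s."""
    block = b"\0" * block_bytes
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = total_bytes
        while remaining > 0:
            # os.write may write fewer bytes than requested, so track the count
            written = os.write(fd, block[:min(block_bytes, remaining)])
            remaining -= written
        # include the flush to stable storage in the measurement
        os.fsync(fd)
    finally:
        os.close(fd)
    elapsed = max(time.monotonic() - start, 1e-9)
    return total_bytes / (1024 * 1024) / elapsed
```

A call such as timed_write("/mnt/drbd0/testfile", 512 * 1024 * 1024) would write 512MiB, roughly matching dd with bs=512M count=1; the mount point here is hypothetical.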

The DRBD documentation says I should expect nearly the same sequential write throughput with and without drbd, but I'm getting less than half of that. I tried tweaking the config without success; if anything, I could only make things worse. Changing between old and new kernels or old and new drbd doesn't seem to make any difference.

I'd like to know if I'm correct in assuming I should get something around 90MiB/sec of write throughput even with drbd, and if I'm doing anything wrong in configuring drbd or benchmarking it. Any help will be appreciated, thanks.

My plan was to go look at latency figures after checking out throughput, following your documentation, but for reference, I get ~8.4ms latency with drbd and ~8.3ms directly on disk, which I think are correct and expected numbers from these systems.
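
A quick sketch of why ~8.3ms looks plausible for the raw disk. It rests on two assumptions not stated above: that the disks spin at 7200 rpm, and that the latency test rewrites the same sector synchronously, so each write waits about one full platter rotation.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm

# assumed 7200 rpm spindle speed
print(round(rotation_ms(7200), 2))   # 8.33
# extra latency attributable to drbd, from the figures above
print(round(8.4 - 8.3, 2))           # 0.1
```

Under those assumptions the disk figure is essentially one rotation, and drbd adds on the order of 0.1ms per write.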

--
Luca Lesinigo
LM Networks Srl