Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

I'm using DRBD (8.3.13) in a Ganeti 2.11.6 environment. The system is set up as follows:

* 3 physical machines, each with 2x 6-core Xeon and 96 GB RAM (all nodes running Debian Wheezy)
* 4 SSDs in RAID 10 behind a hardware RAID controller
* 500 GB of this volume as an LVM VG for Ganeti (as Ganeti requires)
* network between the nodes: 2x 1 Gbit with port bonding (active/slave failover, bond mode 0)
* network speed, though, is limited to 500 Mbit/s by the DC (money and stuff -.-*)

My problem is the following. For the first two weeks or so of setting up the Ganeti cluster (and filling it with KVM VMs), everything was fine: DRBD volumes were syncing at top speed as seen in iftop on the bonding device, i.e. 400-450 Mbit/s - no complaints. Then suddenly every installation took much longer than before, and sync speeds never exceeded 130 Mbit/s from that moment on.

I started searching at the bottom, i.e. network and physical disks. The network had the expected capacity the whole time: scp ran 3-4 times faster than the DRBD sync (even WHILE a sync was in progress). iperf measurements from one node to another confirmed that (exact invocation in the P.S. below):

#################
------------------------------------------------------------
Client connecting to 10.46.0.2, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 10.46.0.3 port 47716 connected with 10.46.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   520 MBytes   435 Mbits/sec
#################

The disks are fine too - I could not imagine a RAID 10 SSD array being the bottleneck at 100 Mbit/s anyway - but I ran dd to be sure:

#################
# dd if=/dev/zero of=/tmp/foobar bs=1k count=1000000
1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB) copied, 1.96102 s, 522 MB/s
#################

I then tried to dig a bit deeper into DRBD and changed a couple of settings there, e.g.:

* trying different sync rates (in static mode)
* trying the dynamic syncer with several min and max limits

(A sketch of the kind of settings I mean is in the P.S. below.)

When I enabled the Ganeti-specific option --prealloc-disk-wipe, the behaviour recovered for a short time, but speeds dropped again :/

I then refocused on the network and set MTU 9000 and txqueuelen 10000 on the bonding interface, which again fixed it for a few minutes, only to get bad again. I've also tried several TCP-related sysctl options such as net.core.netdev_max_backlog and net.ipv4.tcp_mtu_probing (example commands in the P.S. as well). It all ends in a sync speed of at most 100 Mbit/s.

My problem isn't the initial creation of VMs (the initial sync can be put into the background, so a VM can already install while the sync is in progress). The main point is that the slow sync acts as a bottleneck for disk writes on the VMs themselves: it effectively caps virtual disk throughput at 100 Mbit/s (especially when handling big files). I could use DRBD protocol A to avoid this, but data integrity in case of failures is more important to me.

I'm running out of ideas what else to try. Does anyone have an idea where I forgot to look, or any suggestions?

Best regards and thanks in advance
Max
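P.S.: here are the command sketches referred to above. The iperf numbers came from a plain default server/client run between two nodes - the flags below are what such a default run looks like, not a verbatim copy of my shell history:

#################
# on node 10.46.0.2 (server side)
iperf -s

# on node 10.46.0.3 (client side), default 10-second TCP test
iperf -c 10.46.0.2
#################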
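The syncer settings I experimented with looked roughly like the following - a minimal sketch with placeholder rates, not my exact values. On DRBD 8.3 both the static rate and the dynamic controller (available since 8.3.9) live in the syncer section:

#################
common {
  syncer {
    # static mode: fixed resync rate cap (placeholder value)
    rate 50M;

    # dynamic controller (DRBD >= 8.3.9): with c-plan-ahead > 0,
    # DRBD adjusts the resync rate between c-min-rate and c-max-rate
    #c-plan-ahead 20;     # planning horizon, in tenths of a second
    #c-fill-target 50k;   # amount of in-flight resync data to aim for
    #c-min-rate 10M;
    #c-max-rate 50M;
  }
}
#################

Since Ganeti generates the DRBD configuration itself, such values have to go through Ganeti's DRBD disk parameters (something like gnt-cluster modify -D drbd:c-max-rate=..., if I have the syntax right) rather than into drbd.conf by hand.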
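And the network-side tuning, again as a sketch - bond0 as the interface name is an assumption, and the sysctl values are simply the ones I tried:

#################
# jumbo frames and a longer tx queue on the bonding interface
ip link set dev bond0 mtu 9000
ip link set dev bond0 txqueuelen 10000

# TCP-related sysctls
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.ipv4.tcp_mtu_probing=1
#################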