[DRBD-user] DRBD sync speed much slower than expected

A.Rubio aurusa at etsii.upv.es
Tue Feb 24 10:35:35 CET 2015

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


If the bonding between the 2 nodes is active/passive,
your network speed is only 1 Gbit/s = approx. 120 MB/s.
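(For reference: 1 Gbit/s / 8 = 125 MB/s raw; after Ethernet and TCP overhead roughly 110-120 MB/s of payload remain.)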

 

From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On behalf of Max Weissbach
Sent: Monday, 23 February 2015 17:05
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] DRBD sync speed much slower than expected

 

Hi,

I'm using drbd (8.3.13) in a ganeti 2.11.6 environment.

The system is set up as follows:
* 3 physical machines, 2x 6-core Xeon + 96 GB RAM per node
(all hosts running Debian Wheezy)
* 4 SSDs in RAID 10 via a hardware RAID controller
* 500 GByte of this volume as an LVM VG for Ganeti (as required by Ganeti)
* network between the nodes: 2x 1 Gbit using port bonding (active/backup failover, bond mode 0 - see the sketch after this list)
* network speed, though, is limited to 500 Mbit/s by the DC (money and stuff -.-*)
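
For reference, a Debian-style bonding stanza for this looks roughly like the sketch below (interface names, address and the primary slave are assumptions, not copied from my actual config). Note that bond mode 0 is balance-rr; active/backup failover is bond mode 1 (active-backup):

#################

auto bond0
iface bond0 inet static
    address 10.46.0.3
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode active-backup
    bond-miimon 100
    bond-primary eth0

#################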

My problem is the following. For the first, let's say, 2 weeks of setting up the Ganeti cluster (and filling it with KVM VMs), everything was fine.
DRBD volumes were syncing at top speed, as seen in iftop monitoring the bonding device - meaning 400-450 Mbit/s, no complaints.
Then suddenly every installation took much longer than it had before. Syncing speeds never exceeded 130 Mbit/s from that moment on.

I started searching at the bottom, i.e. network and physical disks. The network had the desired capacity all the time, with scp running 3 or 4 times faster than the DRBD sync
(even WHILE a sync was in progress). iperf measurements from one node to another confirmed that:

#################

------------------------------------------------------------
Client connecting to 10.46.0.2, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 10.46.0.3 port 47716 connected with 10.46.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   520 MBytes   435 Mbits/sec

#################
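
For completeness, the output above came from a plain default run like this (iperf server on 10.46.0.2, client on 10.46.0.3):

#################

# iperf -s
# iperf -c 10.46.0.2

#################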

The disks are okay too - I could not imagine an SSD RAID 10 hitting the brakes at 100 Mbit/s anyway, but dd to be sure:

#################

# dd if=/dev/zero of=/tmp/foobar bs=1k count=1000000
1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB) copied, 1.96102 s, 522 MB/s

#################
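
To rule out the page cache skewing that number, a variant that forces the data to disk would look like this (path and sizes are just examples):

#################

# dd if=/dev/zero of=/tmp/foobar bs=1M count=1000 conv=fdatasync

#################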

I've then tried to dig a bit deeper into DRBD and changed a couple of settings there (roughly as sketched below), e.g.:

* trying different sync rates (in static mode)
* trying the dynamic syncer with several min and max limits
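
Roughly the kind of settings I mean, as a drbd.conf-style sketch (Ganeti generates the actual resource config itself, so the resource name and values here are purely illustrative; note that DRBD rates are in bytes/s, i.e. 40M means 40 MByte/s):

#################

resource r0 {
  syncer {
    rate 40M;           # static resync rate, bytes/s
    # dynamic resync controller (DRBD >= 8.3.9):
    # c-plan-ahead 20;
    # c-min-rate 4M;
    # c-max-rate 50M;
  }
  ...
}

# or changed temporarily at runtime:
# drbdsetup /dev/drbd0 syncer -r 40M

#################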

When enabling the Ganeti-specific option --prealloc-disk-wipe, the behaviour recovered for a short time, but speeds dropped again :/

I then refocused on the network and set MTU 9000 and txqueuelen 10000 on the bonding interface, which again fixed it for a few minutes, only to get bad again.
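
Concretely something like this (assuming the bond device is called bond0):

#################

# ip link set dev bond0 mtu 9000
# ip link set dev bond0 txqueuelen 10000

#################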

I've tried setting several TCP-related sysctl options such as net.core.netdev_max_backlog and net.ipv4.tcp_mtu_probing.
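
Along the lines of the following (the values are only examples of what I tried, not a recommendation):

#################

# sysctl -w net.core.netdev_max_backlog=30000
# sysctl -w net.ipv4.tcp_mtu_probing=1

#################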

It all ends in a syncing speed of at most 100 Mbit/s.

My problem isn't the initial creation of VMs (the initial sync can be put into the background so the VM can already install while the sync is in progress). The main point is that the slow sync acts as a bottleneck for disk actions on the VMs themselves.

That effectively gives me a limit of 100 Mbit/s virtual drive speed (especially when handling big files). I could use DRBD protocol A to prevent this, but data integrity in case of failures is more important to me.
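
For context, the replication protocol is set per resource in the DRBD config; the snippet below is only a sketch, since Ganeti writes this configuration itself:

#################

resource r0 {
  protocol C;   # synchronous writes - what I want to keep for integrity
  # protocol A; # asynchronous - would hide the network bottleneck, but
  #             # writes are confirmed after reaching the local disk only
  ...
}

#################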

I'm actually running out of ideas about what else to try.

Does anyone have an idea where I forgot to look - any suggestions?

Best regards and thanks in advance
Max


