Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello list,

We are currently investigating a performance issue with our DRBD cluster. We recently upgraded the hardware in the cluster (added new and additional RAID cards, SSD caching and a lot of spindles) and are now seeing the following problem: whenever the secondary node is connected, the primary suffers a dramatic performance drop, both in the time it spends in I/O wait and in throughput to the backing devices. Before the hardware upgrade we simply assumed the high I/O wait was caused by our backing devices hitting their limits, but according to my tests that cannot be the reason.

We currently have 4 RAID sets on each cluster node as backing devices, connected to two LSI 9280 RAID cards per node with BBWC, FastPath and the CacheCade option. CacheCade is turned on for these tests; the CacheCade disks are Intel 520 SSDs running in RAID 1, and each controller has an SSD cache like this.

Even with absolutely no load on the cluster we observe the following: when doing a single sequential write on the primary node to one of the RAID sets, the I/O wait of that node rises to about 40-50 % of CPU time, while the secondary sits idle (I/O wait of 0.2 % there). Note that these are 8-core machines, so 50 % I/O wait means 4 cores or 4 threads waiting on the block device all the time. Throughput drops from ~450 MByte/s to ~200 MByte/s compared to the situation where we take down the corresponding DRBD device on the secondary. If the DRBD device runs in StandAlone mode on the primary, the I/O wait is as low as ~10-15 %, which I assume is normal behaviour when a single sequential write hits a block device at maximum rate. (A sketch of the commands we use for this kind of test is further down in this mail.)

We first thought this might be an issue with our SCST configuration, but it also happens when we run the test locally on the cluster node.

The cluster nodes are connected back-to-back with a 10 GbE link. The measured RTT is about 100 us and the measured TCP bandwidth is 9.6 Gbit/s. As I said, the backing device on the secondary is just sitting there bored; the I/O wait on the secondary is about 0.2 %.

We already raised the al-extents parameter, as I read that frequent meta-data updates can cause performance problems in DRBD. This is the current config of one of our DRBD devices:

    resource XXX {
        device    /dev/drbd3;
        disk      /dev/disk/by-id/scsi-3600605b00494046018bf37719f7e1786-part1;
        meta-disk internal;
        net {
            max-buffers      16384;
            unplug-watermark 32;
            max-epoch-size   16384;
            sndbuf-size      1024k;
        }
        syncer {
            rate       200M;
            al-extents 3833;
        }
        disk {
            no-disk-barrier;
            no-disk-flushes;
        }
        on node-a {
            address 10.1.200.1:7792;
        }
        on node-b {
            address 10.1.200.2:7792;
        }
    }

Here is a screenshot showing a single sequential write to the device (the primary is on the left): http://imageshack.us/f/823/drbd.png/

By the way, can anyone explain why the tps is so much higher on the secondary? There is 16 GB of memory in each node, and the device we are talking about is a 12-spindle RAID 6 with SSD caching (read and write). We already tried disabling the SSD cache, but that makes things even worse.

Although the cluster is still responsive most of the time, this is becoming a real issue for us, because it seems to affect all devices in the cluster: the I/O service time of ALL devices rises to about 300 ms when, for example, a storage vMotion runs on one of them. So adding more disks to the cluster does not reasonably improve performance at the moment.
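To make the test easier to reproduce, here is a rough sketch of the kind of commands involved. The dd sizes, the iostat interval and the way the secondary is taken out of the picture are just examples, not the exact invocations from our runs:

    # single sequential write to the DRBD device on the primary
    dd if=/dev/zero of=/dev/drbd3 bs=1M count=20000 oflag=direct

    # in a second terminal on the primary: watch iowait and per-device utilisation
    iostat -x 1

    # comparison run: drop replication for this resource so the primary
    # is StandAlone, repeat the dd, then reconnect
    drbdadm disconnect XXX
    # ... repeat the dd ...
    drbdadm connect XXX

With the peer connected we see the 40-50 % I/O wait and ~200 MByte/s; in StandAlone mode the same write gives ~450 MByte/s at 10-15 % I/O wait.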
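For reference, the link numbers quoted above can be reproduced with ordinary tools; something along these lines (options are just examples):

    # on node-b
    iperf -s

    # on node-a
    ping -c 20 10.1.200.2        # RTT ~100 us
    iperf -c 10.1.200.2 -t 30    # TCP bandwidth ~9.6 Gbit/s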
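And for completeness, the al-extents change was applied roughly like this (the value is the one shown in the config above; how exactly we rolled it out may differ slightly):

    # in drbd.conf: syncer { al-extents 3833; }
    # then apply the changed settings without taking the resource down
    drbdadm adjust XXX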
By the way, this happens on ALL DRBD devices in this cluster, with different RAID sets, different disks and so on.

Thanks to everyone in advance,
regards,
Felix