Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello again!

I did some further testing and performance tuning. I was able to raise the average throughput a bit with large max-buffers and by changing the I/O scheduler, but the DRBD throughput is still way behind the maximum throughput of the raw devices, while I/O wait is still going through the roof.

I think this is related to the 4k writes on the secondary. While DRBD accumulates writes on the primary into larger block I/O to the raw disk, it does not on the secondary. All I/O to the raw disk on the secondary is 8 sectors or 4 kbytes large. That is hitting the storage on the secondary with up to 100000 4k I/Os per second, and it seems the raid card limit is kicking in somewhere around this. Additional testing with iostat indicates that the disk utilisation on the secondary goes up to 99% when doing large bulk writes on the primary, whereas on the primary the disk utilisation is only about 60% with the same amount of data being written at the same time.

When DRBD does a sync or resync it transmits larger blocks to the secondary and I am getting up to 600 Mbyte/s write speed on the secondary (which sounds quite okay for a backing device that is able to do somewhere around 700 Mbyte/s). Swapping the roles for a test device did not change anything - the situation is just the other way round: all I/O on the secondary is 4 kbytes large during ongoing replication and considerably larger during sync.

This is how it ALWAYS looks on the secondary during ongoing replication (notice the avgrq-sz of 8 sectors):

Device:  rrqm/s  wrqm/s   r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0.00    0.00  0.00  18.50   0.00   0.07      8.00      0.00   0.11     0.00     0.11   0.11   0.20

Interestingly the I/O wait on the secondary does not rise at all (not over 1% in any situation), but on the primary it rises a lot (as I said, up to 60% of 8 cores when doing large sequential operations) - even though the disks are nearly fully utilized during large sequential writes (it seems DRBD has some kind of throttling mechanism here?).

So I built a test setup using two virtual machines and built DRBD 8.4.3 from source for it, also running on kernel 3.2.35. Interestingly, in my test setup the situation is different: here the block I/O on the secondary raw device has varying sizes, not only 8-sector (4k) requests. Of course I can't do any serious performance benchmark on two virtual machines.

Can anyone confirm this behaviour? Can anyone elaborate why

1. DRBD splits up all I/O into 4k I/O (possibly related to the intention to fit each I/O into a single jumbo frame?)
2. in DRBD 8.3.11 (at least on my hardware) all block I/O hits the secondary backend device with a 4k block size?
3. the DRBD 8.4.3 behaviour seems to be different (is it actually different, or is this just down to the different (virtual) hardware)?

As far as I found out by googling, DRBD should normally determine the maximum block I/O size from the underlying hardware via

/sys/block/<device>/queue/max_hw_sectors_kb

But my hardware actually shows a value of 512 there, which indicates it will accept block I/O up to a size of 512 kbytes - if I am right. The DRBD devices show a value of 128 there, which seems to do actually nothing, as every write request to the DRBD device is issued in a 4k manner. (This is the same behaviour on my 8.4 test setup, but there the writes seem to be accumulated on the secondary just like they are on the primary.)
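For reference, this is roughly how I am reading those limits and watching the request sizes; the device names (sdb for the secondary's backing device, drbd3 for one of the DRBD devices) are the ones from my setup and the iostat interval is just an example:

# request size limits of the backing device behind the raid card
cat /sys/block/sdb/queue/max_hw_sectors_kb     # shows 512 here
cat /sys/block/sdb/queue/max_sectors_kb

# the same limits as exposed by one of the drbd devices
cat /sys/block/drbd3/queue/max_hw_sectors_kb   # shows 128 here
cat /sys/block/drbd3/queue/max_sectors_kb

# request sizes actually hitting the backing device
# (avgrq-sz is reported in 512-byte sectors, so 8 = 4 kbytes)
iostat -xm sdb 1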
Thank you all in advance,

regards, Felix

> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Felix Zachlod
> Sent: Friday, March 15, 2013 08:32
> To: drbd-user at lists.linbit.com
> Subject: [DRBD-user] High I/O Wait when cluster is connected (and reduced throughput)
>
> Hello list,
>
> We are currently investigating a performance issue with our DRBD cluster. We have recently upgraded the hardware in the cluster (added new and additional raid cards, SSD caching and a lot of spindles) and are experiencing the following problem:
>
> When our secondary node is connected to the cluster, this leads to a dramatic performance drop in terms of the time the primary spends in I/O wait and a dramatically reduced throughput to the backing devices.
>
> Before upgrading our cluster hardware we just thought that the high I/O wait was due to our backing devices hitting the ground, but according to my tests that can't be the reason.
>
> We currently have 4 raid sets on each cluster node as backing devices, connected to two LSI 9280 raid cards (in each cluster node) with BBWC and the FastPath and CacheCade options. CacheCade is turned on for these tests and the CacheCade disks are Intel 520 SSDs running in raid 1. Each controller has an SSD cache like this.
>
> Even if there is absolutely no load on the cluster we observe the following:
>
> When doing a single sequential write on the primary node to one of the raid sets, the I/O wait of the cluster node rises to about 40-50% of the CPU time while the secondary is sitting IDLE (I/O wait 0.2 percent there). Notice these are 8-core machines, so 50% I/O wait means 4 cores or 4 threads waiting for the block device all the time. Throughput drops from ~450 Mbyte/s to 200 Mbyte/s compared to the situation where we take down the corresponding drbd device on the secondary.
>
> If the drbd device is running in stand-alone mode on the primary, the I/O wait is as low as ~10-15 percent, which I assume is normal behaviour when a single sequential write is hitting a block device at max rate. We first thought that this might be an issue with our scst configuration, but it also happens if we do a test locally on the cluster node.
>
> The cluster nodes are connected with a 10GBE link in a back-to-back fashion. The measured RTT is about 100us and the measured TCP bandwidth is 9.6 Gbps.
>
> As I said, the backing device on the secondary is just sitting there bored - the I/O wait on the secondary is about 0.2 percent. We already raised the al-extents parameter, as I read that frequent meta data updates could be a cause for performance issues.
>
> This is the current config of one of our drbd devices:
>
> resource XXX {
>
>     device    /dev/drbd3;
>     disk      /dev/disk/by-id/scsi-3600605b00494046018bf37719f7e1786-part1;
>     meta-disk internal;
>
>     net {
>         max-buffers      16384;
>         unplug-watermark 32;
>         max-epoch-size   16384;
>         sndbuf-size      1024k;
>     }
>
>     syncer {
>         rate       200M;
>         al-extents 3833;
>     }
>
>     disk {
>         no-disk-barrier;
>         no-disk-flushes;
>     }
>
>     on node-a {
>         address 10.1.200.1:7792;
>     }
>
>     on node-b {
>         address 10.1.200.2:7792;
>     }
> }
>
> Here is a screenshot showing a single sequential write to the device (the primary is on the left): http://imageshack.us/f/823/drbd.png/. By the way, can anyone possibly elaborate why the tps is so much higher on the secondary?
>
> There are 16GB of memory in each node, and the device we are talking about is a 12-spindle raid 6 with ssd caching (read and write). We already tried disabling the ssd cache but that makes things even worse.
> Although the cluster is STILL responsive most of the time, this is becoming an issue for us, as it seems to impact performance on all devices on the cluster: the I/O service time of ALL devices rises to about 300ms when e.g. a storage vmotion is done on one of the devices, so putting more disks into the cluster does not help us reasonably improve performance at the moment.
>
> BTW this happens on ALL drbd devices in this cluster, with different raid sets, different disks and so on.
>
> Thanks to everyone in advance,
>
> regards, Felix
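For anyone who wants to reproduce the baseline numbers from the message quoted above (sequential write throughput, RTT and TCP bandwidth): the invocations below are only reconstructed examples of the kind of measurements meant there, not the exact commands used; /dev/drbdX stands in for a scratch/test drbd device, and an iperf server is assumed to be running on node-b.

# sequential write throughput to a test drbd device, bypassing the page cache
# (/dev/drbdX is a placeholder - this overwrites the device, use a scratch device only)
dd if=/dev/zero of=/dev/drbdX bs=1M count=16384 oflag=direct

# round trip time on the back-to-back replication link (node-b as seen from node-a)
ping -c 100 10.1.200.2

# raw TCP bandwidth over the 10GBE link ("iperf -s" running on node-b)
iperf -c 10.1.200.2 -t 30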