[DRBD-user] kvm, drbd, elevator, rotational - quite an interesting co-operation

Thu Jul 2 21:55:05 CEST 2009

Hello.

I'm new on drbd-user@, but long-time user of kvm.

[A side note: drbd-user@ appears to be a subscribers-only
  list.  A sad consequence of spammers' activity... I doubt
  many knowledgeable people on kvm@ are subscribed to
  drbd-user@ to be able to post there.  Oh well.]

Today I tried drbd for the first time, and decided to start
experimenting on two virtual machines.

And immediately ran into interesting issues, described below.
The way to the results wasn't easy, trust me ;)

Software used:

  DRBD version 8.3.1 (8.3.2rc1 spewed numerous assertion
   failures and crashed both virtual and non-virtual machines).
  kernel 2.6.29.5 compiled for x86 (i686smp)
  lvm kernel module from 2.6.29
  kvm userspace qemu-kvm-0.10.5.

Both virtual machines were identical, using virtio_net (bridged)
and virtio_blk (on raw files on ext4 fs).

I first set drbd up, waited until it finished the initial resync,
then formatted (ext3) and mounted the filesystem on /dev/drbd1.
Next I did a trivial test - copying /usr (a tiny 140MB) to
the drbd filesystem.

This took........  29 *minutes* (!!!).  Which is about 80
kilobytes per sec.  Reminds me of the good ol' dialup modem... ;)

At this point I tried many different things, including
stracing the kvm process on the host.  The thing is that both
virtio_net and virtio_blk work quite well: copying that same
/usr between the two virtual machines over nfs takes less
than a minute.  And finally, after finding the key knobs,
here are the results:

/sys/block/vda/queue/rotational and the elevator (../scheduler)
on the drbd *secondary* node are the knobs in question.
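
For anyone who wants to flip the same knobs from a script, here is a minimal sketch. It assumes the elevator is exposed as the `scheduler` file next to `rotational` in the queue directory, and that the active scheduler is the bracketed word in that file. The function names and the `sysfs_root` parameter are made up for illustration (the latter only so the code can be tried against a scratch directory).

```python
from pathlib import Path


def read_knobs(device, sysfs_root="/sys/block"):
    """Return the 'rotational' flag and the active I/O scheduler of a device."""
    queue = Path(sysfs_root) / device / "queue"
    rotational = int((queue / "rotational").read_text().strip())
    # The scheduler file lists the available elevators and marks the active
    # one with brackets, e.g. "noop anticipatory deadline [cfq]".
    words = (queue / "scheduler").read_text().split()
    active = next((w.strip("[]") for w in words if w.startswith("[")),
                  " ".join(words))
    return {"rotational": rotational, "scheduler": active}


def write_knobs(device, rotational, scheduler, sysfs_root="/sys/block"):
    """Set both knobs; on a real system this has to run as root."""
    queue = Path(sysfs_root) / device / "queue"
    (queue / "rotational").write_text(f"{int(rotational)}\n")
    (queue / "scheduler").write_text(f"{scheduler}\n")
```

On the secondary guest, `write_knobs("vda", 1, "cfq")` followed by `read_knobs("vda")` would correspond to the fastest SECONDARY-only setting described below.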

First of all, for some reason virtio_blk sets its 'rotational'
attribute to 0.  Which was a surprise to me - both to discover
this flag for virtio_blk (I understand it's present on all
block devices) and to see its default value being set to 0.
(A side note: the flag is probably misnamed, it was meant
to show how costly disk seeks are: rotational disks vs
flash media where seeks are not present at all).

So, with both primary and secondary virtual disks on a real
hdd and with the default rotational=0 and (my default) elevator=noop,
the copy takes 29 minutes.

By changing the two mentioned knobs on PRIMARY I was able to
reduce that to 27 minutes.

Now, with rotational=1 and elevator=noop on SECONDARY the
whole thing completed in 2m40s.  Which is about a 10 times
difference!

Next, with rotational=1 and elevator=cfq on SECONDARY, it
completes in about 1m10s.

With rotational=1 and elevator=cfq on BOTH nodes it completes
in 50 sec.

(for the record, elevator=cfq and rotational=0 on SECONDARY makes
it complete in about 9 minutes)
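
To put the numbers above side by side, here is a small sketch that turns the reported wall-clock times into throughput and speedup relative to the default case. The only assumption is that the "140Mb" of /usr means roughly 140 megabytes; the timings are the ones quoted in this mail.

```python
# Wall-clock times reported above for copying /usr, in seconds.
TIMINGS = [
    ("rotational=0, noop (defaults)", 29 * 60),
    ("knobs changed on PRIMARY only", 27 * 60),
    ("rotational=1, noop on SECONDARY", 2 * 60 + 40),
    ("rotational=1, cfq on SECONDARY", 1 * 60 + 10),
    ("rotational=1, cfq on BOTH", 50),
    ("rotational=0, cfq on SECONDARY", 9 * 60),
]
DATA_KB = 140 * 1000  # assumption: "140Mb" means roughly 140 megabytes


def summarize(timings, data_kb=DATA_KB):
    """Return (label, seconds, KB/s, speedup vs. the first entry) per row."""
    baseline = timings[0][1]
    return [(label, secs, data_kb / secs, baseline / secs)
            for label, secs in timings]


for label, secs, kb_per_s, speedup in summarize(TIMINGS):
    print(f"{label:34s} {secs:5d} s {kb_per_s:8.0f} KB/s {speedup:5.1f}x")
```

The first row comes out at about 80 KB/s, matching the figure above, and the rotational=1/noop row at a bit under 11x.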

I went further and placed one of the virtual disks into memory
on the host.  And even there, while speed increased dramatically,
rotational on SECONDARY (even if the disk on secondary is in
the host's memory) had about a 10x effect.  On the other hand,
placing the virtual disk on PRIMARY in memory changes very little.

On the host side, by stracing the kvm process it may be noticed that
with the default rotational=0, there's no write block merging going
on at all.  All writes on the host are done in 4096-byte-sized chunks
(block size of ext3fs).  But after setting rotational to 1 I see
writes being combined -- many 32Kb writes, 64Kb writes etc.

On the primary node there's almost no change in the write pattern on
the host: writes are being combined regardless of the `rotational' flag.
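
The write sizes can be tallied from an strace log instead of eyeballing it. Below is a rough sketch: it assumes the log comes from something like `strace -f` on the kvm process and that the writes show up as write/pwrite/pwrite64 lines ending in `= <bytes>`. The sample lines are invented for illustration, they are not from my actual traces.

```python
import re
from collections import Counter

# Matches completed write-like syscalls, including the "<... resumed>" half
# that strace prints when a call was interrupted by another thread's output.
WRITE_RE = re.compile(
    r"(?:\b(?:pwrite64|pwrite|write)\(|<\.\.\. (?:pwrite64|pwrite|write) resumed>)"
    r".*=\s*(\d+)\s*$"
)


def write_size_histogram(lines):
    """Count completed writes by the number of bytes they returned."""
    sizes = Counter()
    for line in lines:
        match = WRITE_RE.search(line)
        if match:
            sizes[int(match.group(1))] += 1
    return sizes


# Hypothetical sample, shaped like strace -f output.
SAMPLE = """\
[pid 4242] pwrite(10, "..."..., 4096, 1048576) = 4096
[pid 4243] pwrite(10, "..."..., 4096, 1052672) = 4096
[pid 4244] pwrite(10, "..."..., 4096, 1056768 <unfinished ...>
[pid 4244] <... pwrite resumed> )      = 4096
[pid 4245] pwrite(10, "..."..., 32768, 2097152) = 32768
[pid 4245] pwrite(10, "..."..., 65536, 2129920) = 65536
[pid 4245] read(10, "..."..., 4096) = 4096
""".splitlines()

for size, count in sorted(write_size_histogram(SAMPLE).items()):
    print(f"{size:6d} bytes: {count}")
```

With rotational=0 on the secondary one would expect practically everything to land in the 4096 bucket; with rotational=1 the larger buckets should start filling up.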

Another observation on the kvm side - when I/O activity starts, kvm
spawns many threads to perform the I/O.  This is, again, especially
visible on the secondary node.  It writes adjacent 4096-byte
blocks in different threads, about 10..20 threads at any given
time.  I *guess* it's an ncq/tcq-like thing implemented by kvm, but
it looks utterly inefficient, bearing in mind the sequential block
numbers.
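
To illustrate why this looks wasteful: adjacent 4096-byte writes, even when issued out of order by different threads, collapse into one large write once they are sorted and merged. The sketch below is only an illustration of that idea, not what kvm or the kernel elevator actually does, and it ignores overlapping writes.

```python
import random


def coalesce(writes):
    """Merge (offset, length) writes that are exactly adjacent on disk."""
    merged = []
    for offset, length in sorted(writes):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This write starts where the previous one ends: extend it.
            last_offset, last_length = merged[-1]
            merged[-1] = (last_offset, last_length + length)
        else:
            merged.append((offset, length))
    return merged


BLOCK = 4096
writes = [(i * BLOCK, BLOCK) for i in range(16)]
random.shuffle(writes)   # as if issued by many threads in arbitrary order
print(coalesce(writes))  # [(0, 65536)]
```

Sixteen separate 4Kb requests become a single 64Kb one, which is the kind of write I see on the host once rotational is set to 1.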

So finally some questions.

drbd: what's the difference in write pattern on secondary and
  primary nodes?  Why does the `rotational' flag make a very big
  difference on secondary and no difference whatsoever on primary?

kvm: what is this 'rotational' flag, what's its meaning here?
  I guess it should be controllable from the command line if
  all else fails.  In any case it's definitely something to keep
  in mind.  I guess a more easily triggerable workload will show
  something similar (without drbd).

kvm: i/o threads - should there be a way to control the number of
  threads?  With the default workload generated by drbd on the
  secondary node, having fewer threads makes more sense.

kvm: it has been said that using the noop elevator in the guest makes
  sense since the host does its own elevator/reordering.  But this
  example shows "nicely" that this isn't always the case.  I wonder
  how "general" this example is.  Will try to measure further.

Another "interesting" issue shown by drbd here: in the slow case,
the secondary is so busy with the writes that it misses drbd
pings and pongs, so that each node frequently considers its
peer dead and reconnects/resyncs.  This is not a pre-stone-age
machine and the disks are not the slowest (it's a 2.5GHz AMD Athlon
X2-64 CPU and a ~90Mb/sec sata disk).

I didn't try other combinations yet, including non-virtio models,
because all of the above took almost the whole day today, especially
finding the knobs :)

Comments?

Thanks!

/mjt