[DRBD-user] kvm, drbd, elevator, rotational - quite an interesting co-operation

Fri Jul 3 10:45:03 CEST 2009

On Thu, Jul 02, 2009 at 11:55:05PM +0400, Michael Tokarev wrote:
> Hello.
> 
> I'm new on drbd-user@, but a long-time user of kvm.
> 
> [A side note: drbd-user@ appears to be subscribers-only
>   list.  Sad consequence of spammers' activity... I doubt
>   many knowledgeable people on kvm@ are subscribed to
>   drbd-user@ to be able to post there.  Oh well.]

I usually let relevant non-subscriber posts through in near-realtime,
so no worries.

> Today I tried drbd for the first time, and decided to start
> experimenting on two virtual machines.
> 
> And immediately ran into interesting issues, described below.
> The way to the results wasn't easy, trust me ;)
> 
> Software used:
> 
>   DRBD version 8.3.1 (8.3.2rc1 spewed numerous assertion
>    failures and crashed both virtual and non-virtual machines).

did that improve?
we are about to release 8.3.2 now,
so feedback on those issues would have been nice.

>   kernel 2.6.29.5 compiled for x86 (i686smp)
>   lvm kernel module from 2.6.29
>   kvm userspace qemu-kvm-0.10.5.
> 
> Both virtual machines were identical, using virtio_net (bridged)
> and virtio_blk (on raw files on ext4 fs).
> 
> I first set drbd up, waited till it did the initial resync, then
> formatted (ext3) and mounted the filesystem on /dev/drbd1.
> Next I did a trivial test - copying /usr (tiny 140Mb) to
> the drbd filesystem.
> 
> This took........  29 *minutes* (!!!).  Which is about 80
> kilobytes per sec.  Reminds me of a good ol' dialup modem... ;)
> 
> At this point I tried many different things, including
> straceing the kvm process on the host.  The thing is that both
> virtio_net & virtio_blk work quite well; copying that same
> /usr between two virtual machines over nfs takes less
> than a minute.  And finally, after finding the key knobs,
> here are the results:
> 
> /sys/block/vda/queue/rotational and ../scheduler (the elevator)
> on the drbd *secondary* node are the knobs in question.
> 
> First of all, for some reason virtio_blk sets its 'rotational'
> attribute to 0.  Which was a surprise to me - both to discover
> this flag for virtio_blk (I understand it's present on all
> block devices) and to see its default value being set to 0.
> (A side note: the flag is probably misnamed, it was meant
> to show how costly disk seeks are: rotational disks vs
> flash media where seeks are not present at all).
> 
> So, with both primary and secondary virtual disks on real
> hdd and with default rotational=0 and (my default) elevator=noop,
> the copy takes 29 minutes.
> 
> By changing the two mentioned knobs on PRIMARY I was able to
> reduce that to 27 minutes.
> 
> Now, with rotational=1 and elevator=noop on SECONDARY the
> whole thing completed in 2m40s.  Which is about a 10x
> difference!
> 
> Next, with rotational=1 and elevator=cfq on SECONDARY, it
> completes in about 1m10s.
> 
> With rotational=1 and elevator=cfq on BOTH nodes it completes
> in 50 sec.
> 
> (for the record, elevator=cfq and rotational=0 on SECONDARY makes
> it complete in about 9 minutes)
> 
> I went further and placed one of the virtual disks into memory
> on host.  And even there, while speed increased dramatically,
> rotational on SECONDARY (even if the disk on secondary is in
> host's memory) had about a 10x effect.  On the other hand, placing
> the virtual disk on PRIMARY in memory changes very little.
> 
> On the host side, by straceing the kvm process one can see that
> with default rotational=0, there's no write block merging going
> on at all.  All writes on host are done in 4096-byte-sized chunks
> (block size of ext3fs).  But after setting rotational to 1 I see
> writes being combined -- many 32Kb writes, 64Kb writes etc.
> 
> On primary node there's almost no change in write pattern on the
> host, writes are being combined regardless of `rotational' flag.
> 
> Another observation on kvm side - when I/O activity starts, kvm
> spawns many threads to perform the I/O.  This is, again, especially
> visible on the secondary node.  It writes adjacent 4096-byte
> blocks in different threads, about 10..20 threads at any given
> time.  I *guess* it's an ncq/tcq-like thing implemented by kvm, but
> it looks utterly inefficient, bearing in mind the sequential block
> numbers.
> 
> So finally some questions.

Interesting findings.

What IO scheduler is used on the HOST?
did you try leaving the guests at their default settings
(rotational=0, elevator=noop)
and using the _deadline_ scheduler on the host?
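
To compare the settings on host and guests, a minimal Python sketch for inspecting the two knobs could look like the following. It assumes the usual sysfs layout, where the elevator is exposed as `queue/scheduler` and the active one is shown in square brackets; the function names are made up for illustration, and writing a knob requires root.

```python
from pathlib import Path


def parse_scheduler(text):
    """Return (active, available) from the contents of queue/scheduler.

    The kernel marks the active elevator with square brackets,
    e.g. "noop anticipatory deadline [cfq]".
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def queue_settings(dev, sysfs_root="/sys/block"):
    """Read the 'rotational' flag and the elevator of one block device."""
    queue = Path(sysfs_root) / dev / "queue"
    rotational = (queue / "rotational").read_text().strip()
    active, available = parse_scheduler((queue / "scheduler").read_text())
    return {"rotational": rotational, "scheduler": active, "available": available}


def set_queue_knob(dev, knob, value, sysfs_root="/sys/block"):
    """Write one queue knob, e.g. set_queue_knob("vda", "rotational", 1)."""
    (Path(sysfs_root) / dev / "queue" / knob).write_text(f"{value}\n")


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python queue_knobs.py vda sda
    for dev in sys.argv[1:]:
        print(dev, queue_settings(dev))
```

Running it with the same device names on the host and in both guests gives a quick side-by-side view of which combination is actually in effect for each test run.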

> drbd: what's the difference in write pattern on secondary and
>   primary nodes?  Why `rotational' flag makes very big difference
>   on secondary and no difference whatsoever on primary?

not much.
disk IO on Primary is usually submitted in the context of the
submitter (vm subsystem, filesystem or the process itself)
whereas on Secondary, IO is naturally submitted just by the
DRBD receiver and worker threads.
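
As an illustration of the write merging Michael observed in strace (4096-byte writes versus combined 32Kb/64Kb writes), here is a toy Python model of request coalescing. It is only a sketch of the idea, not how the kernel elevator or DRBD is implemented; `merge_adjacent` and the 64 KiB cap are assumptions chosen for the example.

```python
BLOCK = 4096  # ext3 block size seen in the strace output


def merge_adjacent(requests, max_bytes=64 * 1024):
    """Coalesce (offset, length) writes that are contiguous on disk.

    Requests are sorted by offset first; a request is merged into the
    previous one if it starts exactly where that one ends and the
    combined size stays within max_bytes.
    """
    merged = []
    for offset, length in sorted(requests):
        if merged:
            last_off, last_len = merged[-1]
            if last_off + last_len == offset and last_len + length <= max_bytes:
                merged[-1] = (last_off, last_len + length)
                continue
        merged.append((offset, length))
    return merged


if __name__ == "__main__":
    # 16 sequential 4 KiB writes collapse into a single 64 KiB write.
    sequential = [(i * BLOCK, BLOCK) for i in range(16)]
    print(merge_adjacent(sequential))
```

Without any merging, the same sixteen blocks reach the host as sixteen separate 4096-byte writes, which matches the slow pattern seen with rotational=0 on the secondary.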

> kvm: what is this 'rotational' flag, what's its meaning here?
>   I guess it should be controllable from the command line if
>   all else fails.  In any case it's definitely something to keep
>   in mind.  I guess a more easily triggerable workload will show
>   something similar (without drbd).
> 
> kvm: i/o threads - should there be a way to control the number of
>   threads?  With the default workload generated by drbd on the
>   secondary node, having fewer threads makes more sense.
> 
> kvm: it has been said that using noop elevator on guest makes sense
>   since host does its own elevator/reordering.  But this example
>   shows "nicely" that this isn't always the case.  I wonder how
>   "general" this example is.  Will try to measure further.
> 
> Another "interesting" issue shown by drbd here: in the slow case,
> the secondary is so busy with the writes that it misses drbd
> pings and pongs, so that the nodes frequently consider the other
> peer dead and reconnect/resync.

that should not happen, these "DRBD pings" are answered by the
"drbd_asender" thread ("a" for "acknowledgement"), which is not involved
in submitting/waiting for the IO, and should be mostly idle, if only few
requests are completed per unit time.

there are two timeouts here in DRBD: the timeout (default 6 seconds,
iirc), and ping-timeout (default 0.5 seconds, iirc).
is it possible that 500 ms (rtt!) is just too short for the kvm bridged
network stack, or that the drbd_asender is not scheduled in time?

> This is not a pre-stone-age machine, nor are the disks the slowest
> (it's 2.5GHz AMD Athlon X2-64 CPU and ~90Mb/sec sata disk).
> 
> I didn't try other combinations yet, including non-virtio model.
> Because all the above took almost the whole day today, especially
> finding the knobs :)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed