Hello.

I'm new on drbd-user@, but a long-time user of kvm.

[A side note: drbd-user@ appears to be a subscribers-only list. A sad consequence of spammers' activity... I doubt many knowledgeable people on kvm@ are subscribed to drbd-user@ to be able to post there. Oh well.]

Today I tried drbd for the first time, and decided to start experimenting on two virtual machines. And immediately ran into interesting issues, described below. The way to the results wasn't easy, trust me ;)

Software used:

 DRBD version 8.3.1 (8.3.2rc1 spewed numerous assertion failures and crashed both virtual and non-virtual machines)
 kernel 2.6.29.5 compiled for x86 (i686smp)
 lvm kernel module from 2.6.29
 kvm userspace qemu-kvm-0.10.5

Both virtual machines were identical, using virtio_net (bridged) and virtio_blk (on raw files on an ext4 fs).

I first set drbd up, waited until it finished the initial resync, then formatted (ext3) and mounted the filesystem on /dev/drbd1.

Next I did a trivial test - copying /usr (a tiny 140MB) to the drbd filesystem. This took........ 29 *minutes* (!!!). Which is about 80 kilobytes per sec. Reminds me of the good old dialup modem... ;)

At this point I tried many different things, including straceing the kvm process on the host. The thing is that both virtio_net and virtio_blk work quite well: copying that same /usr between the two virtual machines over nfs takes less than a minute.

And finally, after finding the key knobs, here are the results:

/sys/block/vda/queue/rotational and ../scheduler (the elevator) on the drbd *secondary* node are the knobs in question.

First of all, for some reason virtio_blk sets its 'rotational' attribute to 0. Which was a surprise to me - both to discover this flag for virtio_blk (I understand it's present on all block devices) and to see its default value being set to 0. (A side note: the flag is probably misnamed; it was meant to show how costly disk seeks are: rotational disks vs flash media where seeks are not present at all.)
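For anyone who wants to check the same two knobs in a guest, here is a minimal Python sketch. It assumes the usual sysfs layout, where the elevator is exposed as queue/scheduler with the active one shown in square brackets. The function names and the sysfs_root parameter are made up for illustration (the parameter only exists so the helpers can be tried against a scratch directory), and writing to the real files needs root.

```python
from pathlib import Path


def active_scheduler(line):
    """Return the bracketed entry from a line like 'noop deadline [cfq]'."""
    for word in line.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    # No brackets found: return the stripped line unchanged.
    return line.strip()


def read_knobs(device, sysfs_root="/sys/block"):
    """Read the rotational flag and active elevator of a block device."""
    queue = Path(sysfs_root) / device / "queue"
    rotational = int((queue / "rotational").read_text().strip())
    scheduler = active_scheduler((queue / "scheduler").read_text())
    return {"rotational": rotational, "scheduler": scheduler}


def write_knobs(device, rotational, scheduler, sysfs_root="/sys/block"):
    """Set both knobs (requires root when pointed at the real sysfs)."""
    queue = Path(sysfs_root) / device / "queue"
    (queue / "rotational").write_text(f"{int(rotational)}\n")
    (queue / "scheduler").write_text(f"{scheduler}\n")


# Example (as root, inside the guest):
#   print(read_knobs("vda"))
#   write_knobs("vda", rotational=1, scheduler="cfq")
```

Settings made this way do not survive a reboot, so for repeated tests they would have to be reapplied on each boot.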
So, with both primary and secondary virtual disks on a real hdd and with the default rotational=0 and (my default) elevator=noop, the copy takes 29 minutes.

By changing the two mentioned knobs on PRIMARY I was able to reduce that to 27 minutes.

Now, with rotational=1 and elevator=noop on SECONDARY the whole thing completed in 2m40s. That is about a 10x difference!

Next, with rotational=1 and elevator=cfq on SECONDARY, it completes in about 1m10s.

With rotational=1 and elevator=cfq on BOTH nodes it completes in 50 sec.

(For the record, elevator=cfq and rotational=0 on SECONDARY makes it complete in about 9 minutes.)

I went further and placed one of the virtual disks into memory on the host. And even there, while speed increased dramatically, rotational on SECONDARY (even if the disk on the secondary is in the host's memory) had about a 10x effect. On the other hand, placing the PRIMARY's virtual disk in memory changes very little.

On the host side, by straceing the kvm process one can see that with the default rotational=0 there's no write block merging going on at all. All writes on the host are done in 4096-byte-sized chunks (the block size of ext3fs). But after setting rotational to 1 I see writes being combined -- many 32KB writes, 64KB writes etc.

On the primary node there's almost no change in the write pattern on the host; writes are being combined regardless of the `rotational' flag.

Another observation on the kvm side - when I/O activity starts, kvm spawns many threads to perform the I/O. This is, again, especially visible on the secondary node. It writes adjacent 4096-byte blocks in different threads, about 10..20 threads at any given time. I *guess* it's an ncq/tcq-like thing implemented by kvm, but it looks utterly inefficient, bearing in mind the sequential block numbers.

So finally some questions.

drbd: what's the difference in write pattern on secondary and primary nodes? Why does the `rotational' flag make a very big difference on the secondary and no difference whatsoever on the primary?
kvm: what is this 'rotational' flag, what's its meaning here? I guess it should be controllable from the command line if all else fails. In any case it's definitely something to keep in mind. I guess a more easily triggerable workload will show something similar (without drbd).

kvm: i/o threads - should there be a way to control the number of threads? With the default workload generated by drbd on the secondary node, having fewer threads makes more sense.

kvm: it has been said that using the noop elevator in the guest makes sense since the host does its own elevator/reordering. But this example shows "nicely" that this isn't always the case. I wonder how "general" this example is. Will try to measure further.

Another "interesting" issue shown by drbd here: in the slow case, the secondary is so busy with the writes that it misses drbd pings and pongs, so the nodes frequently consider the other peer dead and reconnect/resync. This is not a pre-stone-age machine and not the slowest of disks (it's a 2.5GHz AMD Athlon X2-64 CPU and a ~90MB/sec sata disk).

I didn't try other combinations yet, including the non-virtio model, because all of the above took almost the whole day today, especially finding the knobs :)

Comments?

Thanks!

/mjt
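To put the timings quoted above side by side, here is a small Python sketch that turns them into throughput and speedup relative to the default case. The figures are taken from the text (several are stated there as approximate), 140MB is treated as 140 * 1024 KB, and the labels and helper names are only illustrative.

```python
SIZE_MB = 140  # size of the /usr copy, as stated in the text

# (configuration, elapsed seconds); the first entry is the baseline.
RUNS = [
    ("rotational=0, noop (default)", 29 * 60),
    ("knobs changed on PRIMARY only", 27 * 60),
    ("rotational=1, noop on SECONDARY", 2 * 60 + 40),
    ("rotational=1, cfq on SECONDARY", 1 * 60 + 10),
    ("rotational=1, cfq on BOTH", 50),
    ("rotational=0, cfq on SECONDARY", 9 * 60),
]


def throughput_kb_s(size_mb, seconds):
    """Average throughput in KB/s for copying size_mb in the given time."""
    return size_mb * 1024 / seconds


def summarize(runs, size_mb=SIZE_MB):
    """Return (label, seconds, KB/s, speedup vs. the first run) tuples."""
    baseline = runs[0][1]
    return [
        (label, secs, throughput_kb_s(size_mb, secs), baseline / secs)
        for label, secs in runs
    ]


for label, secs, kb_s, speedup in summarize(RUNS):
    print(f"{label:34s} {secs:5d} s {kb_s:8.0f} KB/s {speedup:6.1f}x")
```

The baseline comes out at roughly 82 KB/s, in line with the "about 80 kilobytes per sec" estimate, and the rotational=1/noop case on SECONDARY at a bit under 11x the baseline.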