It seems I MUST be doing something absolutely idiotic, but I'm darned if I
know what.
I've been running some very crude performance tests of DRBD on some of our
CentOS 5.1 kernels, since I got a couple of rude shocks...
The short of the story is that the 8.0 and 8.2 modules are SLOW, and I was
getting crashes that seemed to implicate the 0.7.x module on stock CentOS
kernels (so I can't use that).
There clearly seems to be something amiss, as my tests seem to
show 8.x as being slower in standalone mode than it is when
connected?!
gt7 -> gt2 testing... (i386 -> x86_64, CentOS kernels on both ends);
the primary/source node with the mounted filesystem is a current CentOS 5.1
box...
The command I was using for a rough estimate of write throughput
was:
sync; time sh -c "dd if=/dev/zero of=BIG bs=4096 count=128000; sync"
and then recording the "real" time (i.e. wall clock). Basically I'm
timing how long it takes to dump almost a gig of data to disk...
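A minimal sketch of that measurement as a script, computing the rate instead of eyeballing the "real" time. The scratch file and the much smaller write size here are stand-ins for the ~500 MiB "BIG" file used above:

```shell
#!/bin/sh
# Rough write-throughput check, mirroring the dd+sync timing above.
# Small size so it finishes fast; the real test used bs=4096 count=128000.
OUT=$(mktemp)                 # scratch file standing in for "BIG"
BS=4096
COUNT=4096                    # 4096 * 4096 bytes = 16 MiB
sync
START=$(date +%s)
dd if=/dev/zero of="$OUT" bs="$BS" count="$COUNT" 2>/dev/null
sync
END=$(date +%s)
ELAPSED=$((END - START))
[ "$ELAPSED" -lt 1 ] && ELAPSED=1   # whole-second clock; avoid divide-by-zero
MIB=$((BS * COUNT / 1048576))
echo "wrote ${MIB} MiB in ${ELAPSED}s (~$((MIB / ELAPSED)) MiB/s)"
rm -f "$OUT"
```

The trailing sync inside the timed region matters: without it, dd returns as soon as the data is in the page cache and the number measures almost nothing.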
The backing storage on both ends is a raid-0 stripe, which should
be capable of handling something close to 120 MByte/s, and with
0.7.25 in standalone we do get approx 100 MByte/s, and 68 MByte/s
over the wire... which is fine, but a home-compiled 0.7 was implicated
in kernel crashes (CentOS stock xen kernel) after only a few hours
running... and the 8.x modules seem to cause a performance problem...
Here are the rough numbers for the various DRBD versions and network
protocols, all for dumping a gig to the target (through an ext3
filesystem):
drbd version: 8.2.5
   A        B        C        standalone
   27.034   26.360   27.981   32.110
   27.106   27.504   27.486   30.797
   28.251   26.750   27.648   30.974
   26.049
   26.772
drbd version: 0.7.25
   A        B        C        standalone
   15.426   15.545   15.697   10.872
   15.649   15.511   15.055   10.681
   15.735   15.334   15.153   10.157
   15.407
drbd version: 8.0.12
   A        B        C        standalone
   28.489   27.956   26.684   30.695
   28.166   27.382   27.465   30.480
   27.105                     29.861
OK, let's try replacing the local (raid0) LVM with a raw disk:
8.0.12
   A        B   C   standalone   BareMetal
   40.784           45.999       18.106
   41.078           46.058       17.052
                                 17.030
So it doesn't seem to be some strange interaction with LVM, and
the node used for testing in this case is NOT the Xen-enabled one
that I originally had the trouble with...
Anyone have any ideas? Why would standalone be worse than
connected? I used the exact same config files for the various
modules, just changing the protocol versions, but that didn't
make much difference...
Obviously there are tweaks that can be done (moving the metadata
around, etc.) to speed things up, but why would the 8.x versions
be SO much slower than the 0.7.x versions?
[root@gt7 mnt]# uname -a
Linux gt7.baremetal.com 2.6.18-53.1.19.el5 #1 SMP Wed May 7
08:20:19 EDT 2008 i686 i686 i386 GNU/Linux
[root@gt7 mnt]# more /etc/drbd.conf
global {
    minor-count 5;
}
resource drbd-test {
    protocol C;
    on gt7.baremetal.com {
        device    /dev/drbd0;
        #disk     /dev/vg1/drbd-test;
        disk      /dev/sdc;
        address   192.168.90.17:7790;
        meta-disk internal;
    }
    on gt2.baremetal.com {
        device    /dev/drbd0;
        disk      /dev/vg1/drbd-test;
        address   192.168.90.15:7790;
        meta-disk internal;
    }
    disk {
        on-io-error detach;
    }
    syncer {
        rate       40M;
        al-extents 257;
    }
    startup {
        degr-wfc-timeout 120;
    }
}
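For reference, the "moving the metadata around" tweak mentioned above could look roughly like this in the config; the separate metadata device (/dev/sdb1) and index 0 below are purely hypothetical placeholders:

```
on gt7.baremetal.com {
    device    /dev/drbd0;
    disk      /dev/sdc;
    address   192.168.90.17:7790;
    # external metadata on its own device instead of "internal";
    # /dev/sdb1 and index 0 are placeholders
    meta-disk /dev/sdb1[0];
}
```

With internal metadata, metadata updates land at the end of the same backing device as the data, which can add seeks on every activity-log update; putting the metadata on a separate spindle is one of the usual first tuning steps.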
Help/thoughts? If I hadn't had very good experiences with 0.7.x
I'd just throw up my hands and walk away, but I've been running
0.7.23 on at least one node for ages and it's been great.
Oh, this is sort of funny: I got suspicious that my transfer
speeds were close to that sync rate, so I cranked the sync rate
up to 100M. It didn't change the filesystem write performance,
but when I disconnected and then reconnected the secondary and
primary, I was getting approximately 90 MByte/s being written to
the disk of the secondary... which reminds me, these nodes are
connected via a Gigabit ethernet switch (Cisco WS-C2960G-24TC-L).
Hmm, doesn't seem to be an ext3 issue either...
[root@gt7 /]# sync; time sh -c "dd if=/dev/zero of=/dev/drbd0
bs=4096 count=256000; sync"
real 0m25.849s
(8.0.12 protocol C, so ext3 would have been approx 27 seconds; still
nowhere near as fast as the 15 seconds of 0.7.25)
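Back-of-the-envelope arithmetic for these raw-device runs (rounding the times to whole seconds, so the MiB/s figures are approximate):

```shell
# 256000 blocks of 4096 bytes = 1,048,576,000 bytes, i.e. 1000 MiB.
BYTES=$((4096 * 256000))
MIB=$((BYTES / 1048576))
echo "total: ${MIB} MiB"
echo "first run (~26s):  $((MIB / 26)) MiB/s"   # ~38 MiB/s
echo "later runs (~14s): $((MIB / 14)) MiB/s"   # ~71 MiB/s
```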
Oh, except that if you run it twice, you get radically different
results...
real 0m13.537s
real 0m13.721s
...
-Tom