Hi,

>> [root@nfs01 nfs]# cat /proc/drbd
>> version: 8.3.8 (api:88/proto:86-94)
>
> Really do an upgrade! ... elrepo seems to have latest DRBD 8.3.12 packages

Thanks for the hint, we might consider that if nothing else helps :-)
Not that we don't want the newer version. It's the unofficial repository
that is the problem here. We are quite hesitant about unofficial repos,
because that system hosts hundreds of customers.

>> Why these resyncs happen, and why so much data is being resynced, is
>> another case. The nodes were disconnected for 3-4 minutes, which does
>> not justify so much data. Anyways...
>
> If you adjust your resource after changing a disk option the disk is
> detached/attached ... this means syncing the complete AL when done on a
> primary ... 3833*4MB=15332MB

Great! Thanks for the insight. I'm really learning some stuff about DRBD here!

>> After issuing the mentioned dd command
>>
>> $ dd if=/dev/zero of=./test-data.dd bs=4096 count=10240
>> 10240+0 records in
>> 10240+0 records out
>> 41943040 bytes (42 MB) copied, 0.11743 seconds, 357 MB/s
>
> you benchmark your page cache here ... add oflag=direct to dd to bypass it

Now this makes me shiver and laugh at the same time (shortened the output):

####
[root@nfs01 nfs]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240
41943040 bytes (42 MB) copied, 24.7257 seconds, 1.7 MB/s
[root@nfs01 nfs]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
41943040 bytes (42 MB) copied, 25.9601 seconds, 1.6 MB/s
[root@nfs01 nfs]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
41943040 bytes (42 MB) copied, 44.4078 seconds, 944 kB/s
[root@nfs01 nfs]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
30384128 bytes (42 MB) copied, 26.9182 seconds, 1.3 MB/s
####

The load rises a little while doing this (to about 3-4), but the system
remains usable.
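The buffered-vs-direct comparison above can be reproduced with a small sketch like the following (the file path is a placeholder; run it on the filesystem you want to measure):

```shell
#!/bin/sh
# Placeholder path -- point it at the mount under test.
TESTFILE=./test-data.dd

# Buffered write: dd returns once the data is in the page cache,
# so this mostly measures RAM, not the disk or DRBD.
dd if=/dev/zero of="$TESTFILE" bs=4096 count=10240 2>&1

# Direct write: oflag=direct (O_DIRECT) bypasses the page cache, so
# every 4 KiB block must reach the block device -- and, on DRBD, be
# acknowledged per the replication protocol -- before dd continues.
dd if=/dev/zero of="$TESTFILE" bs=4096 count=10240 oflag=direct 2>&1 ||
  echo "direct I/O not supported on this filesystem"

rm -f "$TESTFILE"
```

With such a small block size, the direct run exposes per-request latency (disk seeks plus DRBD round trips), which is why it can be orders of magnitude slower than the buffered run even on healthy hardware.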
> looks like I/O system or network is fully saturated

It seems more like some sort of DRBD cache setting is broken somewhere.
On an LVM volume without DRBD, dd works fine (I shortened the output):

####
[root@nfs01 mnt]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
41943040 bytes (42 MB) copied, 0.738741 seconds, 56.8 MB/s
[root@nfs01 mnt]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
41943040 bytes (42 MB) copied, 0.746778 seconds, 56.2 MB/s
[root@nfs01 mnt]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
41943040 bytes (42 MB) copied, 0.733518 seconds, 57.2 MB/s
[root@nfs01 mnt]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
41943040 bytes (42 MB) copied, 0.736617 seconds, 56.9 MB/s
[root@nfs01 mnt]# dd if=/dev/zero of=./test-data.dd bs=4096 count=10240 oflag=direct
41943040 bytes (42 MB) copied, 0.73078 seconds, 57.4 MB/s
####

The network link is also just fine. We've tested it at almost 100 MB/s
(that is megabytes) of throughput. The only possible limit here would be
the syncer rate of 25 MB/s, but the network link is only saturated during
a resync.

Any more ideas with this info?

best regards
volker
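For reference, the 25 MB/s ceiling mentioned above comes from the syncer section of drbd.conf. A sketch in DRBD 8.3 syntax (the resource name is a placeholder, and the al-extents value is inferred from the 3833*4MB figure quoted earlier):

####
resource r0 {            # placeholder resource name
  syncer {
    rate 25M;            # caps background resync bandwidth only --
                         # it does not throttle normal replication I/O
    al-extents 3833;     # implied by the 3833 * 4 MB = 15332 MB figure
                         # above; the whole AL resyncs after detach/attach
  }
}
####

Note that the syncer rate should only matter during a resync, which matches the observation that the link is saturated only then; it does not explain slow direct writes during normal operation.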