[DRBD-user] Very poor performance

Wed Aug 29 08:48:09 CEST 2012

On 24/08/12 06:09, Arnold Krille wrote:
> Hi,
>
> On Friday 24 August 2012 01:56:37 Adam Goryachev wrote:
>> I have a pair of DRBD machines, and I'm getting very poor performance
>> from the DRBD when the second server is connected. I've been working on
>> resolving the performance issues for a few months, with not much luck.
>> Here is some info on the current configuration:
>>
>> Both machines are identical (apart from drives):
>> CPU Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
>> RAM 8G
>>
>> One machine (the primary) has Intel 480G SSD drives in RAID5
>> The second machine (backup) has 2 x WD Caviar Black 2TB HDD (in RAID1,
>> limited to 960GB)
>>
>> I am using all software from Debian Stable with all updates/security
>> updates installed:
>> drbd8-utils 2:8.3.7-2.1
> I made good experience with compiling latest 8.3 from git and creating debs. 
> Less bugs, more performance on debian stable.
I'm not comfortable doing this at the moment, but hopefully the next
Debian stable release will be out before the end of the year, and I will
be able to upgrade then. Worst case would be to upgrade just DRBD from
testing, but again, I would prefer not to unless there is a really
clear reason.
>> Linux san1 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64
>> GNU/Linux
>>
>> My config file for DRBD is:
>> resource storage2 {
>>     protocol A;
>>     device /dev/drbd2 minor 2;
>>     disk /dev/md1;
>>     meta-disk internal;
>>     on san1 {
>>         address 172.17.56.1:7802;
>>     }
>>     on san2 {
>>         address 172.17.56.2:7802;
>>     }
>>
>>     net {
>>         after-sb-0pri discard-younger-primary;
>>         after-sb-1pri discard-secondary;
>>         after-sb-2pri call-pri-lost-after-sb;
>>         max-buffers 8000;
>>         max-epoch-size 8000;
>>         unplug-watermark 4096;
>>         sndbuf-size 512k;
>>     }
>>     startup {
>>         wfc-timeout 10;
>>         degr-wfc-timeout 20;
>>     }
>>     syncer {
>>         rate 100M;
>>     }
>> }
>>
>> root@san1:/etc/drbd.d# cat /proc/drbd
>> version: 8.3.7 (api:88/proto:86-91)
>> srcversion: EE47D8BF18AC166BE219757
>>
>>  2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r----
>>     ns:30035906 nr:0 dw:29914667 dr:42547522 al:49271 bm:380 lo:0 pe:0
>> ua:0 ap:0 ep:1 wo:b oos:0
>>
>> root@san1:/etc/drbd.d# dd if=/dev/zero of=/dev/mapper/vg0-testdisk
>> oflag=direct bs=1M count=100
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB) copied, 2.48851 s, 42.1 MB/s
> With reads of 4MB or 8MB blocks you might get better values.
Yes, slightly, but the gap between the two cases (DRBD connected vs.
disconnected) stays the same.
> But beware: continuous reading is a very bad indicator for performance unless 
> you do video-playback only. Better test with a real filesystem and a real 
> benchmark, I am using dbench with great success as it gives the typical usage 
> patterns of business computers.
In my use case, it is actually a good performance indicator for one
specific task that is done, although general usage is indeed much more
random read/write.
>> However, if I stop DRBD on the secondary:
>> Then I get good performance:
>> Can anyone suggest how I might improve performance while the secondary
>> is connected? In the worst case scenario, I would expect a write speed
>> to these drives between 100 and 150M/s. I can't test writing at the
>> moment, since they are used by DRBD, but read performance:
>> root@san2:/etc/drbd.d# dd if=/dev/md1 of=/dev/null oflag=dsync bs=1M
>> count=100
>> 100+0 records in
>> 100+0 records out
>> 104857600 bytes (105 MB) copied, 0.706124 s, 148 MB/s
>> I am using a single 1G crossover ethernet for DRBD sync
>> Should I increase the connection between the servers for sync to 2 x 1G
>> which would exceed max write speed of the disks on san2?
> First: get a dual-link, that made my disk be the limiting factor.
> Second: Use external metadata at least on the hdds and put the meta-data on a 
> different disk. That made my dual-gigabit-link be the limiting factor again...
I haven't done this yet, but I really, really don't think this
is my limiting factor. I've spent some time collecting stats from
/proc/drbd and plotting those on a graph.
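
For anyone wanting to collect similar stats, here is a minimal sketch of
the idea, not the exact script used for the graphs. It pulls the numeric
counters out of a /proc/drbd status block and turns two samples into
per-second deltas. The function names and the 10-second interval are
made up for illustration, and it assumes a single resource; with several
resources the counters would overwrite each other.

```python
import re

# Sample text copied from the /proc/drbd output quoted earlier in this mail
# (the wrapped counter line is joined back into one line).
SAMPLE = """\
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r----
    ns:30035906 nr:0 dw:29914667 dr:42547522 al:49271 bm:380 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""

def parse_counters(text):
    """Return {name: int} for every numeric name:value pair in the text.

    Non-numeric fields such as cs:Connected or wo:b are skipped.
    """
    return {k: int(v) for k, v in re.findall(r"\b([a-z]+):(\d+)\b", text)}

def per_second(prev, cur, seconds):
    """Per-second change of each counter present in both samples."""
    return {k: (cur[k] - prev[k]) / seconds for k in cur if k in prev}

if __name__ == "__main__":
    first = parse_counters(SAMPLE)
    # Hypothetical second sample taken 10 seconds later, for illustration only.
    second = dict(first, ns=first["ns"] + 20000)
    print(per_second(first, second, 10)["ns"])
```

Note that ns, nr, dw and dr are cumulative, so only their deltas are
meaningful on a graph, while lo, pe, ua and ap are instantaneous values
and can be plotted directly.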

I'm seeing high numbers for the "Local Count" (lo) on the secondary,
peaking at up to 8000 (and staying at around 8000 for more than 15
minutes). However, NS on the primary is between 1k and 5k, which I think
is much less than the single gigabit connection would allow.
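
As a rough sanity check of that claim, the sketch below compares those
NS figures against the raw line rate of one gigabit link. It assumes
the plotted NS value is the per-second increase of the ns counter, which
/proc/drbd reports in KiB. The 1 Gbit/s figure ignores Ethernet and TCP
overhead, so it is an upper bound. The function name is made up for
illustration.

```python
# Raw line rate of a 1 Gbit/s link in bytes per second (no protocol overhead).
GIGABIT_BYTES_PER_S = 1_000_000_000 / 8

def link_utilisation(ns_kib_per_s, link_bytes_per_s=GIGABIT_BYTES_PER_S):
    """Fraction of the link used by a given ns rate (assumed to be KiB/s)."""
    return ns_kib_per_s * 1024 / link_bytes_per_s

if __name__ == "__main__":
    for rate in (1000, 5000):
        print(f"{rate} KiB/s -> {link_utilisation(rate):.1%} of 1 Gbit/s")
```

Under those assumptions 1k and 5k work out to about 0.8% and 4.1% of the
link respectively.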

Please see the two graphs:
http://www.websitemanagers.com.au/support/san1.png
http://www.websitemanagers.com.au/support/san2.png

I haven't graphed oos because this value becomes very large during the
day while the secondary DRBD is disconnected. Of particular interest is
the period shown on the san2 graph, which is the period during which the
secondary is connected to the primary.

What I need to know is: do I need to make the secondary SAN quicker to
allow everything to work faster?
Are there settings that can be adjusted to allow the secondary to
get further behind, so that a burst of activity will be completed
faster and the secondary will catch up afterwards?

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au