Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,
On 23/08/11 00:25, Aj Mirani wrote:
> Hi Cédric,
>
> I was able to replicate your results in my environment. The large block 'dd' test saw the biggest improvement in transfer rate when I dropped the ib_sdp module's recv_poll to 100usec.
>
> The small block 'dd' test saw no significant change from modifying recv_poll, but did see a marginal improvement from increasing sndbuf-size.
Thanks for that feedback!
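
For anyone finding this thread in the archives, the two knobs
discussed above are typically set along these lines (just a sketch,
not copied from either of our setups; the sysfs write only works if
ib_sdp exports 'recv_poll' as a writable parameter, otherwise it has
to be passed at module load time, and the modprobe.d file name is
arbitrary):

# echo "options ib_sdp recv_poll=100" > /etc/modprobe.d/ib_sdp.conf
# echo 100 > /sys/module/ib_sdp/parameters/recv_poll

The DRBD buffer goes in the resource's net section and can be applied
with 'drbdadm adjust <resource>':

net {
    sndbuf-size 10240k;
}
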
> I would agree with you that it seems SDP doesn't like smaller blocks (and it might just be my ignorance as to how the SDP protocol works). I might go digging through the RFC if I can't sort it out, because for our application we do almost nothing but small writes.
>
> Although if that's the case, I wonder why Netpipe can get better performance over SDP than IP. (See the Netpipe test results at the bottom.)
Since I'm completely out of my depth here, I can only offer my
intuition, which is that all those results point to some flaw in
DRBD's SDP implementation. I remember looking at the patch that
brought SDP support to DRBD, and it is very simple (just a matter of
instantiating an SDP socket instead of a TCP one, IIRC). Maybe, given
DRBD's traffic patterns, some further "intelligence" would be needed.
Again, I might be totally wrong. Let's hope someone knowledgeable
stumbles on our messages and sheds some light on the matter.
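
To put rough numbers on that intuition (back-of-the-envelope only,
using figures already quoted in this thread): the small-block 'dd'
comes out to about 0.023 s / 100 = ~230 usec per synchronous 4k write
over IPoIB, versus 19.6 s / 100 = ~196 msec per write over SDP, while
Netpipe puts the small-message round trip at ~27 usec over IPoIB and
~5 usec over SDP. Each replicated 4k write over SDP therefore costs
tens of thousands of times the raw SDP latency, and ~196 msec per
request looks more like something waiting on a timer or polling
interval than like protocol overhead, which is what makes me point
the finger at how DRBD drives the SDP socket rather than at SDP
itself.
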
Cheers,
Cédric
> Here are my results after making the same changes you did:
>
> =============================================================================
> SDP - after sndbuf-size=10240k;
> Large dd test slightly better but on avg more or less the same
> Small dd test slightly better
> =============================================================================
> IP - after sndbuf-size=10240k;
> Large dd test slightly better but on avg more or less the same
> Small dd test slightly better
> =============================================================================
> SDP - after sndbuf-size=10240k and ib_sdp recv_poll 100;
> Large dd test significant improvement!
> Small dd test no change
>
> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> 4+0 records in
> 4+0 records out
> 2147483648 bytes (2.1 GB) copied, 3.28283 s, 654 MB/s <- excellent! faster than IP.
>
> Here are the previous SDP results if you recall:
> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> 4+0 records in
> 4+0 records out
> 2147483648 bytes (2.1 GB) copied, 12.507 s, 172 MB/s
>
> And previous IP results:
> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> 4+0 records in
> 4+0 records out
> 2147483648 bytes (2.1 GB) copied, 5.1764 s, 415 MB/s
> =============================================================================
>
>
>
>
> Here is why I expected the small block dd SDP test to outperform IP:
> =============================================================================
> The throughput of IP over Infiniband was tested using Netpipe:
>
> nodeA# NPtcp
> nodeB# NPtcp -h 10.0.99.108
>
> Send and receive buffers are 16384 and 87380 bytes
> (A bug in Linux doubles the requested buffer sizes)
> Now starting the main loop
> 0: 1 bytes 2912 times --> 0.28 Mbps in 27.64 usec
> 1: 2 bytes 3617 times --> 0.57 Mbps in 26.77 usec
> 2: 3 bytes 3735 times --> 0.83 Mbps in 27.65 usec
> 3: 4 bytes 2411 times --> 1.10 Mbps in 27.71 usec
> 4: 6 bytes 2706 times --> 1.70 Mbps in 26.85 usec
> 5: 8 bytes 1862 times --> 2.27 Mbps in 26.92 usec
> .
> .
> .
> 117: 4194307 bytes 6 times --> 4446.10 Mbps in 7197.32 usec
> 118: 6291453 bytes 6 times --> 5068.46 Mbps in 9470.32 usec
> 119: 6291456 bytes 7 times --> 4873.45 Mbps in 9849.29 usec
> 120: 6291459 bytes 6 times --> 4454.66 Mbps in 10775.25 usec
> 121: 8388605 bytes 3 times --> 4651.95 Mbps in 13757.67 usec
> 122: 8388608 bytes 3 times --> 4816.20 Mbps in 13288.50 usec
> 123: 8388611 bytes 3 times --> 4977.90 Mbps in 12856.84 usec
>
> =============================================================================
> The throughput of SDP over Infiniband was tested using Netpipe:
>
> nodeA# LD_PRELOAD=libsdp.so NPtcp
> nodeB# LD_PRELOAD=libsdp.so NPtcp -h 10.0.99.108
>
> Send and receive buffers are 126976 and 126976 bytes
> (A bug in Linux doubles the requested buffer sizes)
> Now starting the main loop
> 0: 1 bytes 17604 times --> 1.54 Mbps in 4.95 usec
> 1: 2 bytes 20215 times --> 3.11 Mbps in 4.91 usec
> 2: 3 bytes 20380 times --> 4.67 Mbps in 4.90 usec
> 3: 4 bytes 13608 times --> 6.15 Mbps in 4.96 usec
> 4: 6 bytes 15116 times --> 9.25 Mbps in 4.95 usec
> 5: 8 bytes 10100 times --> 12.38 Mbps in 4.93 usec
> .
> .
> .
> 117: 4194307 bytes 15 times --> 9846.87 Mbps in 3249.76 usec
> 118: 6291453 bytes 15 times --> 9745.60 Mbps in 4925.30 usec
> 119: 6291456 bytes 13 times --> 9719.69 Mbps in 4938.43 usec
> 120: 6291459 bytes 13 times --> 9721.99 Mbps in 4937.26 usec
> 121: 8388605 bytes 6 times --> 9714.01 Mbps in 6588.42 usec
> 122: 8388608 bytes 7 times --> 9713.48 Mbps in 6588.78 usec
> 123: 8388611 bytes 7 times --> 9731.95 Mbps in 6576.28 usec
>
> =============================================================================
>
>
>
>
>
> -aj
>
>
>
>
> On Mon, Aug 22, 2011 at 09:28:27AM +0200, Cédric Dufour - Idiap Research Institute wrote:
>> Hello,
>>
>> Have you seen my post on (quite) the same subject:
>> http://lists.linbit.com/pipermail/drbd-user/2011-July/016598.html ?
>>
>> Based on your experiments and mine, it would seem that SDP does not like
>> "transferring small bits of data" (not being a TCP/SDP guru, I don't
>> know how to put it more appropriately). This would somehow correlate
>> with my finding of needing to increase the 'sndbuf-size' as much as
>> possible. And this also correlates with the fact that the initial sync
>> or "dd" tests with a large block size actually use SDP very efficiently,
>> while operations involving smaller "data bits" don't.
>>
>> I'm curious whether playing with the 'sndbuf-size' and ib_sdp's
>> 'recv_poll' parameters would affect your setup the same way it did mine.
>>
>> Cheers,
>>
>> Cédric
>>
>> On 19/08/11 21:45, Aj Mirani wrote:
>>> I'm currently testing DRBD over Infiniband/SDP vs Infiniband/IP.
>>>
>>> My configuration is as follows:
>>> DRBD 8.3.11 (Protocol C)
>>> Linux kernel 2.6.39
>>> OFED 1.5.4
>>> Infiniband: Mellanox Technologies MT26428
>>>
>>> My baseline test was to attempt a resync of the secondary node using Infiniband over IP. I noted the sync rate. Once complete, I performed some other very rudimentary tests using 'dd' and 'mkfs' to get a sense of actual performance. Then I shut down DRBD on both primary and secondary, modified the config to use SDP, and started it back up to re-try all of the tests.
>>>
>>> original:
>>> address 10.0.99.108:7790 ;
>>> to use SDP:
>>> address sdp 10.0.99.108:7790 ;
>>>
>>> No other config changes were made.
>>>
>>> After this, I issued "drbdadm invalidate-remote all" on the primary to force a re-sync. I noted my sync rate almost doubled, which was excellent.
>>>
>>> Once the sync was complete, I re-attempted my other tests. Amazingly, every test using Infiniband over SDP performed significantly worse than Infiniband over IP.
>>>
>>> Is there anything that can explain this?
>>>
>>>
>>> Here are my actual tests/results for each config:
>>> =============================================================================
>>> Infiniband over IP
>>> =============================================================================
>>> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
>>> 4+0 records in
>>> 4+0 records out
>>> 2147483648 bytes (2.1 GB) copied, 5.1764 s, 415 MB/s
>>>
>>> # dd if=/dev/zero of=/dev/drbd0 bs=4k count=100 oflag=direct
>>> 100+0 records in
>>> 100+0 records out
>>> 409600 bytes (410 kB) copied, 0.0232504 s, 17.6 MB/s
>>>
>>> # time mkfs.ext4 /dev/drbd0
>>> real 3m54.848s
>>> user 0m4.272s
>>> sys 0m37.758s
>>>
>>>
>>> =============================================================================
>>> Infiniband over SDP
>>> =============================================================================
>>> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
>>> 4+0 records in
>>> 4+0 records out
>>> 2147483648 bytes (2.1 GB) copied, 12.507 s, 172 MB/s <--- (2.4x slower)
>>>
>>> # dd if=/dev/zero of=/dev/drbd0 bs=4k count=100 oflag=direct
>>> 100+0 records in
>>> 100+0 records out
>>> 409600 bytes (410 kB) copied, 19.6418 s, 20.9 kB/s <--- (844x slower)
>>>
>>> # time mkfs.ext4 /dev/drbd0
>>> real 10m12.337s <--- (2.6x slower)
>>> user 0m4.336s
>>> sys 0m39.866s
>>>
>>>
>>> =============================================================================
>>>
>>> At the same time, I used the Netpipe benchmark to test Infiniband SDP performance, and it looks good.
>>>
>>> netpipe benchmark using:
>>> nodeA# LD_PRELOAD=libsdp.so NPtcp
>>> nodeB# LD_PRELOAD=libsdp.so NPtcp -h 10.0.99.108
>>>
>>> It consistently outperforms Infiniband/IP, as I would expect. So this leaves me thinking that either there is a problem with my DRBD config, or DRBD uses SDP differently for re-sync than for keeping in sync, or my testing is flawed.
>>>
>>>
>>> Here is what my config looks like:
>>> # drbdsetup /dev/drbd0 show
>>> disk {
>>> size 0s _is_default; # bytes
>>> on-io-error pass_on _is_default;
>>> fencing dont-care _is_default;
>>> no-disk-flushes ;
>>> no-md-flushes ;
>>> max-bio-bvecs 0 _is_default;
>>> }
>>> net {
>>> timeout 60 _is_default; # 1/10 seconds
>>> max-epoch-size 8192;
>>> max-buffers 8192;
>>> unplug-watermark 16384;
>>> connect-int 10 _is_default; # seconds
>>> ping-int 10 _is_default; # seconds
>>> sndbuf-size 0 _is_default; # bytes
>>> rcvbuf-size 0 _is_default; # bytes
>>> ko-count 4;
>>> after-sb-0pri disconnect _is_default;
>>> after-sb-1pri disconnect _is_default;
>>> after-sb-2pri disconnect _is_default;
>>> rr-conflict disconnect _is_default;
>>> ping-timeout 5 _is_default; # 1/10 seconds
>>> on-congestion block _is_default;
>>> congestion-fill 0s _is_default; # byte
>>> congestion-extents 127 _is_default;
>>> }
>>> syncer {
>>> rate 524288k; # bytes/second
>>> after -1 _is_default;
>>> al-extents 3833;
>>> cpu-mask "15";
>>> on-no-data-accessible io-error _is_default;
>>> c-plan-ahead 0 _is_default; # 1/10 seconds
>>> c-delay-target 10 _is_default; # 1/10 seconds
>>> c-fill-target 0s _is_default; # bytes
>>> c-max-rate 102400k _is_default; # bytes/second
>>> c-min-rate 4096k _is_default; # bytes/second
>>> }
>>> protocol C;
>>> _this_host {
>>> device minor 0;
>>> disk "/dev/sdc1";
>>> meta-disk internal;
>>> address sdp 10.0.99.108:7790;
>>> }
>>> _remote_host {
>>> address ipv4 10.0.99.107:7790;
>>> }
>>>
>>> Any insight would be greatly appreciated.
>>>
>>>