[DRBD-user] DRBD over Infiniband (SDP) performance oddity

Cédric Dufour - Idiap Research Institute cedric.dufour at idiap.ch
Tue Aug 23 13:48:40 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

On 23/08/11 00:25, Aj Mirani wrote:
> Hi Cédric,
>
> I was able to replicate your results in my environment. The large block 'dd' test saw the biggest improvement in transfer rate when I dropped the ib_sdp module's recv_poll to 100usec. 
>
> The small block 'dd' test saw no significant change from modifying recv_poll, but did see a marginal improvement from increasing sndbuf-size.
Thanks for that feedback!
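
For the record, and for whoever digs this thread out of the archives
later, the two knobs we keep referring to boil down to roughly the
following on my side (paths and syntax quoted from memory, so
double-check them against your OFED and DRBD versions):

    # ib_sdp receive-polling time, in microseconds: at module load time...
    modprobe ib_sdp recv_poll=100
    # ...persistently, via modprobe configuration (file name is just an example)
    echo "options ib_sdp recv_poll=100" > /etc/modprobe.d/ib_sdp.conf
    # ...or at runtime, provided the parameter is exported writable
    echo 100 > /sys/module/ib_sdp/parameters/recv_poll

and, on the DRBD side, in the resource's 'net' section, followed by
'drbdadm adjust all' on both nodes:

    net {
        sndbuf-size 10240k; # instead of the default 0
        ...
    }
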
> I would agree with you that it seems SDP doesn't like smaller blocks (and it might just be my ignorance as to how the SDP protocol works.)  I might go digging through the RFC if I can't sort it out because for our application we do almost nothing but small writes.
>
> Although if that's the case, I wonder why Netpipe can get better performance over SDP than over IP.  (See the Netpipe test results at the bottom.)
Since I'm completely out of my depth here, I can only offer my
intuition, which is that all those results point to some flaw in DRBD's
SDP implementation. I remember looking at the patch that brought SDP
support to DRBD, and it is very simple (just a matter of instantiating
an SDP socket instead of a TCP one, IIRC). Maybe, given DRBD's traffic
patterns, some further "intelligence" would be needed. Again, I might
be totally wrong. Let's hope someone knowledgeable stumbles upon our
messages and sheds some light on the matter.
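
One way to narrow it down further, by the way, might be to take the
replication link out of the equation altogether: run the small-block
'dd' once with the peer disconnected, so that only the local I/O path
is measured, and compare it with the connected runs over SDP and over
IP. If the standalone figure is good, the slowdown clearly comes from
the replication path rather than from the local stack. Something along
these lines (blocks written while disconnected are simply resynced
upon reconnection):

    # on the primary: temporarily stop replication
    drbdadm disconnect all
    # the same small-block test, now hitting the local I/O path only
    dd if=/dev/zero of=/dev/drbd0 bs=4k count=100 oflag=direct
    # re-establish replication; DRBD resyncs whatever was written meanwhile
    drbdadm connect all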

Cheers,

Cédric
> Here are my results after making the same changes you did:
>
> =============================================================================
> SDP - after sndbuf-size=10240k;
> Large dd test slightly better but on avg more or less the same
> Small dd test slightly better
> =============================================================================
> IP - after sndbuf-size=10240k;
> Large dd test slightly better but on avg more or less the same
> Small dd test slightly better
> =============================================================================
> SDP - after sndbuf-size=10240k and ib_sdp recv_poll 100;
> Large dd test significant improvement!
> Small dd test no change
>
> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> 4+0 records in
> 4+0 records out
> 2147483648 bytes (2.1 GB) copied, 3.28283 s, 654 MB/s  <- excellent! faster than IP.
>
> Here are the previous SDP results if you recall:
> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> 4+0 records in
> 4+0 records out
> 2147483648 bytes (2.1 GB) copied, 12.507 s, 172 MB/s 
>
> And previous IP results:
> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> 4+0 records in
> 4+0 records out
> 2147483648 bytes (2.1 GB) copied, 5.1764 s, 415 MB/s
> =============================================================================
>
>
>
>
> Here is why I expected the small block dd SDP test to outperform IP:
> =============================================================================
> The through-put of IP over Infiniband was tested using Netpipe:
>
>     nodeA# NPtcp
>     nodeB# NPtcp -h 10.0.99.108
>
> Send and receive buffers are 16384 and 87380 bytes
> (A bug in Linux doubles the requested buffer sizes)
> Now starting the main loop
>   0:       1 bytes   2912 times -->      0.28 Mbps in      27.64 usec
>   1:       2 bytes   3617 times -->      0.57 Mbps in      26.77 usec
>   2:       3 bytes   3735 times -->      0.83 Mbps in      27.65 usec
>   3:       4 bytes   2411 times -->      1.10 Mbps in      27.71 usec
>   4:       6 bytes   2706 times -->      1.70 Mbps in      26.85 usec
>   5:       8 bytes   1862 times -->      2.27 Mbps in      26.92 usec
>   .
>   .
>   .
> 117: 4194307 bytes      6 times -->   4446.10 Mbps in    7197.32 usec
> 118: 6291453 bytes      6 times -->   5068.46 Mbps in    9470.32 usec
> 119: 6291456 bytes      7 times -->   4873.45 Mbps in    9849.29 usec
> 120: 6291459 bytes      6 times -->   4454.66 Mbps in   10775.25 usec
> 121: 8388605 bytes      3 times -->   4651.95 Mbps in   13757.67 usec
> 122: 8388608 bytes      3 times -->   4816.20 Mbps in   13288.50 usec
> 123: 8388611 bytes      3 times -->   4977.90 Mbps in   12856.84 usec
>
> =============================================================================
> The through-put of SDP over Infiniband was tested using Netpipe:
>
>     nodeA# LD_PRELOAD=libsdp.so NPtcp 
>     nodeB# LD_PRELOAD=libsdp.so  NPtcp -h 10.0.99.108
>
> Send and receive buffers are 126976 and 126976 bytes
> (A bug in Linux doubles the requested buffer sizes)
> Now starting the main loop
>   0:       1 bytes  17604 times -->      1.54 Mbps in       4.95 usec
>   1:       2 bytes  20215 times -->      3.11 Mbps in       4.91 usec
>   2:       3 bytes  20380 times -->      4.67 Mbps in       4.90 usec
>   3:       4 bytes  13608 times -->      6.15 Mbps in       4.96 usec
>   4:       6 bytes  15116 times -->      9.25 Mbps in       4.95 usec
>   5:       8 bytes  10100 times -->     12.38 Mbps in       4.93 usec
>   .
>   .
>   .
> 117: 4194307 bytes     15 times -->   9846.87 Mbps in    3249.76 usec
> 118: 6291453 bytes     15 times -->   9745.60 Mbps in    4925.30 usec
> 119: 6291456 bytes     13 times -->   9719.69 Mbps in    4938.43 usec
> 120: 6291459 bytes     13 times -->   9721.99 Mbps in    4937.26 usec
> 121: 8388605 bytes      6 times -->   9714.01 Mbps in    6588.42 usec
> 122: 8388608 bytes      7 times -->   9713.48 Mbps in    6588.78 usec
> 123: 8388611 bytes      7 times -->   9731.95 Mbps in    6576.28 usec
>
> =============================================================================
>
>
>
>
>
> 			-aj
>
>
>
>
> On Mon, Aug 22, 2011 at 09:28:27AM +0200, Cédric Dufour - Idiap Research Institute wrote:
>> Hello,
>>
>> Have you seen my post on (quite) the same subject:
>> http://lists.linbit.com/pipermail/drbd-user/2011-July/016598.html ?
>>
>> Based on your experiments and mine, it would seem that SDP does not like
>> "transferring small bits of data" (not being a TCP/SDP guru, I don't
>> know how to put it more appropriately). This would somehow correlate
>> with my finding that the 'sndbuf-size' needs to be increased as much as
>> possible. And it also correlates with the fact that the initial sync, or
>> a "dd" test with a large block size, uses SDP very efficiently, while
>> operations involving smaller "data bits" don't.
>>
>> I'm curious whether playing with the 'sndbuf-size' and ib_sdp's
>> 'recv_poll' parameters would affect your setup the same way it did mine.
>>
>> Cheers,
>>
>> Cédric
>>
>> On 19/08/11 21:45, Aj Mirani wrote:
>>> I'm currently testing DRBD over Infiniband/SDP vs Infiniband/IP.  
>>>
>>> My configuration is as follows:
>>> DRBD 8.3.11 (Protocol C)
>>> Linux kernel 2.6.39 
>>> OFED 1.5.4
>>> Infiniband: Mellanox Technologies MT26428
>>>
>>> My baseline test was to attempt a resync of the secondary node using Infiniband over IP.  I noted the sync rate. Once complete, I performed some other very rudimentary tests using 'dd' and 'mkfs' to get a sense of actual performance.  Then I shut down DRBD on both primary and secondary, modified the config to use SDP, and started it back up to re-try all of the tests.
>>>
>>> original:
>>>     address   10.0.99.108:7790 ;
>>> to use SDP:
>>>     address   sdp 10.0.99.108:7790 ;
>>>
>>> No other config changes were made.
>>>
>>> After this, I issued "drbdadm invalidate-remote all" on the primary to force a re-sync.  I noted my sync rate almost doubled, which was excellent.
>>>
>>> Once the sync was complete, I re-attempted my other tests.  Amazingly, every test using Infiniband over SDP performed significantly worse than Infiniband over IP.
>>>
>>> Is there anything that can explain this? 
>>>
>>>
>>> Here are my actual tests/results for each config:
>>> =============================================================================
>>> Infiniband over IP
>>> =============================================================================
>>> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
>>> 4+0 records in
>>> 4+0 records out
>>> 2147483648 bytes (2.1 GB) copied, 5.1764 s, 415 MB/s
>>>
>>> # dd if=/dev/zero of=/dev/drbd0 bs=4k count=100 oflag=direct
>>> 100+0 records in
>>> 100+0 records out
>>> 409600 bytes (410 kB) copied, 0.0232504 s, 17.6 MB/s
>>>
>>> # time mkfs.ext4 /dev/drbd0
>>> real    3m54.848s
>>> user    0m4.272s
>>> sys     0m37.758s
>>>
>>>
>>> =============================================================================
>>> Infiniband over SDP
>>> =============================================================================
>>> # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
>>> 4+0 records in
>>> 4+0 records out
>>> 2147483648 bytes (2.1 GB) copied, 12.507 s, 172 MB/s    <--- (2.4x slower)
>>>
>>> # dd if=/dev/zero of=/dev/drbd0 bs=4k count=100 oflag=direct
>>> 100+0 records in
>>> 100+0 records out
>>> 409600 bytes (410 kB) copied, 19.6418 s, 20.9 kB/s      <--- (844x slower)
>>>
>>> # time mkfs.ext4 /dev/drbd0
>>> real    10m12.337s                                      <--- (4.25x slower)
>>> user    0m4.336s
>>> sys     0m39.866s
>>>
>>>
>>> =============================================================================
>>>
>>> At the same time I've used the netpipe benchmark to test Infiniband SDP performance, and it looks good.  
>>>
>>> netpipe benchmark using:
>>>     nodeA# LD_PRELOAD=libsdp.so NPtcp 
>>>     nodeB# LD_PRELOAD=libsdp.so  NPtcp -h 10.0.99.108
>>>
>>> It consistently outperforms Infiniband/IP, as I would expect.  So this leaves me thinking that either there is a problem with my DRBD config, or DRBD uses SDP differently for re-sync vs. keeping in sync, or my testing is flawed.
>>>
>>>
>>> Here is what my config looks like:
>>> # drbdsetup /dev/drbd0 show
>>> disk {
>>>         size                    0s _is_default; # bytes
>>>         on-io-error             pass_on _is_default;
>>>         fencing                 dont-care _is_default;
>>>         no-disk-flushes ;
>>>         no-md-flushes   ;
>>>         max-bio-bvecs           0 _is_default;
>>> }
>>> net {
>>>         timeout                 60 _is_default; # 1/10 seconds
>>>         max-epoch-size          8192;
>>>         max-buffers             8192;
>>>         unplug-watermark        16384;
>>>         connect-int             10 _is_default; # seconds
>>>         ping-int                10 _is_default; # seconds
>>>         sndbuf-size             0 _is_default; # bytes
>>>         rcvbuf-size             0 _is_default; # bytes
>>>         ko-count                4;
>>>         after-sb-0pri           disconnect _is_default;
>>>         after-sb-1pri           disconnect _is_default;
>>>         after-sb-2pri           disconnect _is_default;
>>>         rr-conflict             disconnect _is_default;
>>>         ping-timeout            5 _is_default; # 1/10 seconds
>>>         on-congestion           block _is_default;
>>>         congestion-fill         0s _is_default; # byte
>>>         congestion-extents      127 _is_default;
>>> }
>>> syncer {
>>>         rate                    524288k; # bytes/second
>>>         after                   -1 _is_default;
>>>         al-extents              3833;
>>>         cpu-mask                "15";
>>>         on-no-data-accessible   io-error _is_default;
>>>         c-plan-ahead            0 _is_default; # 1/10 seconds
>>>         c-delay-target          10 _is_default; # 1/10 seconds
>>>         c-fill-target           0s _is_default; # bytes
>>>         c-max-rate              102400k _is_default; # bytes/second
>>>         c-min-rate              4096k _is_default; # bytes/second
>>> }
>>> protocol C;
>>> _this_host {
>>>         device                  minor 0;
>>>         disk                    "/dev/sdc1";
>>>         meta-disk               internal;
>>>         address                 sdp 10.0.99.108:7790;
>>> }
>>> _remote_host {
>>>         address                 ipv4 10.0.99.107:7790;
>>> }
>>>
>>> Any insight would be greatly appreciated.
>>>
>>>


