Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Cédric,
I was able to replicate your results in my environment. The large block 'dd' test saw the biggest improvement in transfer rate when I dropped the ib_sdp module's recv_poll to 100 usec.
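In case it's useful, here is roughly how that change can be applied (a sketch: it assumes your OFED build exposes recv_poll as a load-time ib_sdp module parameter and publishes it under /sys/module/ib_sdp/parameters/; DRBD should be stopped first, since reloading ib_sdp drops any SDP connections):

# modprobe -r ib_sdp && modprobe ib_sdp recv_poll=100
# cat /sys/module/ib_sdp/parameters/recv_poll    <- should now read 100

To keep the setting across reboots, an "options ib_sdp recv_poll=100" line in a file under /etc/modprobe.d/ should do it.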
The small block 'dd' test saw no significant change from modifying recv_poll, but did see a marginal improvement from increasing sndbuf-size.
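For reference, the sndbuf-size change is just the 'net' section of drbd.conf on both nodes (a sketch showing only the relevant option); the resource then needs to be reconnected, e.g. via "drbdadm adjust" or a disconnect/connect cycle, before the new buffer size is actually used:

net {
    sndbuf-size 10240k;
    # ... remaining net options unchanged
}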
I would agree with you that SDP doesn't seem to like smaller blocks (though that might just be my ignorance of how the SDP protocol works). I might go digging through the SDP spec if I can't sort it out, because our application does almost nothing but small writes.
Although if that's the case, I wonder why Netpipe can get better performance over SDP than over IP even at small message sizes. (See the Netpipe test results at the bottom.)
Here are my results after making the same changes you did:
=============================================================================
SDP - after sndbuf-size=10240k:
Large dd test: slightly better, but on average more or less the same
Small dd test: slightly better
=============================================================================
IP - after sndbuf-size=10240k:
Large dd test: slightly better, but on average more or less the same
Small dd test: slightly better
=============================================================================
SDP - after sndbuf-size=10240k and ib_sdp recv_poll=100:
Large dd test: significant improvement!
Small dd test: no change
# dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
4+0 records in
4+0 records out
2147483648 bytes (2.1 GB) copied, 3.28283 s, 654 MB/s <- excellent! faster than IP.
Here are the previous SDP results if you recall:
# dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
4+0 records in
4+0 records out
2147483648 bytes (2.1 GB) copied, 12.507 s, 172 MB/s
And previous IP results:
# dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
4+0 records in
4+0 records out
2147483648 bytes (2.1 GB) copied, 5.1764 s, 415 MB/s
=============================================================================
Here is why I expected the small block dd SDP test to outperform IP:
=============================================================================
The throughput of IP over Infiniband was tested using Netpipe:
nodeA# NPtcp
nodeB# NPtcp -h 10.0.99.108
Send and receive buffers are 16384 and 87380 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
0: 1 bytes 2912 times --> 0.28 Mbps in 27.64 usec
1: 2 bytes 3617 times --> 0.57 Mbps in 26.77 usec
2: 3 bytes 3735 times --> 0.83 Mbps in 27.65 usec
3: 4 bytes 2411 times --> 1.10 Mbps in 27.71 usec
4: 6 bytes 2706 times --> 1.70 Mbps in 26.85 usec
5: 8 bytes 1862 times --> 2.27 Mbps in 26.92 usec
.
.
.
117: 4194307 bytes 6 times --> 4446.10 Mbps in 7197.32 usec
118: 6291453 bytes 6 times --> 5068.46 Mbps in 9470.32 usec
119: 6291456 bytes 7 times --> 4873.45 Mbps in 9849.29 usec
120: 6291459 bytes 6 times --> 4454.66 Mbps in 10775.25 usec
121: 8388605 bytes 3 times --> 4651.95 Mbps in 13757.67 usec
122: 8388608 bytes 3 times --> 4816.20 Mbps in 13288.50 usec
123: 8388611 bytes 3 times --> 4977.90 Mbps in 12856.84 usec
=============================================================================
The throughput of SDP over Infiniband was tested using Netpipe:
nodeA# LD_PRELOAD=libsdp.so NPtcp
nodeB# LD_PRELOAD=libsdp.so NPtcp -h 10.0.99.108
Send and receive buffers are 126976 and 126976 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
0: 1 bytes 17604 times --> 1.54 Mbps in 4.95 usec
1: 2 bytes 20215 times --> 3.11 Mbps in 4.91 usec
2: 3 bytes 20380 times --> 4.67 Mbps in 4.90 usec
3: 4 bytes 13608 times --> 6.15 Mbps in 4.96 usec
4: 6 bytes 15116 times --> 9.25 Mbps in 4.95 usec
5: 8 bytes 10100 times --> 12.38 Mbps in 4.93 usec
.
.
.
117: 4194307 bytes 15 times --> 9846.87 Mbps in 3249.76 usec
118: 6291453 bytes 15 times --> 9745.60 Mbps in 4925.30 usec
119: 6291456 bytes 13 times --> 9719.69 Mbps in 4938.43 usec
120: 6291459 bytes 13 times --> 9721.99 Mbps in 4937.26 usec
121: 8388605 bytes 6 times --> 9714.01 Mbps in 6588.42 usec
122: 8388608 bytes 7 times --> 9713.48 Mbps in 6588.78 usec
123: 8388611 bytes 7 times --> 9731.95 Mbps in 6576.28 usec
=============================================================================
-aj
On Mon, Aug 22, 2011 at 09:28:27AM +0200, Cédric Dufour - Idiap Research Institute wrote:
> Hello,
>
> Have you seen my post on (quite) the same subject:
> http://lists.linbit.com/pipermail/drbd-user/2011-July/016598.html ?
>
> Based on your experiments and mine, it would seem that SDP does not like
> "transferring small bits of data" (not being a TCP/SDP guru, I don't
> know how to put it more appropriately). This would somehow correlate
> with my finding of needing to increase the 'sndbuf-size' as much as
> possible. And this also correlates with the fact that the initial sync, or
> a "dd" test with a large block size, uses SDP very efficiently, while
> operations involving smaller "data bits" don't.
>
> I'm curious whether playing with the 'sndbuf-size' and ib_sdp's
> 'recv_poll' parameters would affect your setup the same way it did mine.
>
> Cheers,
>
> Cédric
>
> On 19/08/11 21:45, Aj Mirani wrote:
> > I'm currently testing DRBD over Infiniband/SDP vs Infiniband/IP.
> >
> > My configuration is as follows:
> > DRBD 8.3.11 (Protocol C)
> > Linux kernel 2.6.39
> > OFED 1.5.4
> > Infiniband: Mellanox Technologies MT26428
> >
> > My baseline test was to attempt a resync of the secondary node using Infiniband over IP. I noted the sync rate. Once complete, I performed some other very rudimentary tests using 'dd' and 'mkfs' to get a sense of actual performance. Then I shut down DRBD on both primary and secondary, modified the config to use SDP, and started it back up to retry all of the tests.
> >
> > original:
> > address 10.0.99.108:7790 ;
> > to use SDP:
> > address sdp 10.0.99.108:7790 ;
> >
> > No other config changes were made.
> >
> > After this, I issued "drbdadm invalidate-remote all" on the primary to force a re-sync. I noted my sync rate almost doubled, which was excellent.
> >
> > Once the sync was complete, I re-attempted my other tests. Amazingly, every test using Infiniband over SDP performed significantly worse than Infiniband over IP.
> >
> > Is there anything that can explain this?
> >
> >
> > Here are my actual tests/results for each config:
> > =============================================================================
> > Infiniband over IP
> > =============================================================================
> > # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> > 4+0 records in
> > 4+0 records out
> > 2147483648 bytes (2.1 GB) copied, 5.1764 s, 415 MB/s
> >
> > # dd if=/dev/zero of=/dev/drbd0 bs=4k count=100 oflag=direct
> > 100+0 records in
> > 100+0 records out
> > 409600 bytes (410 kB) copied, 0.0232504 s, 17.6 MB/s
> >
> > # time mkfs.ext4 /dev/drbd0
> > real 3m54.848s
> > user 0m4.272s
> > sys 0m37.758s
> >
> >
> > =============================================================================
> > Infiniband over SDP
> > =============================================================================
> > # dd if=/dev/zero of=/dev/drbd0 bs=512M count=4 oflag=direct
> > 4+0 records in
> > 4+0 records out
> > 2147483648 bytes (2.1 GB) copied, 12.507 s, 172 MB/s <--- (2.4x slower)
> >
> > # dd if=/dev/zero of=/dev/drbd0 bs=4k count=100 oflag=direct
> > 100+0 records in
> > 100+0 records out
> > 409600 bytes (410 kB) copied, 19.6418 s, 20.9 kB/s <--- (844x slower)
> >
> > # time mkfs.ext4 /dev/drbd0
> > real 10m12.337s <--- (2.6x slower)
> > user 0m4.336s
> > sys 0m39.866s
> >
> >
> > =============================================================================
> >
> > At the same time I've used the netpipe benchmark to test Infiniband SDP performance, and it looks good.
> >
> > netpipe benchmark using:
> > nodeA# LD_PRELOAD=libsdp.so NPtcp
> > nodeB# LD_PRELOAD=libsdp.so NPtcp -h 10.0.99.108
> >
> > It consistently outperforms Infiniband/IP, as I would expect. So this leaves me thinking that either there is a problem with my DRBD config, or DRBD uses SDP differently for resync vs. keeping in sync, or my testing is flawed.
> >
> >
> > Here is what my config looks like:
> > # drbdsetup /dev/drbd0 show
> > disk {
> > size 0s _is_default; # bytes
> > on-io-error pass_on _is_default;
> > fencing dont-care _is_default;
> > no-disk-flushes ;
> > no-md-flushes ;
> > max-bio-bvecs 0 _is_default;
> > }
> > net {
> > timeout 60 _is_default; # 1/10 seconds
> > max-epoch-size 8192;
> > max-buffers 8192;
> > unplug-watermark 16384;
> > connect-int 10 _is_default; # seconds
> > ping-int 10 _is_default; # seconds
> > sndbuf-size 0 _is_default; # bytes
> > rcvbuf-size 0 _is_default; # bytes
> > ko-count 4;
> > after-sb-0pri disconnect _is_default;
> > after-sb-1pri disconnect _is_default;
> > after-sb-2pri disconnect _is_default;
> > rr-conflict disconnect _is_default;
> > ping-timeout 5 _is_default; # 1/10 seconds
> > on-congestion block _is_default;
> > congestion-fill 0s _is_default; # byte
> > congestion-extents 127 _is_default;
> > }
> > syncer {
> > rate 524288k; # bytes/second
> > after -1 _is_default;
> > al-extents 3833;
> > cpu-mask "15";
> > on-no-data-accessible io-error _is_default;
> > c-plan-ahead 0 _is_default; # 1/10 seconds
> > c-delay-target 10 _is_default; # 1/10 seconds
> > c-fill-target 0s _is_default; # bytes
> > c-max-rate 102400k _is_default; # bytes/second
> > c-min-rate 4096k _is_default; # bytes/second
> > }
> > protocol C;
> > _this_host {
> > device minor 0;
> > disk "/dev/sdc1";
> > meta-disk internal;
> > address sdp 10.0.99.108:7790;
> > }
> > _remote_host {
> > address ipv4 10.0.99.107:7790;
> > }
> >
> >
> > Any insight would be greatly appreciated.
> >
> >
--
Aj Mirani
Operations Manager, Tucows Inc.
416-535-0123 x1294