On Thu, Sep 30, 2010 at 11:22 AM, Bart Coninckx <bart.coninckx@telenet.be> wrote:
> Hi all,
>
> I remember doing some research about bonding more than two network cards and
> having found that Linbit had shown that this does not really improve
> performance because of the TCP reordering.
>
> I was just wondering if this is still the case, provided more recent
> developments with new hardware and stuff.
>
> I just saw bonnie++ hit more than 250 MB/sec while my bonded gigabit gives me
> about 160 MB/sec with 30% TCP header penalty, so looking into this is useful.
>
> If not, I will be looking at 10Gb cards I guess ...

Hi there,

So I'll answer your direct question, and then the question I think you really want answered (what's the best interconnect for DRBD?), as I've been doing a lot of testing in that area:

Results of bonding many GigE connections probably depend on your hardware. Look for an RDMA-based solution instead of hardware that requires interrupts for passing network traffic (igb.ko-based Intel NICs, bnx2.ko-based Broadcom NICs, etc.). You want to tune your TCP windows higher. There's supposed to be a provision in NAPI now that lets the bonding driver tell the NIC not to coalesce segments in hardware and to do TCP segment coalescing at a higher layer, which avoids the reordering problem entirely. I found that whatever tuning I was already doing on the network stack was enough to give me line-speed performance on a BCM5709C (common onboard) with the bnx2 driver and a dual-port setup: sustained 1.95 Gbit, which resulted in 234 MB/sec actual write throughput through DRBD.
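
To give a rough idea of the kind of tuning I mean, something along these lines (the values are illustrative starting points, not my exact settings):

# Raise the TCP window limits (illustrative values):
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Bonding driver set to round-robin, e.g. in /etc/modprobe.conf:
#   alias bond0 bonding
#   options bond0 mode=balance-rr miimon=100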

I didn't try a 3-way bond because that's just my backup.

Part of the problem with 10 Gbit Ethernet is that it can have higher latency than regular GigE; well-implemented 10 GbE and well-implemented GigE both have about the same latency.
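
If you want to compare interconnect latency yourself, a request/response test is the easiest way; for example (illustrative invocation, not output from my runs):

netperf -H 192.168.90.1 -t TCP_RR -- -r 1,1
# TCP_RR reports transactions/sec; 1/(trans/sec) approximates the round-trip time.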

I've been working on creating a respectably high-performance DRBD setup. I've tested Dolphin Express (DX) with SuperSockets, QDR InfiniBand, and 10 GbE (VPI Ethernet mode). Dolphin Express I think is great if it's compatible with your server; for whatever reason it was not compatible with my intended production gear. Its primary advantages are transparent fail-over to Ethernet, transparently redundant links when put back-to-back (i.e., 2 links merged into 1 fabric), and a well-optimized software stack in general. It is also well supported by DRBD. If you don't need more than 500-600 MB/sec sustained write throughput, I think Dolphin is great.

QDR (40 Gbit) InfiniBand's primary advantages are raw performance, flexibility, widespread adoption, and sheer scalability. It is more enterprise-ready and may be better supported on your server hardware. On the downside, it doesn't have transparent fail-over of any sort in a back-to-back configuration; it can neither fail over transparently between IB ports nor to a backup Ethernet interconnect. IB ports bond together only in active-passive mode, and only if they are on the same fabric. In a back-to-back configuration each connected port-pair is a separate fabric, so ib-bonding doesn't work back-to-back as there are 2 fabrics in play.
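
For what it's worth, the interconnect is selected in drbd.conf via the address family; roughly like this (sketch only, addresses, ports and paths are made up, and check that your DRBD build was compiled with ssocks/sdp support):

resource r0 {
  on node01 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;
    # Dolphin SuperSockets:  address ssocks 192.168.90.1:7788;
    # InfiniBand SDP:        address sdp    192.168.20.1:7788;
    # IPoIB or bonded GigE:  address ipv4   192.168.20.1:7788;
    address ipv4 192.168.20.1:7788;
  }
}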

Anyway, here are some numbers. Unfortunately, I don't have any pure throughput numbers from within kernel-space, which is what matters to DRBD. Interestingly enough, kernel-space socket performance can differ by quite a lot from user-space socket performance.

Userspace netperf:

2-GigE bonded balance-rr:

[root@node02 ~]# netperf -f g -H 192.168.90.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.04      1.95      0.75     0.85     0.757   0.855

Dolphin Express (single 10Gbit DX port) SuperSockets = RDMA:

[root@node02 network-scripts]# LD_PRELOAD="libksupersockets.so" netperf -f g -H 192.168.90.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

129024  65536  65536    10.01      6.53      1.48     1.46     0.444   0.439

QDR IB (1 port) IPoIB:

[root@node02 log]# netperf -f g -H 192.168.20.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.00     16.15      1.74     4.61     0.211   0.562

QDR IB (1 port) SDP (Sockets Direct Protocol = RDMA):

[root@node02 log]# LD_PRELOAD="libsdp.so" netperf -f g -H 192.168.20.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.01     24.67      3.18     3.28     0.253   0.262

Userspace SDP above does the best at 24.67 Gbit/s; IPoIB is slower at 16.15 Gbit/s. However, I have not been able to get DRBD + SDP to perform anywhere near as well as DRBD + IPoIB, which is interesting. DRBD + SDP maxes out around 400-450 MB/sec write and resync speed. With IPoIB I'm getting sustained writes at 720 MB/sec; interestingly, resync speed was "only" ~620 MB/sec.
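
One footnote on the resync numbers: you have to raise the syncer rate well above the default before DRBD will let resync run that fast. These aren't my exact values, but it's along these lines in drbd.conf:

syncer {
  rate 700M;        # allow resync to use up to ~700 MB/sec
  al-extents 3833;  # a larger activity log helps sustained writes
}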

# time dd if=/dev/zero bs=2048M count=20 of=/dev/drbd0
0+20 records in
0+20 records out
42949591040 bytes (43 GB) copied, 59.4198 seconds, 723 MB/s

At this point, single-thread performance is my bottleneck. The above is with Xeon X5650s, but I expect the X5680s in the production gear will do DRBD writes >800 MB/sec. My backing storage is capable of 900 MB/s throughput, so I think I could reasonably get about 90% of that.
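
If you want to check whether a single writer really is the limit, you can run several dd streams at different offsets and add up the throughput; a rough sketch (this writes to the device and will destroy data on it):

for i in 0 1 2 3; do
  dd if=/dev/zero of=/dev/drbd0 bs=1M count=10240 \
     seek=$((i * 10240)) oflag=direct &
done
wait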

The IB HCAs I'm using are VPI (Virtual Protocol Interface) adapters, which means they can be put into different encapsulation modes, i.e., InfiniBand or 10 GbE. Running in 10 GbE mode, my write throughput was in the 300-400 MB/sec range, same with resync speed. Running the adapter in IB mode with IP-over-InfiniBand (IPoIB) gave a substantial increase in performance at the cost of running an opensmd instance. Dolphin DX with SuperSockets outperforms raw 10 GbE as well.
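
Switching a ConnectX VPI port between modes goes through the mlx4 driver's sysfs interface; roughly like this (the PCI address is an example, check yours with lspci, and your OFED install may provide a connectx_port_config script instead):

echo eth > /sys/bus/pci/devices/0000:06:00.0/mlx4_port1   # port 1 -> 10 GbE
echo ib  > /sys/bus/pci/devices/0000:06:00.0/mlx4_port2   # port 2 -> InfiniBand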

What kind of write throughput are you looking for?

-JR