Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Friday 01 October 2010 04:21:47 J. Ryan Earl wrote:
> On Thu, Sep 30, 2010 at 11:22 AM, Bart Coninckx <bart.coninckx at telenet.be> wrote:
> > Hi all,
> >
> > I remember doing some research about bonding more than two network cards
> > and having found that Linbit had shown that this does not really improve
> > performance because of the TCP reordering.
> >
> > I was just wondering if this is still the case, given more recent
> > developments in hardware.
> >
> > I just saw bonnie++ hit more than 250 MB/sec, while my bonded gigabit
> > gives me about 160 MB/sec, allowing for a 30% TCP header penalty, so
> > looking into this is useful.
> >
> > If not, I will be looking at 10Gb cards I guess ...
>
> Hi there,
>
> So I'll answer your direct question, and then I'll answer the question I
> think you really want answered -- what's the best interconnect for DRBD --
> as I've been doing a lot of testing in that area.
>
> The results of bonding many GigE connections probably depend on your
> hardware. Look for an RDMA-based solution instead of hardware that
> requires interrupts for passing network traffic: igb.ko-based Intel NICs,
> bnx2.ko-based Broadcom NICs, etc. You want to tune your TCP windows
> higher. There is supposed to be a provision in NAPI now that lets the
> bonding driver tell the NIC not to coalesce segments in hardware and to do
> TCP segment coalescing at a higher layer, which avoids the reordering
> problem entirely. I found that whatever tuning I was already doing on the
> network stack was enough to give me line-speed performance on a BCM5709C
> (a common onboard chip) with the bnx2 driver and a dual-port setup. A
> sustained 1.95 Gbit/s resulted in 234 MB/sec of actual write throughput
> through DRBD.
>
> I didn't try with 3-port bonds because that's just my backup.
>
> Part of the problem with 10 Gbit Ethernet is that it can have higher
> latency than regular GigE; a well-implemented 10 GbE link and a
> well-implemented GigE link have about the same latency.
>
> I've been working on creating a respectably high-performance DRBD setup.
> I've tested Dolphin Express (DX) with SuperSockets, QDR InfiniBand, and
> 10 GbE (VPI Ethernet mode). Dolphin Express I think is great if it's
> compatible with your server; for whatever reason it was not compatible
> with my intended production gear. Its primary advantages are transparent
> fail-over to Ethernet, transparently redundant links when put back-to-back
> (i.e. 2 links merged into 1 fabric), and a well-optimized software stack
> in general. It is also well supported by DRBD. If you don't need more
> than 500-600 MB/sec sustained write throughput, I think Dolphin is great.
>
> QDR (40 Gbit) InfiniBand's primary advantages are raw performance,
> flexibility, widespread adoption, and sheer scalability. It is more
> enterprise-ready and may be better supported on your server hardware. On
> the downside, it doesn't have transparent fail-over of any sort in a
> back-to-back configuration; it can neither fail over transparently between
> IB ports nor to a backup Ethernet interconnect. IB ports bond together
> only in active-passive mode, and only if they are on the same fabric. In
> a back-to-back configuration each connected port-pair is a separate
> fabric, so ib-bonding doesn't work back-to-back as there are 2 fabrics in
> play.
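As a rough sketch of the "tune your TCP windows higher" and
segment-coalescing advice above -- these are illustrative values and assumed
interface names, not the settings used in the tests being quoted:

  # Raise the TCP window limits the kernel will allow (illustrative values)
  sysctl -w net.core.rmem_max=16777216
  sysctl -w net.core.wmem_max=16777216
  sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

  # Tolerate more out-of-order segments before TCP assumes loss,
  # which is what hurts balance-rr bonds
  sysctl -w net.ipv4.tcp_reordering=127

  # Stop the slave NICs/drivers from coalescing received segments
  # (assumed slave names eth0/eth1)
  ethtool -K eth0 gro off
  ethtool -K eth0 lro off
  ethtool -K eth1 gro off
  ethtool -K eth1 lro off

  # ... and let the bond device coalesce at the higher layer instead,
  # if the kernel's bonding driver supports GRO
  ethtool -K bond0 gro on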
> Anyway, here are some numbers. Unfortunately, I don't have any pure
> throughput numbers from within kernel space, which is what matters to
> DRBD. Interestingly enough, kernel-space socket performance can differ by
> quite a lot from user-space socket performance.
>
> Userspace netperf:
>
> 2-GigE bonded balance-rr:
> [root at node02 ~]# netperf -f g -H 192.168.90.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1
> (192.168.90.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
>  87380  65536  65536    10.04         1.95   0.75     0.85     0.757   0.855
>
> Dolphin Express (single 10Gbit DX port) with SuperSockets = RDMA:
> [root at node02 network-scripts]# LD_PRELOAD="libksupersockets.so" netperf -f g -H 192.168.90.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1
> (192.168.90.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
> 129024  65536  65536    10.01         6.53   1.48     1.46     0.444   0.439
>
> QDR IB (1 port) IPoIB:
> [root at node02 log]# netperf -f g -H 192.168.20.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1
> (192.168.20.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
>  87380  65536  65536    10.00        16.15   1.74     4.61     0.211   0.562
>
> QDR IB (1 port) SDP (Sockets Direct Protocol = RDMA):
> [root at node02 log]# LD_PRELOAD="libsdp.so" netperf -f g -H 192.168.20.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1
> (192.168.20.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
>  87380  65536  65536    10.01        24.67   3.18     3.28     0.253   0.262
>
> Userspace SDP above does the best at 24.67 Gbit/s; IPoIB is slower at
> 16.15 Gbit/s. However, I have not been able to get DRBD + SDP to perform
> anywhere near as well as DRBD + IPoIB, which is interesting. DRBD + SDP
> maxes out around 400-450 MB/sec write and resync speed. With IPoIB I'm
> getting sustained writes at 720 MB/sec; interestingly, resync speed was
> "only" ~620 MB/sec.
>
> # time dd if=/dev/zero bs=2048M count=20 of=/dev/drbd0
> 0+20 records in
> 0+20 records out
> 42949591040 bytes (43 GB) copied, 59.4198 seconds, 723 MB/s
>
> At this point, single-thread performance is my bottleneck. The above is
> with Xeon X5650s, but I expect the X5680s in the production gear will do
> DRBD writes >800 MB/sec. My backing storage is capable of 900 MB/s
> throughput, so I think I could reasonably get about 90% of that.
>
> The IB HCAs I'm using support VPI (Virtual Protocol Interface), which
> means they can be put into different encapsulation modes, i.e. InfiniBand
> or 10 GbE. Running in 10 GbE mode, my write throughput was in the
> 300-400 MB/sec range, same with resync speed. Running the adapter in IB
> mode with IP-over-InfiniBand (IPoIB) gave a substantial increase in
> performance, at the cost of running an opensmd instance. Dolphin DX with
> SuperSockets outperforms raw 10 GbE as well.
>
> What kind of write throughput are you looking for?
>
> -JR
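On the DRBD side, the knobs that usually matter for links this fast live in
the resource's net and syncer sections. The sketch below is illustrative
for a DRBD 8.3-style drbd.conf; the resource name and values are
assumptions, not JR's configuration:

  resource r0 {
    net {
      sndbuf-size    512k;   # larger send buffer on the replication socket
      max-buffers    8000;   # more receive/IO buffers on the peer
      max-epoch-size 8000;   # allow bigger write bursts between barriers
    }
    syncer {
      rate 500M;             # cap resync so it doesn't starve application I/O
    }
  }

DRBD 8.3 also accepts an address-family prefix on the peer address (for
example "address sdp 192.168.20.2:7788;") to run the replication link over
SDP instead of IPoIB.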
JR,

thank you for this very elaborate and technically rich reply. I will
certainly look into your suggestion about using the Broadcom cards. I have
one dual-port Broadcom card in this server, but I was using one of its
ports combined with one port on an Intel e1000 dual-port NIC in balance-rr,
to provide a backup in the event a NIC goes down: dual-port NICs usually
share one chip for both ports, so a problem with that chip would take down
the complete DRBD link.

Reality shows this might be a bad idea though: a bonnie++ test against the
backend storage (RAID5 on 15K rpm disks) gives me 255 MB/sec write
performance, while the same test on the DRBD device drops that to
77 MB/sec, even with the MTU set to 9000. It would be nice to get as close
as possible to the theoretical maximum, so a lot needs to be done to get
there. Step 1 would be moving everything to the Broadcom NIC.

Any other suggestions?

Thanks a lot,

Bart
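For context, the balance-rr bond with jumbo frames that Bart describes is
typically configured on a RHEL/CentOS 5-style system along these lines;
device names and addresses are assumptions, not taken from his actual
setup:

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=balance-rr miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  # (assumed replication-network address; jumbo frames via MTU)
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=192.168.90.2
  NETMASK=255.255.255.0
  MTU=9000

  # /etc/sysconfig/network-scripts/ifcfg-eth0
  # (repeat for the second slave, e.g. eth1 on the other NIC)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  BOOTPROTO=none
  ONBOOT=yes

The slaves inherit the bond's MTU, and every device on the replication path
(the peer and any switch in between) has to accept 9000-byte frames for the
jumbo-frame setting to help.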