[DRBD-user] bonding more than two network cards still a bad idea?
Bart Coninckx
bart.coninckx at telenet.be
Mon Oct 4 18:13:10 CEST 2010
On Friday 01 October 2010 04:21:47 J. Ryan Earl wrote:
> On Thu, Sep 30, 2010 at 11:22 AM, Bart Coninckx <bart.coninckx at telenet.be> wrote:
> > Hi all,
> >
> > I remember doing some research about bonding more than two network cards
> > and having found that Linbit had shown that this does not really improve
> > performance because of the TCP reordering.
> >
> > I was just wondering if this is still the case, provided more recent
> > developments with new hardware and stuff.
> >
> > I just saw bonnie++ hit more than 250 MB/sec while my bonded gigabit gives
> > me about 160 MB/sec with 30% TCP header penalty, so looking into this is
> > useful.
> >
> > If not, I will be looking at 10Gb cards I guess ...
>
> Hi there,
>
> So I'll answer your direct question, and then I'll answer the question I
> think you really want to know--what's the best interconnect for DRBD--as
> I've been doing a lot of testing in that area:
>
> Results of bonding many GigE connections probably depend on your hardware.
> Look for an RDMA-based solution instead of hardware that requires
> interrupts for passing network traffic: igb.ko based Intel NICs, bnx2.ko
> based Broadcom NICs, etc. You want to tune your TCP windows higher.
> There's supposed to be a provision in NAPI now that lets the bonding
> driver tell the NIC not to coalesce segments in hardware and to do TCP
> segment coalescing at a higher layer to completely avoid the reordering
> problem. I found that whatever tuning I was already doing on the network
> stack was enough to give me line-speed performance on a BCM5709C (common
> onboard) with the bnx2 driver and a dual port setup. Sustained 1.95Gbit,
> resulted in 234MB/sec actual write throughput through DRBD.
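
As a sanity check on the two figures just quoted, here is a minimal sketch that converts the netperf throughput into MB/s and compares it with the DRBD write rate. It assumes both tools count decimal units (10^9 bits and 10^6 bytes); the function names `gbit_to_mbs` and `pct_of` are made up for this illustration.

```shell
#!/bin/sh
# Convert a netperf throughput figure (10^9 bits/s) to MB/s (10^6 bytes/s).
# Assumption: decimal units on both sides.
gbit_to_mbs() {
    awk -v g="$1" 'BEGIN { printf "%.2f\n", g * 1000 / 8 }'
}

# Express an achieved rate as a whole-number percentage of a ceiling.
pct_of() {
    awk -v got="$1" -v max="$2" 'BEGIN { printf "%.0f\n", got / max * 100 }'
}

ceiling=$(gbit_to_mbs 1.95)
echo "1.95 Gbit/s on the wire = ${ceiling} MB/s"
echo "234 MB/s through DRBD is $(pct_of 234 "$ceiling")% of that"
```

Under those assumptions the DRBD write rate sits within a few percent of what the bonded link delivered to netperf, which is consistent with the "line-speed" description.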
>
> I didn't try with 3-bonds because that's just my backup.
>
> Part of the problem with 10Gbit Ethernet is that it can have higher latency
> than regular GigE; even well-implemented 10 GbE only has about the same
> latency as well-implemented GigE.
>
> I've been working on creating a respectably high-performance DRBD setup.
> I've tested Dolphin Express (DX) with SuperSockets, QDR Infiniband, and 10
> GbE (VPI Ethernet mode). Dolphin Express I think is great if it's
> compatible with your server; for whatever reason it was not compatible with
> my intended production gear. Its primary advantages are transparent
> fail-over to Ethernet, transparently redundant links when put back-to-back
> (i.e. 2 links merged into 1 fabric), and a well optimized software stack in
> general. It is also well supported by DRBD. If you don't need more than
> 500-600 MB/sec sustained write throughput, I think Dolphin is great.
>
> QDR (40Gbit) InfiniBand's primary advantages are raw performance,
> flexibility, widespread adoption, and sheer scalability. It is more
> enterprise ready and may be better supported on your server hardware. On
> the downside, it doesn't have transparent fail-over of any sort in a
> back-to-back configuration; it can fail over transparently neither between
> IB ports nor to a backup Ethernet interconnect. IB ports bond together only
> in active-passive mode, and only if they are on the same fabric. In a
> back-to-back configuration each connected port-pair is a separate fabric, so
> ib-bonding doesn't work back-to-back as there are 2 fabrics in play.
>
> Anyway, here are some numbers. Unfortunately, I don't have any pure
> throughput numbers from within kernel-space which is what matters to DRBD.
> Interestingly enough, kernel-space socket performance can differ by quite
> a lot from user-space socket performance.
>
> Userspace netperf:
>
> 2-GigE bonded balance-rr
> [root at node02 ~]# netperf -f g -H 192.168.90.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
>  87380  65536  65536    10.04         1.95   0.75     0.85     0.757   0.855
>
>
> Dolphin Express (single 10Gbit DX port) SuperSockets = RDMA:
> [root at node02 network-scripts]# LD_PRELOAD="libksupersockets.so" netperf -f g -H 192.168.90.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
> 129024  65536  65536    10.01         6.53   1.48     1.46     0.444   0.439
>
>
> QDR IB (1 port) IPoIB
> [root at node02 log]# netperf -f g -H 192.168.20.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
>  87380  65536  65536    10.00        16.15   1.74     4.61     0.211   0.562
>
> QDR IB (1 port) SDP (Sockets Direct Protocol = RDMA)
> [root at node02 log]# LD_PRELOAD="libsdp.so" netperf -f g -H 192.168.20.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
>
>  87380  65536  65536    10.01        24.67   3.18     3.28     0.253   0.262
>
>
> Userspace SDP above does the best at 24.67 Gbit/s, IPoIB is slower at 16.15
> Gbit/s. However, I have not been able to get DRBD + SDP to perform anywhere
> near as well as DRBD + IPoIB, which is interesting. DRBD + SDP maxes out
> around 400-450MB/sec write and resync speed. With IPoIB I'm getting ~620
> MB/sec sync with sustained writes at 720MB/sec. Interestingly, resync
> speed was "only" 620MB/sec.
>
> # time dd if=/dev/zero bs=2048M count=20 of=/dev/drbd0
> 0+20 records in
> 0+20 records out
> 42949591040 bytes (43 GB) copied, 59.4198 seconds, 723 MB/s
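
The "0+20 records" in that dd output means all 20 reads were partial. The byte count equals 20 × (2^31 − 4096), which is consistent with each 2048M read being cut short at the Linux per-call limit for read(). A small sketch, using only the numbers printed above, that re-derives the total and the reported rate (`rate_mbs` is a name invented here; dd's "MB" is taken to mean 10^6 bytes):

```shell
#!/bin/sh
# Largest transfer a single read() returns on Linux: 2^31 - 4096 bytes.
# "%.0f" is used rather than "%d" so awk variants with 32-bit integer
# formatting still print the full value.
per_read=$(awk 'BEGIN { printf "%.0f\n", 2^31 - 4096 }')
total=$(awk -v b="$per_read" 'BEGIN { printf "%.0f\n", b * 20 }')

# Throughput in MB/s (10^6 bytes per second), rounded like dd's summary line.
rate_mbs() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", bytes / secs / 1e6 }'
}

echo "bytes per partial read: ${per_read}"
echo "20 partial reads:       ${total}"
echo "rate: $(rate_mbs "$total" 59.4198) MB/s"
```

The test therefore wrote about 43 GB rather than the 40 GiB the command line suggests, but the MB/s figure is computed from the bytes actually copied, so it stands as reported.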
>
> At this point, single-thread performance is my bottleneck. The above is
> with Xeon X5650s, but I expect the X5680s in production gear will do DRBD
> writes >800MB/sec. My backing storage is capable of 900MB/s throughput so
> I think I could reasonably get about 90% of that throughput.
>
> The IB HCAs I'm using support VPI (Virtual Protocol Interface), which means
> they can be put into different encapsulation modes, i.e. InfiniBand or 10
> GbE. Running in 10GbE mode, my write throughput was in the 300-400MB/sec
> range, same with resync speed. Running the adapter in IB mode with
> IP-over-InfiniBand (IPoIB) gave a substantial increase in performance at
> the cost of running an opensmd instance. Dolphin DX with SuperSockets
> outperforms raw 10GbE as well.
>
> What kind of write throughput are you looking for?
>
> -JR
JR,
thank you for this very elaborate and technically rich reply. I will certainly
look into your suggestions about using Broadcom cards. I have one dual port
Broadcom card in this server, but I was using one port combined with one port
on an Intel e1000 dual port NIC in balance-rr to provide for backup in the
event a NIC goes down. Two port NICs usually share one chip for two ports, so
in case of a problem with the chip, the complete DRBD would be out. Reality
shows this might be a bad idea though: doing a bonnie++ test to the backend
storage (RAID5 on 15K rpm disks) gives me 255 MB/sec write performance, while
doing the same test on the DRBD device drops this to 77 MB/sec, even with the
MTU set to 9000. It would be nice to get as close as possible to the
theoretical maximum, so a lot needs to be done to get there.
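
For reference, a rough sketch of what that theoretical maximum looks like for two bonded GigE links. It assumes plain IPv4 and TCP headers without options, standard Ethernet framing overhead, and perfect striping across both links. It ignores DRBD's own protocol overhead and any replication latency, so the real ceiling is lower. The function name `tcp_payload_mbs` is invented for this example.

```shell
#!/bin/sh
# Best-case TCP payload rate in MB/s (10^6 bytes/s) for a given aggregate
# link speed and MTU.
# Usage: tcp_payload_mbs <aggregate link speed in Gbit/s> <MTU in bytes>
tcp_payload_mbs() {
    awk -v gbit="$1" -v mtu="$2" 'BEGIN {
        # Bytes on the wire per frame: MTU plus Ethernet header (14),
        # FCS (4), preamble (8) and inter-frame gap (12).
        wire = mtu + 38
        # Payload per frame: MTU minus IPv4 header (20) and TCP header (20).
        # Assumption: no TCP options such as timestamps.
        payload = mtu - 40
        printf "%.1f\n", gbit * 1000 / 8 * payload / wire
    }'
}

for mtu in 1500 9000; do
    echo "2 x GigE, MTU ${mtu}: $(tcp_payload_mbs 2 "$mtu") MB/s"
done
```

Under these assumptions the header and framing overhead costs only a few percent even at MTU 1500, so the 77 MB/sec measured on the DRBD device is roughly a third of what the bond could carry in the best case.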
Step 1 would be changing everything to the Broadcom NIC. Any other
suggestions?
Thanks a lot,
Bart