[DRBD-user] bonding more than two network cards still a bad idea?

Bart Coninckx bart.coninckx at telenet.be
Mon Oct 4 18:13:10 CEST 2010

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Friday 01 October 2010 04:21:47 J. Ryan Earl wrote:
> On Thu, Sep 30, 2010 at 11:22 AM, Bart Coninckx <bart.coninckx at telenet.be> wrote:
> > Hi all,
> > 
> > I remember doing some research about bonding more than two network
> > cards and having found that Linbit had shown that this does not
> > really improve performance because of the TCP reordering.
> > 
> > I was just wondering if this is still the case, given more recent
> > developments with new hardware and such.
> > 
> > I just saw bonnie++ hit more than 250 MB/sec, while my bonded gigabit
> > gives me about 160 MB/sec with a 30% TCP header penalty, so looking
> > into this is useful.
> > 
> > If not, I will be looking at 10Gb cards, I guess ...
> 
> Hi there,
> 
> So I'll answer your direct question, and then I'll answer the question
> I think you really want answered (what's the best interconnect for
> DRBD), as I've been doing a lot of testing in that area:
> 
> Results of bonding many GigE connections probably depend on your
> hardware.  Look for an RDMA-based solution instead of hardware that
> requires interrupts for passing network traffic (igb.ko-based Intel
> NICs, bnx2.ko-based Broadcom NICs, etc.).  You want to tune your TCP
> windows higher.  There's supposed to be a provision in NAPI now that
> lets the bonding driver tell the NIC not to coalesce segments in
> hardware and to do TCP segment coalescing at a higher layer instead,
> which avoids the reordering problem entirely.  I found that whatever
> tuning I was already doing on the network stack was enough to give me
> line-speed performance on a BCM5709C (a common onboard chip) with the
> bnx2 driver and a dual-port setup.  A sustained 1.95 Gbit/s resulted in
> 234 MB/sec of actual write throughput through DRBD.
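> 
> To make that concrete, the kind of tuning I mean is roughly the
> following, assuming a RHEL-style box with the two GigE ports in a
> balance-rr bond (the sysctl values are starting points, not gospel):
> 
> # /etc/sysctl.conf (excerpt) -- raise the TCP window limits
> net.core.rmem_max = 16777216
> net.core.wmem_max = 16777216
> net.ipv4.tcp_rmem = 4096 87380 16777216
> net.ipv4.tcp_wmem = 4096 65536 16777216
> # often raised for balance-rr so TCP tolerates more reordering
> net.ipv4.tcp_reordering = 127
> 
> # /etc/modprobe.conf (excerpt) -- round-robin bond of the two GigE ports
> alias bond0 bonding
> options bond0 mode=balance-rr miimon=100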
> 
> I didn't try a 3-port bond because that link is just my backup.
> 
> Part of the problem with 10 Gbit Ethernet is that it can have higher
> latency than regular GigE; a well-implemented 10GbE link and a
> well-implemented GigE link have about the same latency.
> 
> I've been working on creating a respectably high-performance DRBD
> setup.  I've tested Dolphin Express (DX) with SuperSockets, QDR
> InfiniBand, and 10GbE (VPI Ethernet mode).  Dolphin Express I think is
> great if it's compatible with your server; for whatever reason it was
> not compatible with my intended production gear.  Its primary
> advantages are transparent fail-over to Ethernet, transparently
> redundant links when put back-to-back (i.e. 2 links merged into 1
> fabric), and a well-optimized software stack in general.  It is also
> well supported by DRBD.  If you don't need more than 500-600 MB/sec of
> sustained write throughput, I think Dolphin is great.
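> 
> For what it's worth, pointing DRBD at SuperSockets only takes changing
> the address family in the resource definition; a hedged sketch
> (hostnames, disks and addresses are made up):
> 
> resource r0 {
>   on node01 {
>     device    /dev/drbd0;
>     disk      /dev/sdb1;
>     address   ssocks 192.168.90.1:7788;  # "ssocks" = Dolphin SuperSockets
>     meta-disk internal;
>   }
>   on node02 {
>     device    /dev/drbd0;
>     disk      /dev/sdb1;
>     address   ssocks 192.168.90.2:7788;
>     meta-disk internal;
>   }
> }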
> 
> QDR (40Gbit) InfiniBand's primary advantages are raw performance,
> flexibility, wide-spread adoption, and sheer scalability.  It is more
> enterprise-ready and may be better supported on your server hardware.
> On the downside, it doesn't have transparent fail-over of any sort in a
> back-to-back configuration; it can neither fail over transparently
> between IB ports nor to a backup Ethernet interconnect.  IB ports bond
> together only in active-passive mode, and only if they are on the same
> fabric.  In a back-to-back configuration each connected port-pair is a
> separate fabric, so ib-bonding doesn't work back-to-back as there are
> 2 fabrics in play.
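> 
> For the same-fabric case, the active-passive bond is just the stock
> bonding driver (or OFED's ib-bonding package on older kernels) over the
> IPoIB interfaces; a rough sketch, interface names assumed:
> 
> # /etc/modprobe.conf (excerpt) -- active-backup is the only mode that
> # works over IPoIB; ib0 and ib1 are then enslaved via their ifcfg files
> alias bond0 bonding
> options bond0 mode=active-backup miimon=100 primary=ib0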
> 
> Anyway, here are some numbers.  Unfortunately, I don't have any pure
> throughput numbers from within kernel space, which is what matters to
> DRBD.  Interestingly enough, kernel-space socket performance can differ
> quite a lot from user-space socket performance.
> 
> Userspace netperf:
> 
> 2-GigE bonded balance-rr
> [root@node02 ~]# netperf -f g -H 192.168.90.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
> 
>  87380  65536  65536    10.04         1.95   0.75     0.85     0.757   0.855
> 
> 
> Dolphin Express (single 10Gbit DX port) SuperSockets = RDMA:
> [root@node02 network-scripts]# LD_PRELOAD="libksupersockets.so" netperf -f g -H 192.168.90.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
> 
> 129024  65536  65536    10.01         6.53   1.48     1.46     0.444   0.439
> 
> 
> QDR IB (1 port) IPoIB
> [root@node02 log]# netperf -f g -H 192.168.20.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
> 
>  87380  65536  65536    10.00        16.15   1.74     4.61     0.211   0.562
> 
> QDR IB (1 port) SDP (Sockets Direct Protocol = RDMA)
> [root@node02 log]# LD_PRELOAD="libsdp.so" netperf -f g -H 192.168.20.1 -c -C
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB
> 
>  87380  65536  65536    10.01        24.67   3.18     3.28     0.253   0.262
> 
> 
> Userspace SDP above does the best at 24.67 Gbit/s; IPoIB is slower at
> 16.15 Gbit/s.  However, I have not been able to get DRBD + SDP to
> perform anywhere near as well as DRBD + IPoIB, which is interesting.
> DRBD + SDP maxes out around 400-450 MB/sec write and resync speed.
> With IPoIB I'm getting sustained writes at 720 MB/sec; interestingly,
> resync speed was "only" 620 MB/sec.
> 
> # time dd if=/dev/zero bs=2048M count=20 of=/dev/drbd0
> 0+20 records in
> 0+20 records out
> 42949591040 bytes (43 GB) copied, 59.4198 seconds, 723 MB/s
> 
> At this point, single-thread performance is my bottleneck.  The above
> is with Xeon X5650s, but I expect the X5680s in the production gear to
> do DRBD writes >800 MB/sec.  My backing storage is capable of 900 MB/s
> throughput, so I think I could reasonably get about 90% of that.
> 
> The IB HCAs I'm using support VPI (Virtual Protocol Interface), which
> means they can be put into different encapsulation modes, i.e.
> InfiniBand or 10GbE.  Running in 10GbE mode, my write throughput was in
> the 300-400 MB/sec range, same with resync speed.  Running the adapter
> in IB mode with IP-over-InfiniBand (IPoIB) gave a substantial increase
> in performance, at the cost of running an opensmd instance.  Dolphin DX
> with SuperSockets outperforms raw 10GbE as well.
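> 
> For reference, assuming these are Mellanox ConnectX cards on the mlx4
> stack, the port personality is flipped through sysfs, and IPoIB needs a
> subnet manager running on one side of a back-to-back link:
> 
> # switch port 1 of the HCA between 10GbE and InfiniBand mode
> echo eth > /sys/bus/pci/devices/<pci-id>/mlx4_port1
> echo ib  > /sys/bus/pci/devices/<pci-id>/mlx4_port1
> # subnet manager, required for IPoIB on a back-to-back link
> service opensmd start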
> 
> What kind of write throughput are you looking for?
> 
> -JR

JR,

thank you for this very elaborate and technically rich reply. I will certainly
look into your suggestions about using Broadcom cards. I have one dual-port
Broadcom card in this server, but I was using one of its ports combined with
one port on an Intel e1000 dual-port NIC in balance-rr, to provide a fallback
in the event a NIC goes down. Dual-port NICs usually share one chip for both
ports, so a problem with that chip would take the complete DRBD link out.
Reality shows this might be a bad idea though: a bonnie++ test against the
backend storage (RAID5 on 15K rpm disks) gives me 255 MB/sec write
performance, while the same test on the DRBD device drops to 77 MB/sec, even
with the MTU set to 9000. It would be nice to get as close as possible to the
theoretical maximum, so a lot needs to be done to get there.
Step 1 would be changing everything to the Broadcom NIC. Any other
suggestions?
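
For reference, what I have in mind for the all-Broadcom bond is roughly
this (RHEL-style files assumed, eth2/eth3 standing in for the two
Broadcom ports, addresses are examples):

# /etc/modprobe.conf (excerpt)
alias bond1 bonding
options bond1 mode=balance-rr miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
IPADDR=10.0.0.2          # dedicated replication address
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
MTU=9000                 # jumbo frames, as in the test above

# /etc/sysconfig/network-scripts/ifcfg-eth2 (and likewise ifcfg-eth3)
DEVICE=eth2
MASTER=bond1
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none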

Thanks a lot,

Bart


