[DRBD-user] bonding more than two network cards still a bad idea?

J. Ryan Earl oss at jryanearl.us
Fri Oct 1 04:21:47 CEST 2010

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Thu, Sep 30, 2010 at 11:22 AM, Bart Coninckx <bart.coninckx at telenet.be>
 wrote:

> Hi all,
>
> I remember doing some research about bonding more than two network cards
> and having found that Linbit had shown that this does not really improve
> performance because of the TCP reordering.
>
> I was just wondering if this is still the case, given more recent
> developments with newer hardware.
>
> I just saw bonnie++ hit more than 250 MB/sec while my bonded gigabit gives
> me about 160 MB/sec with a 30% TCP header penalty, so looking into this is
> useful.
>
> If not, I will be looking at 10Gb cards I guess ...
>

Hi there,

So I'll answer your direct question, and then I'll answer the question I
think you really want answered--what's the best interconnect for DRBD--as
I've been doing a lot of testing in that area:

Results of bonding many GigE connections probably depend on your hardware.
Look for an RDMA-based solution instead of hardware that requires interrupts
for passing network traffic (igb.ko-based Intel NICs, bnx2.ko-based Broadcom
NICs, etc.).  You want to tune your TCP windows higher.  There's supposed to
be a provision in NAPI now that lets the bonding driver tell the NIC not to
coalesce segments in hardware and to do TCP segment coalescing at a higher
layer, which avoids the reordering problem entirely.  I found that whatever
tuning I was already doing on the network stack was enough to give me
line-speed performance on a BCM5709C (a common onboard chip) with the bnx2
driver and a dual-port setup: sustained 1.95 Gbit/s, which resulted in
234 MB/sec actual write throughput through DRBD.
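
For reference, the tuning I'm talking about looks roughly like the below; the
sysctls and ethtool offload switches are standard, but the interface names and
exact values are just placeholders to adapt to your own hardware:

# Raise the TCP window / socket buffer ceilings (example values)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Turn off hardware segment coalescing on the bond slaves so reassembly
# happens above the bonding driver (eth2/eth3 are hypothetical slaves)
ethtool -K eth2 gro off
ethtool -K eth2 lro off
ethtool -K eth3 gro off
ethtool -K eth3 lro off

# balance-rr bonding, e.g. in ifcfg-bond0 on RHEL/CentOS:
#   BONDING_OPTS="mode=balance-rr miimon=100"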

I didn't try a three-port bond because that link is just my backup.

Part of the problem with 10Gbit Ethernet is that it can have higher latency
than regular GigE; a well-implemented 10GbE stack and a well-implemented GigE
stack have about the same latency.

I've been working on creating a respectably high-performance DRBD setup.
I've tested Dolphin Express (DX) with SuperSockets, QDR InfiniBand, and
10GbE (VPI Ethernet mode).  Dolphin Express, I think, is great if it's
compatible with your server; for whatever reason it was not compatible with
my intended production gear.  Its primary advantages are transparent
fail-over to Ethernet, transparently redundant links when put back-to-back
(i.e. 2 links merged into 1 fabric), and a well-optimized software stack in
general.  It is also well supported by DRBD.  If you don't need more than
500-600 MB/sec sustained write throughput, I think Dolphin is great.

QDR (40Gbit) InfiniBand's primary advantages are raw performance,
flexibility, wide-spread adoption, and sheer scalability.  It is more
enterprise-ready and may be better supported on your server hardware.  On
the downside, it doesn't have transparent fail-over of any sort in a
back-to-back configuration; it can neither fail over transparently between
IB ports nor to a backup Ethernet interconnect.  IB ports bond together only
in active-passive mode, and only if they are on the same fabric.  In a
back-to-back configuration each connected port-pair is a separate fabric, so
ib-bonding doesn't work back-to-back as there are 2 fabrics in play.

Anyway, here are some numbers.  Unfortunately, I don't have any pure
throughput numbers from within kernel space, which is what matters to DRBD.
Interestingly enough, kernel-space socket performance can differ quite a lot
from user-space socket performance.

Userspace netperf:

2-GigE bonded balance-rr
[root@node02 ~]# netperf -f g -H 192.168.90.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.04         1.95   0.75     0.85     0.757   0.855


Dolphin Express (single 10Gbit DX port) SuperSockets = RDMA:
[root@node02 network-scripts]# LD_PRELOAD="libksupersockets.so" netperf -f g -H 192.168.90.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.90.1 (192.168.90.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

129024  65536  65536    10.01         6.53   1.48     1.46     0.444   0.439


QDR IB (1 port) IPoIB
[root@node02 log]# netperf -f g -H 192.168.20.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.00        16.15   1.74     4.61     0.211   0.562

QDR IB (1 port) SDP (Sockets Direct Protocol = RDMA)
[root@node02 log]# LD_PRELOAD="libsdp.so" netperf -f g -H 192.168.20.1 -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.20.1 (192.168.20.1) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^9bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    10.01        24.67   3.18     3.28     0.253   0.262


Userspace SDP above does the best at 24.67 Gbit/s; IPoIB is slower at 16.15
Gbit/s.  However, I have not been able to get DRBD + SDP to perform anywhere
near as well as DRBD + IPoIB, which is interesting.  DRBD + SDP maxes out
around 400-450 MB/sec write and resync speed.  With IPoIB I'm getting
sustained writes at 720 MB/sec; interestingly, resync speed was "only"
620 MB/sec.

# time dd if=/dev/zero bs=2048M count=20 of=/dev/drbd0
0+20 records in
0+20 records out
42949591040 bytes (43 GB) copied, 59.4198 seconds, 723 MB/s

At this point, single-thread performance is my bottleneck.  The above is with
Xeon X5650s, but I expect the X5680s in the production gear will do DRBD
writes >800 MB/sec.  My backing storage is capable of 900 MB/s throughput, so
I think I could reasonably get about 90% of that through DRBD.
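
If you want to gauge your own backing-storage ceiling before layering DRBD on
top, a direct sequential write against the raw device is enough (destructive,
and the device name here is just a placeholder):

# destroys data on /dev/sdb (placeholder backing device)
dd if=/dev/zero of=/dev/sdb bs=1M count=32768 oflag=direct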

The IB HCAs I'm using support VPI (Virtual Protocol Interface), which means
they can be put into different encapsulation modes, i.e. InfiniBand or 10GbE.
Running in 10GbE mode, my write throughput was in the 300-400 MB/sec range,
and resync speed was about the same.  Running the adapter in IB mode with
IP-over-InfiniBand (IPoIB) gave a substantial increase in performance at the
cost of running an opensmd instance.  Dolphin DX with SuperSockets
outperforms raw 10GbE as well.
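
For completeness, the IPoIB side is just the standard OFED pieces; the
interface name ib0 and the address are assumptions from my setup:

# one subnet manager per fabric (back-to-back means one per port pair)
service opensmd start

# connected mode allows a large MTU, which helps bulk throughput
echo connected > /sys/class/net/ib0/mode
ifconfig ib0 192.168.20.1 netmask 255.255.255.0 mtu 65520 up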

What kind of write throughput are you looking for?

-JR