[DRBD-user] Benchmark: DRBD over SDP over Infiniband on debian

Dr. Volker Jaenisch volker.jaenisch at inqbus.de
Thu Mar 1 01:18:35 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi DRBD Users!

We have been using DRBD over IP over InfiniBand for a couple of years. We love the flexibility and the design of InfiniBand,
and we regard DRBD as the very best real-time replication layer; it has never disappointed us.
Our aim is to replace our IPoIB-based DRBD setup with DRBD over SDP.

We ported the recent OFED stack 1.5.4 to Debian squeeze to have a clean testing platform for InfiniBand and, in particular, SDP.
The Debian kernel 2.6.32 was built without any InfiniBand modules. Against the Debian kernel sources we built the
OFED stack 1.5.4 with minimal adjustments to the backport patches.

The hardware basis is a 40 Gbit/s InfiniBand back-to-back link between two 24-core
AMD Supermicro servers. A dedicated RAID 10 with 4 SAS HDDs, used for testing only, does not interfere with the OS spindles.
The only coupling between the OS and the test spindles is that they share the same SAS controller.

We do not claim to be specialists on DRBD or InfiniBand. We use both technologies in depth, but we are no C hackers.
So any of the findings presented here may be based on erroneous assumptions and missing deeper knowledge.

At first we checked for the SDP network performance:

SDP runs at least a factor of three faster than IP over IB:

IP_over_IB

root@selene:~# iperf -p5555 -c 192.168.0.1 -t300
------------------------------------------------------------
Client connecting to 192.168.0.1, TCP port 5555
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 34606 connected with 192.168.0.1 port 5555
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-300.0 sec    138 GBytes  3.96 Gbits/sec


SDP:

root@selene:~# LD_PRELOAD=libsdp.so iperf -p5555 -c 192.168.0.1 -t300
------------------------------------------------------------
Client connecting to 192.168.0.1, TCP port 5555
TCP window size:   122 KByte (default)
------------------------------------------------------------
[  4] local 192.168.0.2 port 47408 connected with 192.168.0.1 port 5555
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-300.0 sec    513 GBytes  14.7 Gbits/sec
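
(For readers who have not used it: libsdp.so is purely an LD_PRELOAD interposer. It intercepts socket() and friends and creates an SDP socket instead of a TCP one, so an unmodified iperf ends up on SDP. Roughly, that amounts to the untested sketch below; the AF_INET_SDP value of 27 is the usual OFED convention, so treat the details as our assumptions, not something taken from the OFED sources.)

/* Minimal sketch: talk to the iperf server over SDP directly,
 * i.e. roughly what LD_PRELOAD=libsdp.so arranges for an unmodified client.
 * AF_INET_SDP = 27 is the OFED convention; adjust if your headers differ. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

int main(void)
{
    /* Only the socket's address family differs from plain TCP. */
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket(AF_INET_SDP)");   /* ib_sdp module not loaded? */
        return 1;
    }

    /* Addressing stays ordinary IPv4 (the IPoIB address of the peer). */
    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5555);
    inet_pton(AF_INET, "192.168.0.1", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over SDP\n";
    (void)write(fd, msg, sizeof(msg) - 1);
    close(fd);
    return 0;
}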


We benchmarked the bare local testing RAID 10 with an ext4 file system.
Four concurrent threads each manipulate a 1 GByte chunk of a 5 GByte partition:

root@helios:/mnt/test# tiotest -t4 -f1000
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        4000 MBs |   15.0 s | 265.894 MB/s |   9.0 %  | 289.1 % |
| Random Write   16 MBs |    1.0 s |  16.414 MB/s |   0.0 %  |  37.8 % |
| Read         4000 MBs |    0.5 s | 8134.463 MB/s | 256.3 %  | 1225.1 % |
| Random Read    16 MBs |    0.0 s | 6859.087 MB/s |   0.0 %  |   0.0 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.008 ms |        3.828 ms |  0.00000 |  0.00000 |
| Random Write |        0.006 ms |        0.040 ms |  0.00000 |  0.00000 |
| Read         |        0.002 ms |        0.069 ms |  0.00000 |  0.00000 |
| Random Read  |        0.002 ms |        0.007 ms |  0.00000 |  0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |        3.828 ms |  0.00000 |  0.00000 |
`--------------+-----------------+-----------------+----------+-----------'


The read values are heavily biased by I/O caching effects and will not be taken into account.
The write and random write rates are our baseline.

The DRBD config is simple:

common {
        protocol C;

        handlers {
                pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
                local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        }
}

resource r1 {

    syncer {
        rate 300M;
    }

    on helios {
        device     /dev/drbd1;
        disk       /dev/sdb1;
        meta-disk  internal;
        address    192.168.0.1:7789;
#       address    sdp 192.168.0.1:7789;
    }

    on selene {
        device     /dev/drbd1;
        disk       /dev/sdb1;
        meta-disk  internal;
        address    192.168.0.2:7789;
#       address    sdp 192.168.0.2:7789;
    }
}

As you can see, we only toggled the sdp prefix of the address parameters.

Using IP over IB we obtain the following benchmark for DRBD:

/etc/init.d/drbd start
drbdadm primary r1
mount /dev/drbd1 /mnt/test/

root@helios:/mnt/test# tiotest -t4 -f1000

Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        4000 MBs |   43.6 s |  91.823 MB/s |   2.9 %  | 257.1 % |
| Random Write   16 MBs |    0.6 s |  26.557 MB/s |   7.5 %  |  91.8 % |
| Read         4000 MBs |    0.5 s | 8004.194 MB/s | 295.4 %  | 1163.9 % |
| Random Read    16 MBs |    0.0 s | 6755.296 MB/s | 691.7 %  | 691.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.007 ms |        0.106 ms |  0.00000 |  0.00000 |
| Random Write |        0.005 ms |        0.033 ms |  0.00000 |  0.00000 |
| Read         |        0.002 ms |        0.092 ms |  0.00000 |  0.00000 |
| Random Read  |        0.002 ms |        0.007 ms |  0.00000 |  0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.004 ms |        0.106 ms |  0.00000 |  0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

The linear write rate has dropped to a third of the local baseline transfer rate.
The random write rate increased, maybe due to buffering effects.
[We can tune this write rate up to 170 MB/s using IP-over-IB connected mode, 64k jumbo frames, etc., but we want
to present rates typical for the off-the-shelf Debian user.]

Now we switched to the SDP protocol:

umount /mnt/test/
/etc/init.d/drbd stop
edit drbd.conf
/etc/init.d/drbd start
drbdadm primary r1
mount /dev/drbd1 /mnt/test/

root@helios:/mnt/test# tiotest -t4 -f1000
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        4000 MBs |  243.8 s |  16.406 MB/s |   0.2 %  |  56.7 % |
| Random Write   16 MBs |    1.5 s |  10.162 MB/s |   1.0 %  |  45.0 % |
| Read         4000 MBs |    0.6 s | 6682.091 MB/s | 226.5 %  | 1287.1 % |
| Random Read    16 MBs |    0.0 s | 6811.247 MB/s |   0.0 %  | 2790.1 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.008 ms |        8.936 ms |  0.00000 |  0.00000 |
| Random Write |        0.006 ms |        0.049 ms |  0.00000 |  0.00000 |
| Read         |        0.002 ms |        0.088 ms |  0.00000 |  0.00000 |
| Random Read  |        0.002 ms |        0.010 ms |  0.00000 |  0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.005 ms |        8.936 ms |  0.00000 |  0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

DRBD performance drops heavily: the linear write rate falls to 16 MB/s, less than a tenth of the 266 MB/s local baseline and well below the 92 MB/s achieved over IP over IB.

Other authors have noticed comparable, though not identical, behavior:

http://old.nabble.com/Re%3A-DRBD-via-%28Infiniband%29-SDP%3A-fine-tuning-%28my-experience%29-td32273255.html
http://old.nabble.com/DRBD-over-Infiniband-%28SDP%29-performance-oddity-td32298261.html

We dug deep into the matter:
* The ib_sdp kernel module delivers excellent performance through the libsdp.so userspace wrapper.
	Since the userspace wrapper does not modify any setting of the ib_sdp kernel module, we did not change the module's parameters.
* DRBD comes with the magical switch no-tcp-cork.
	We switched the SDP kernel module into debug mode and noticed that, when no-tcp-cork is set, hundreds of messages like
Feb 29 16:37:33 helios kernel: [104760.706931] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.706938] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.706942] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.706949] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.706953] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.706960] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.706964] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.706971] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.707565] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.707577] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.707583] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.707591] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.707595] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.707602] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
Feb 29 16:37:33 helios kernel: [104760.707607] sdp_setsockopt:1369 sdp_sock( 6503:0 7789:47457): sdp_setsockopt
	disappear. But it changed nothing in terms of I/O bandwidth. (See the sketch after this list for the cork/uncork pattern that we believe generates these calls.)
* We checked the parameter sndbuf-size and maximized it: no effect.
* We did not change the I/O scheduler to deadline, since the scheduler resides in another layer, independent of the network stack we are testing.
	http://www.admin-magazin.de/Online-Artikel/Technical-Review/I-O-Scheduler-und-RAID-Performance
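
A word on the corking point above, as far as we understand it: DRBD brackets its sends with setsockopt(TCP_CORK) to coalesce the small header and the payload into full segments, and no-tcp-cork suppresses exactly those calls, which matches the sdp_setsockopt flood we see in debug mode. Below is a user-space sketch of that pattern as we read it; it is our illustration, not code taken from DRBD.

/* Sketch of the cork/uncork pattern around a two-part send.
 * Our illustration of what we believe triggers the sdp_setsockopt
 * log flood on an SDP socket; not taken from the DRBD sources. */
#include <unistd.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void send_corked(int fd, const void *hdr, size_t hlen,
                 const void *payload, size_t plen)
{
    int on = 1, off = 0;

    /* Cork: queue the small header instead of sending it alone. */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

    (void)write(fd, hdr, hlen);
    (void)write(fd, payload, plen);

    /* Uncork: flush header + payload as full-sized segments. */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}

With per-request cork/uncork, every DRBD packet costs two additional setsockopt() calls on the SDP socket, which would explain the message rate in the debug log, but, as noted above, suppressing the corking changed nothing in terms of bandwidth.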

Looking at the code of the SDP wrapper, a comment caught our eye:
libsdp-1.1.108/src/port.c:

        if ((ret >= 0) && (shadow_fd != -1)) {
                if (level == SOL_SOCKET && optname == SO_KEEPALIVE &&
                        get_is_sdp_socket(shadow_fd)) {
                        level = AF_INET_SDP;
                        __sdp_log(2, "SETSOCKOPT: <%s:%d:%d> substitute level %d\n",
                                          program_invocation_short_name, fd, shadow_fd, level);
                }

                sret = _socket_funcs.setsockopt(shadow_fd, level, optname, optval, optlen);
                if (sret < 0) {
                        __sdp_log(8, "Warning sockopts:"
                                          " ignoring error on shadow SDP socket fd:<%d>\n", fd);
                        /*
                         * HACK: we should allow some errors as some sock opts are unsupported
                         * __sdp_log(8, "Error %d calling setsockopt for SDP socket, closing\n", errno);
                         * cleanup_shadow(fd);
                         */
                }
        }

        /* Due to SDP limited implmentation of sockopts we ignore some errors */
        if ((ret < 0) && get_is_sdp_socket(fd) &&
                is_filtered_unsuported_sockopt(level, optname)) {
                __sdp_log(8, "Warning sockopts: "
                                  "ignoring error on non implemented sockopt on SDP socket"
                                  " fd:<%d> level:<%d> opt:<%d>\n", fd, level, optval);
                ret = 0;
        }
Obviously the SDP wrapper ignores some errors on setsockopt() calls, since not all socket options are implemented for SDP.

Does DRBD take this into account?
The only change in the DRBD code for handling SDP is to use AF_INET_SDP instead of AF_INET when creating a socket.
We found no code in the DRBD source that treats SDP as a protocol with completely different characteristics.
The InfiniBand SDP documentation also does not require more than that to port TCP code to SDP.
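
To make the concern concrete: libsdp, as the excerpt above shows, simply downgrades failures of unsupported socket options to a warning. If DRBD on an AF_INET_SDP socket assumes that every TCP-oriented setsockopt() succeeds, or behaves exactly as it does on TCP, that assumption may silently be violated. The following is a hypothetical user-space sketch of the tolerant pattern we would expect on the caller side; the name and the behavior are our assumption, not DRBD code.

/* Hypothetical helper: apply a TCP-style socket option, but treat
 * "option not implemented by SDP" as non-fatal, mirroring what the
 * libsdp wrapper does for its shadow sockets. Not taken from DRBD. */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

int setsockopt_tolerant(int fd, int level, int optname,
                        const void *optval, socklen_t optlen)
{
    if (setsockopt(fd, level, optname, optval, optlen) == 0)
        return 0;

    /* SDP implements only a subset of the TCP socket options. */
    if (errno == ENOPROTOOPT || errno == EOPNOTSUPP) {
        fprintf(stderr, "setsockopt(level=%d, opt=%d) not supported "
                        "on this (SDP?) socket, ignoring\n", level, optname);
        return 0;
    }
    return -1;   /* a real error, let the caller handle it */
}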


Now to our questions:
* Are we completely useless idiots? Right now it feels like it.
* Have we done something obviously wrong?
If not:
* Where can we dig deeper?
* Who can point us in the right direction?
* How can we test or inspect the DRBD code?

We invite all members of the DRBD community to use our nice testing setup
together with us, overcome the SDP problems, and make the world a better
place with InfiniBand.

Cheers,

Volker

-- 

====================================================
   inqbus GmbH & Co. KG      +49 ( 341 ) 60013031
   Dr.  Volker Jaenisch      http://www.inqbus.de
   Karl-Heine-Str.   99      0 4 2 2 9    Leipzig
   N  O  T -  F Ä L L E      +49 ( 170 )  3113748
====================================================



