[DRBD-user] Weird DRBD performance problem

Lars Ellenberg lars.ellenberg at linbit.com
Thu Feb 2 15:11:36 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Wed, Feb 01, 2012 at 06:04:18PM -0700, Roof, Morey R. wrote:
> Hi Everyone,
>  
> I have a DRBD performance problem that has got me completely confused.
> I'm hoping that someone can help with this one, as my other servers that
> use the same type of RAID card and DRBD don't have this problem.
>  
> For the hardware, I have two Dell R515 servers with the H700 card
> (basically an LSI MegaRAID-based card), running SLES 11 SP1.  This
> problem shows up on drbd 8.3.11, 8.3.12, and 8.4.1, but I haven't
> tested other versions.
>  
> Here is the simple config I made, based on the servers that don't have
> any issues:
>  
> global {
>         # We don't want to be bothered by the usage count numbers
>         usage-count no;
> }
> common {
>         protocol C;
>         net {
>                 cram-hmac-alg           md5;
>                 shared-secret           "P4ss";
>         }
> }
> resource r0 {
>         on san1 {
>                 device                  /dev/drbd0;
>                 disk                    /dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part2;
>                 address                 10.60.60.1:63000;
>                 flexible-meta-disk      /dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part1;
>         }
>         on san2 {
>                 device                  /dev/drbd0;
>                 disk                    /dev/disk/by-id/scsi-36782bcb0698b6e00167bb1d107a77a47-part2;
>                 address                 10.60.60.2:63000;
>                 flexible-meta-disk      /dev/disk/by-id/scsi-36782bcb0698b6e00167bb1d107a77a47-part1;
>         }
>         startup {
>                 wfc-timeout             5;
>         }
>         syncer {
>                 rate                    50M;
>                 cpu-mask                4;
>         }
>         disk {
>                 on-io-error             detach;
>                 no-disk-barrier;
>                 no-disk-flushes;
>                 no-disk-drain;


Will people please STOP using no-disk-drain.  On most hardware it does
not provide a measurable performance gain, but it may risk data integrity
because of a potential violation of write-after-write dependencies!
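
If your controller cache is battery or flash backed, something like the
following should be enough (just a sketch, 8.3 option names as in your
config):

        disk {
                on-io-error             detach;
                no-disk-barrier;        # only safe with a protected write cache
                no-disk-flushes;        # ditto
                # no-disk-drain intentionally omitted:
                # keeps write-after-write ordering intact
                no-md-flushes;
        }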

>                 no-md-flushes;
>         }
> }
>  
> version: 8.3.11 (api:88/proto:86-96)
> GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by phil at fat-tyre, 2011-06-29 11:37:11
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----s
>     ns:0 nr:0 dw:8501248 dr:551 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3397375600
>  
> So, when I'm running with just one server and no replication, the performance hit with DRBD is huge.  The backing device shows a throughput of:
> ----
> san1:~ # dd if=/dev/zero of=/dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part2 bs=1M count=16384

Hope you are not writing to the page cache only?
Add oflag=direct, oflag=dsync, or conv=fsync, or combinations thereof.
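
For example (same device paths as in your config, count just illustrative;
repeat against the backing device only while DRBD is not using it):

san1:~ # dd if=/dev/zero of=/dev/drbd/by-res/r0 bs=1M count=16384 oflag=direct
san1:~ # dd if=/dev/zero of=/dev/drbd/by-res/r0 bs=1M count=16384 conv=fsync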

> san1:~ # dd if=/dev/zero of=/dev/drbd/by-res/r0 bs=1M count=16384
> 16384+0 records in
> 16384+0 records out
> 17179869184 bytes (17 GB) copied, 93.457 s, 184 MB/s

See if moving the DRBD meta data to a RAID 1 helps.
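
I.e. point flexible-meta-disk at a small partition on a separate spindle
set; the device name below is only a placeholder for whatever the extra
RAID 1 shows up as on your box:

        on san1 {
                device                  /dev/drbd0;
                disk                    /dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part2;
                address                 10.60.60.1:63000;
                flexible-meta-disk      /dev/sdc1;      # placeholder: partition on a separate RAID 1
        }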

> -------
>  
> using iostat I see part of the problem:
>  
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.08    0.00   16.76    0.00    0.00   83.17
>
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda               0.00         0.00         0.00          0          0
> sdb           20565.00         0.00       360.00          0        719
> drbd0         737449.50         0.00       360.08          0        720
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            0.07    0.00   28.87    1.37    0.00   69.69
>
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda               1.50         0.00         0.01          0          0
> sdb           57859.50         0.00       177.22          0        354
> drbd0         362787.00         0.00       177.14          0        354
>  
> The drbd device is showing a TPS about 10x - 20x that of the backing store.
> When I do this on my other servers I don't see anything like it.  The
> working servers are also running the same kernel and drbd versions.

The rest of the IO stack is the same as well, including driver,
firmware, settings, health of controller cache battery?
Not implying anything, that's just something to check...
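
For a quick comparison between the good and the bad boxes, something along
these lines (standard sysfs paths; the last command assumes Dell OpenManage
is installed):

san1:~ # modinfo megaraid_sas | grep -i '^version'
san1:~ # cat /sys/block/sdb/queue/scheduler
san1:~ # grep . /sys/block/sdb/queue/max_*_kb /sys/block/sdb/queue/nr_requests
san1:~ # omreport storage battery controller=0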

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com


