Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi everyone,

I have a DRBD performance problem that has me completely confused, and I'm hoping someone can help, since my other servers that use the same type of RAID card and DRBD don't have this problem.

For the hardware, I have two Dell R515 servers with the H700 controller (basically an LSI MegaRAID based card), running SLES 11 SP1. The problem shows up on DRBD 8.3.11, 8.3.12, and 8.4.1; I haven't tested other versions. Here is the simple config I made, based on the servers that don't have any issues:

global {
    # We don't want to be bothered by the usage count numbers
    usage-count no;
}

common {
    protocol C;
    net {
        cram-hmac-alg md5;
        shared-secret "P4ss";
    }
}

resource r0 {
    on san1 {
        device  /dev/drbd0;
        disk    /dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part2;
        address 10.60.60.1:63000;
        flexible-meta-disk /dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part1;
    }
    on san2 {
        device  /dev/drbd0;
        disk    /dev/disk/by-id/scsi-36782bcb0698b6e00167bb1d107a77a47-part2;
        address 10.60.60.2:63000;
        flexible-meta-disk /dev/disk/by-id/scsi-36782bcb0698b6e00167bb1d107a77a47-part1;
    }
    startup {
        wfc-timeout 5;
    }
    syncer {
        rate 50M;
        cpu-mask 4;
    }
    disk {
        on-io-error detach;
        no-disk-barrier;
        no-disk-flushes;
        no-disk-drain;
        no-md-flushes;
    }
}

Current status from /proc/drbd:

version: 8.3.11 (api:88/proto:86-96)
GIT-hash: 0de839cee13a4160eed6037c4bddd066645e23c5 build by phil@fat-tyre, 2011-06-29 11:37:11
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----s
    ns:0 nr:0 dw:8501248 dr:551 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:n oos:3397375600

Even running with just one server and no replication, the performance hit with DRBD is huge. The backing device shows a throughput of:

----
san1:~ # dd if=/dev/zero of=/dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part2 bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 16.4434 s, 1.0 GB/s
----

while the same write through the DRBD device gets:

----
san1:~ # dd if=/dev/zero of=/dev/drbd/by-res/r0 bs=1M count=16384
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB) copied, 93.457 s, 184 MB/s
----

Using iostat I can see part of the problem:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.08    0.00   16.76    0.00    0.00   83.17

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               0.00         0.00         0.00          0          0
sdb           20565.00         0.00       360.00          0        719
drbd0        737449.50         0.00       360.08          0        720

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00   28.87    1.37    0.00   69.69

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda               1.50         0.00         0.01          0          0
sdb           57859.50         0.00       177.22          0        354
drbd0        362787.00         0.00       177.14          0        354

The DRBD device is showing roughly 10x to 20x the TPS of the backing store for the same write rate. When I run the same test on my other servers I don't see anything like this, and those working servers run the same kernel and DRBD versions.

Does anyone have any ideas about how this might be resolved? I'm at a loss right now.
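P.S. Since the iostat numbers suggest DRBD is submitting far smaller requests than the backing device ends up writing, I can also collect the average request size and the queue limits on both devices if that would help. A sketch of the commands I'd use (device names taken from above; the sysfs paths are assumed to exist on this kernel, I haven't verified them on these boxes yet):

----
# per-device average request size (avgrq-sz column, in 512-byte sectors)
iostat -x sdb drbd0 1

# request-size limits the kernel reports for each queue
cat /sys/block/sdb/queue/max_sectors_kb   /sys/block/sdb/queue/max_hw_sectors_kb
cat /sys/block/drbd0/queue/max_sectors_kb /sys/block/drbd0/queue/max_hw_sectors_kb
----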
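I also plan to repeat the dd runs with direct I/O to rule out page-cache effects, something like the following (same sizes as the tests above; oflag=direct assumes GNU dd):

----
san1:~ # dd if=/dev/zero of=/dev/disk/by-id/scsi-36782bcb0698b6300167badae13f2884d-part2 bs=1M count=16384 oflag=direct
san1:~ # dd if=/dev/zero of=/dev/drbd/by-res/r0 bs=1M count=16384 oflag=direct
----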
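And for completeness, these are the net-section knobs I was going to experiment with next; the values are only starting guesses on my part, not recommendations from anywhere:

----
net {
    # guesses: raise buffering above the defaults for a ~1 GB/s backing store
    max-buffers      8000;
    max-epoch-size   8000;
    unplug-watermark 16;
    sndbuf-size      512k;
}
----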