[DRBD-user] DRBD serious locking due to TOE

Fri Dec 14 11:17:34 CET 2007

Dear DRBD,

I have a repeatable problem with DRBD 8.2.1 where it locks up and
replication throughput falls by several orders of magnitude.  This is the
same as the problem reported by Ben Lavender on 2007-08-29.

Ben identified the cause as the TOE (TCP Offload Engine) on his Dell
network card.  Our HP network cards (NetXtreme II BCM5708 1000Base-SX)
use the same Broadcom chipset, but unlike the Dell card, the HP card
provides no mechanism to disable TOE, or at least no published
mechanism in the BIOS or available to Linux, and no jumpers on the PCB.

The problem occurs under heavy load.  The NIC's ability to handle TCP
packets falls to about a tenth of its normal rate, which is
100MB/sec on our set-up.  This renders DRBD and our MySQL
database unusable for a few minutes.

I would like to ask whether anything can be done in DRBD to work
around this problem, for instance using UDP instead of TCP, or whether
any member knows of a bug-fix for TOE?

If not, we will have to replace our NICs, which is
really not something we want to do, since all HP NICs for HP servers
seem to have the same chipset.

Any advice would be extremely welcome!

Regards,

Ben Clewett.

System details:

SUSE 10.2, Linux 2.6.18.2, 64-bit quad-processor, 10GB memory on HP
ProLiant DL380 G5.  Disk IO rate ~ 250 MB/sec, replication on dedicated
1000Base-SX.

version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root@hp-tm-02, 2007-12-10 22:21:14
  0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
     ns:1053120 nr:28862256 dw:28864144 dr:1065701 al:46 bm:392 lo:1 pe:0 ua:1 ap:0
         resync: used:0/31 hits:65607 misses:217 starving:0 dirty:0 changed:217
         act_log: used:1/257 hits:427 misses:46 starving:0 dirty:0 changed:46
  1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
     ns:41783720 nr:1053224 dw:42836948 dr:7104597 al:144980 bm:422 lo:0 pe:0 ua:0 ap:0
         resync: used:0/31 hits:65595 misses:213 starving:0 dirty:0 changed:213
         act_log: used:1/257 hits:10300951 misses:145315 starving:0 dirty:335 changed:144980
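
One way to pin down exactly when the stall happens would be to sample the
ns (network send) counter from /proc/drbd at intervals and turn it into a
send rate. The following is only a minimal sketch and is not part of DRBD:
the helper names parse_counters and send_rate_kib are hypothetical, and it
assumes ns is reported in KiB, which is my understanding of the DRBD 8
/proc/drbd format.

```python
import re

# Matches two-letter "key:integer" pairs such as "ns:1053120".
# Longer keys like "hits:" or "dirty:" are deliberately not matched.
_COUNTER = re.compile(r"\b([a-z]{2}):(\d+)\b")
# Matches the first line of a device entry, e.g. " 1: cs:Connected ...".
_DEVICE = re.compile(r"\s*(\d+):\s+cs:")


def parse_counters(text):
    """Return {minor: {counter: value}} from /proc/drbd-style text.

    Hypothetical helper: assumes each device starts with a line like
    " 0: cs:Connected ..." followed by lines of two-letter counters.
    """
    devices = {}
    current = None
    for line in text.splitlines():
        header = _DEVICE.match(line)
        if header:
            current = int(header.group(1))
            devices[current] = {}
        elif current is not None:
            for key, value in _COUNTER.findall(line):
                devices[current][key] = int(value)
    return devices


def send_rate_kib(before, after, seconds, minor):
    """KiB/s sent over the network for one minor between two samples.

    Assumption: the "ns" counter is in KiB.
    """
    return (after[minor]["ns"] - before[minor]["ns"]) / seconds


# Device 1 as shown in the status output above.
SAMPLE = """\
  1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
     ns:41783720 nr:1053224 dw:42836948 dr:7104597 al:144980 bm:422 lo:0 pe:0 ua:0 ap:0
         resync: used:0/31 hits:65595 misses:213 starving:0 dirty:0 changed:213
"""

counters = parse_counters(SAMPLE)
print(counters[1]["ns"], counters[1]["pe"])
```

In practice one would read /proc/drbd twice, a few seconds apart, and pass
both parsed samples to send_rate_kib; a sudden drop in the result together
with a rising pe (pending) count would match the symptom described above.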

global {
     # minor-count 64;
     # dialog-refresh 5; # 5 seconds
     # disable-ip-verification;
     usage-count no;
}

common {

   handlers {
     pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
     pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
     local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
     outdate-peer "/usr/sbin/drbd-peer-outdater";
   }

   startup {
     # Default is 0, which means unlimited. Unit is seconds.
     # wfc-timeout  0;

     # Wait for connection timeout if this node was a degraded cluster.
     degr-wfc-timeout 120;    # 2 minutes.
   }

   disk {
     on-io-error   detach;
     # fencing resource-only;
   }

   net {
     # this is the size of the tcp socket send buffer
     # increase it _carefully_ if you want to use protocol A over a
     # high latency network with reasonable write throughput.
     # defaults to 2*65535; you might try even 1M, but if your kernel or
     # network driver chokes on that, you have been warned.
     # sndbuf-size 512k;

     # timeout       60;    #  6 seconds  (unit = 0.1 seconds)
     # connect-int   10;    # 10 seconds  (unit = 1 second)
     # ping-int      10;    # 10 seconds  (unit = 1 second)
     # ping-timeout   5;    # 500 ms (unit = 0.1 seconds)

     # Maximal number of requests (4K) to be allocated by DRBD.
     # The minimum is hardcoded to 32 (=128 kByte).
     # For high performance installations it might help if you
     # increase that number. These buffers are used to hold
     # datablocks while they are written to disk.
     #
     max-buffers     40000;

     # When the number of outstanding requests on a standby (secondary)
     # node exceeds bdev-threshold, we start to kick the backing device
     # to start its request processing. This is an advanced tuning
     # parameter to get more performance out of capable storage controllers.
     # Some controllers like to be kicked often, other controllers
     # deliver better performance when they are kicked less frequently.
     # Set it to the value of max-buffers to get the least possible
     # number of run_task_queue_disk() / q->unplug_fn(q) calls.
     #
     # unplug-watermark   128;

     # The highest number of data blocks between two write barriers.
     # If you set this < 10 you might decrease your performance.
     # max-epoch-size  2048;

     # if some block send times out this many times, the peer is
     # considered dead, even if it still answers ping requests.
     # ko-count 4;

     # If you want to use OCFS2/openGFS on top of DRBD enable
     # this option, and only enable it if you are going to use
     # one of these filesystems. Do not enable it for ext2,
     # ext3,reiserFS,XFS,JFS etc...
     # allow-two-primaries;

     # This enables peer authentication. Without this everybody
     # on the network could connect to one of your DRBD nodes with
     # a program that emulates DRBD's protocol and could suck off
     # all your data.
     # Specify one of the kernel's digest algorithms, e.g.:
     # md5, sha1, sha256, sha512, wp256, wp384, wp512, michael_mic ...
     # and a shared secret.
     # Authentication is only done once after the TCP connection
     # is established; there are no disadvantages from using authentication,
     # therefore I suggest to enable it in any case.
     # cram-hmac-alg "sha1";
     # shared-secret "FooFunFactory";

     # In case the nodes of your cluster see each other again after
     # a split brain situation in which both nodes were primary
     # at the same time, you have two diverged versions of your data.
     #
     # In case both nodes are secondary you can control DRBD's
     # auto recovery strategy by the "after-sb-0pri" options. The
     # default is to disconnect.
     #    "disconnect" ... No automatic resynchronisation, simply disconnect.
     #    "discard-younger-primary"
     #                     Auto sync from the node that was primary before
     #                     the split brain situation happened.
     #    "discard-older-primary"
     #                     Auto sync from the node that became primary
     #                     as second during the split brain situation.
     #    "discard-least-changes"
     #                     Auto sync from the node that touched more
     #                     blocks during the split brain situation.
     #    "discard-node-NODENAME"
     #                     Auto sync _to_ the named node.
     after-sb-0pri disconnect;

     # If one of the nodes is already primary, then the auto-recovery
     # strategy is controlled by the "after-sb-1pri" options.
     #    "disconnect" ... always disconnect
     #    "consensus"  ... discard the version of the secondary if the outcome
     #                     of the "after-sb-0pri" algorithm would also destroy
     #                     the current secondary's data. Otherwise disconnect.
     #    "violently-as0p" Always take the decision of the "after-sb-0pri"
     #                     algorithm, even if that causes an erratic change
     #                     of the primary's view of the data.
     #                     This is only useful if you use a single-node FS (i.e.
     #                     not OCFS2 or GFS) with the allow-two-primaries
     #                     flag, _AND_ you really know what you are doing.
     #                     This is DANGEROUS and MAY CRASH YOUR MACHINE if you
     #                     have a FS mounted on the primary node.
     #    "discard-secondary"
     #                     discard the version of the secondary.
     #    "call-pri-lost-after-sb"  Always honour the outcome of the "after-sb-0pri"
     #                     algorithm. In case it decides that the current
     #                     secondary has the right data, it panics the
     #                     current primary.
     #    "suspend-primary" ???
     after-sb-1pri disconnect;

     # In case both nodes are primary you control DRBD's strategy by
     # the "after-sb-2pri" option.
     #    "disconnect" ... Go to StandAlone mode on both sides.
     #    "violently-as0p" Always take the decision of the "after-sb-0pri"
     #                     algorithm.
     #    "call-pri-lost-after-sb" ... Honor the outcome of the "after-sb-0pri"
     #                     algorithm and panic the other node.

     after-sb-2pri disconnect;

     # To solve the cases when the outcome of the resync decisions is
     # incompatible with the current role assignment in the cluster.
     #    "disconnect" ... No automatic resynchronisation, simply disconnect.
     #    "violently" .... Sync to the primary node is allowed, violating the
     #                     assumption that data on a block device is stable
     #                     for one of the nodes. DANGEROUS, DO NOT USE.
     #    "call-pri-lost"  Call the "pri-lost" helper program on one of the
     #                     machines. This program is expected to reboot the
     #                     machine. (I.e. make it secondary.)
     rr-conflict disconnect;

     # DRBD-0.7's behaviour is equivalent to
     #   after-sb-0pri discard-younger-primary;
     #   after-sb-1pri consensus;
     #   after-sb-2pri disconnect;
   }

   syncer {
     # Limit the bandwidth used by the resynchronisation process.
     # default unit is kByte/sec; optional suffixes K,M,G are allowed.
     #
     # Even though this is a network setting, the units are based
     # on _byte_ (octet for our french friends) not bit.
     # We are storage guys.
     #
     # Note that on 100Mbit ethernet, you cannot expect more than
     # 12.5 MByte total transfer rate.
     # Consider using GigaBit Ethernet.
     #
     rate 100M;

     # Configures the size of the active set. Each extent is 4M,
     # 257 Extents ~> 1GB active set size. In case your syncer
     # runs @ 10MB/sec, all resync after a primary's crash will last
     # 1GB / ( 10MB/sec ) ~ 102 seconds ~ One Minute and 42 Seconds.
     # BTW, the hash algorithm works best if the number of al-extents
     # is prime. (To test the worst case performance use a power of 2)
     al-extents 257;
   }

}

resource dbms-07-01 {

   protocol B;

   on hp-tm-02 {
     device     /dev/drbd0;
     disk       /dev/cciss/c0d0p5;
     address    192.168.95.5:7788;
     meta-disk  /dev/cciss/c0d0p3[0];
   }

   on hp-tm-04 {
     device    /dev/drbd0;
     disk      /dev/cciss/c0d0p5;
     address   192.168.95.6:7789;
     meta-disk /dev/cciss/c0d0p3[1];
   }
}

resource dbms-07-02 {

   protocol B;

   on hp-tm-02 {
     device     /dev/drbd1;
     disk       /dev/cciss/c0d0p6;
     address    192.168.95.5:7789;
     meta-disk  /dev/cciss/c0d0p3[1];
   }

   on hp-tm-04 {
     device    /dev/drbd1;
     disk      /dev/cciss/c0d0p6;
     address   192.168.95.6:7788;
     meta-disk /dev/cciss/c0d0p3[0];
   }
}
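
The al-extents comment in the syncer section above can be checked with a
little arithmetic. The sketch below is illustrative only: the function
name is hypothetical, it takes the 4M extent size from the config comment
and treats M as MiB, and it assumes that a resync after a primary crash
covers the whole active set at a constant rate.

```python
# Extent size stated in the drbd.conf comment ("Each extent is 4M").
EXTENT_MIB = 4


def worst_case_resync_seconds(al_extents, rate_mib_per_sec):
    """Time to resync the whole active set at a constant rate.

    Assumption: every activity-log extent must be resynced after a
    primary crash, and the syncer sustains rate_mib_per_sec throughout.
    """
    active_set_mib = al_extents * EXTENT_MIB
    return active_set_mib / rate_mib_per_sec


# The comment's own example: 257 extents at 10MB/sec.
print(worst_case_resync_seconds(257, 10))
# The value configured above: "rate 100M".
print(worst_case_resync_seconds(257, 100))
```

For 257 extents this gives roughly 103 seconds at 10 MiB/s, in line with
the ~102 seconds quoted in the comment, and roughly 10 seconds at the
configured rate of 100M, provided the link can actually sustain that rate.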
