Dear DRBD,

I have a repeatable problem with DRBD 8.2.1 where it locks up and replication throughput falls by several orders of magnitude. This is the same problem reported by Ben Lavender on 2007-08-29. Ben identified the cause as the TOE protocol on his Dell network card. Our HP network cards (NetXtreme II BCM5708 1000Base-SX) use the same Broadcom chipset, but unlike the Dell card, the HP card provides no mechanism to disable TOE. Or at least no published mechanism in the BIOS or available to Linux, and no jumpers on the PCB.

The problem occurs under heavy load. The NIC's ability to handle TCP packets falls to about a tenth of its normal rate, which is normally 100MB/sec on our set-up, rendering DRBD and our MySQL database unusable for a few minutes.

I would like to ask whether anything can be done in DRBD to get around this problem, for instance using UDP instead of TCP, or whether any member knows of a bug-fix for TOE. If not, we will have to replace our NICs, which is really not something we want to do, since all HP NICs for HP servers seem to have the same chipset.

Any advice would be extremely welcome!

Regards,

Ben Clewett.

System details: SUSE 10.2, Linux 2.6.18.2, 64-bit quad-processor, 10GB memory on HP ProLiant DL380 G5. Disk I/O rate ~250 MB/sec, replication on dedicated 1000Base-SX.
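For reference, the offload features that the kernel does expose can be inspected and toggled with ethtool. This is only a sketch: the full TOE engine is not controlled through ethtool, and whether disabling the related offloads avoids the stall depends on the bnx2 driver and firmware. "eth1" below is a placeholder for the replication interface.

```shell
# Show which offload features the driver currently has enabled
ethtool -k eth1

# Tentatively disable the offloads ethtool can reach; these are
# related to, but not the same as, the full TOE engine
ethtool -K eth1 tso off   # TCP segmentation offload
ethtool -K eth1 tx off    # TX checksum offload
ethtool -K eth1 rx off    # RX checksum offload
ethtool -K eth1 sg off    # scatter-gather
```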
version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root@hp-tm-02, 2007-12-10 22:21:14
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
    ns:1053120 nr:28862256 dw:28864144 dr:1065701 al:46 bm:392 lo:1 pe:0 ua:1 ap:0
        resync: used:0/31 hits:65607 misses:217 starving:0 dirty:0 changed:217
        act_log: used:1/257 hits:427 misses:46 starving:0 dirty:0 changed:46
 1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
    ns:41783720 nr:1053224 dw:42836948 dr:7104597 al:144980 bm:422 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:65595 misses:213 starving:0 dirty:0 changed:213
        act_log: used:1/257 hits:10300951 misses:145315 starving:0 dirty:335 changed:144980

global {
    # minor-count 64;
    # dialog-refresh 5;  # 5 seconds
    # disable-ip-verification;
    usage-count no;
}

common {
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer      "/usr/sbin/drbd-peer-outdater";
  }

  startup {
    # Default is 0, which means unlimited. Unit is seconds.
    # wfc-timeout 0;

    # Wait-for-connection timeout if this node was a degraded cluster.
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    on-io-error detach;
    # fencing resource-only;
  }

  net {
    # This is the size of the TCP socket send buffer.
    # Increase it _carefully_ if you want to use protocol A over a
    # high-latency network with reasonable write throughput.
    # Defaults to 2*65535; you might try even 1M, but if your kernel or
    # network driver chokes on that, you have been warned.
    # sndbuf-size 512k;

    # timeout       60;  #  6 seconds  (unit = 0.1 seconds)
    # connect-int   10;  # 10 seconds  (unit = 1 second)
    # ping-int      10;  # 10 seconds  (unit = 1 second)
    # ping-timeout   5;  # 500 ms      (unit = 0.1 seconds)

    # Maximal number of requests (4K) to be allocated by DRBD.
    # The minimum is hardcoded to 32 (=128 kByte).
    # For high-performance installations it might help if you
    # increase that number. These buffers are used to hold
    # data blocks while they are written to disk.
    # max-buffers 40000;

    # When the number of outstanding requests on a standby (secondary)
    # node exceeds bdev-threshold, we start to kick the backing device
    # to start its request processing. This is an advanced tuning
    # parameter to get more performance out of capable storage controllers.
    # Some controllers like to be kicked often, other controllers
    # deliver better performance when they are kicked less frequently.
    # Set it to the value of max-buffers to get the least possible
    # number of run_task_queue_disk() / q->unplug_fn(q) calls.
    #
    # unplug-watermark 128;

    # The highest number of data blocks between two write barriers.
    # If you set this < 10 you might decrease your performance.
    # max-epoch-size 2048;

    # If some block send times out this many times, the peer is
    # considered dead, even if it still answers ping requests.
    # ko-count 4;

    # If you want to use OCFS2/openGFS on top of DRBD, enable
    # this option, and only enable it if you are going to use
    # one of these filesystems. Do not enable it for ext2,
    # ext3, ReiserFS, XFS, JFS etc.
    # allow-two-primaries;

    # This enables peer authentication. Without this, everybody
    # on the network could connect to one of your DRBD nodes with
    # a program that emulates DRBD's protocol and could suck off
    # all your data.
    # Specify one of the kernel's digest algorithms, e.g.:
    # md5, sha1, sha256, sha512, wp256, wp384, wp512, michael_mic ...
    # and a shared secret.
    # Authentication is only done once after the TCP connection
    # is established; there are no disadvantages from using authentication,
    # therefore I suggest enabling it in any case.
    # cram-hmac-alg "sha1";
    # shared-secret "FooFunFactory";

    # In case the nodes of your cluster see each other again after
    # a split-brain situation in which both nodes were primary
    # at the same time, you have two diverged versions of your data.
    #
    # In case both nodes are secondary you can control DRBD's
    # auto-recovery strategy with the "after-sb-0pri" option. The
    # default is to disconnect.
    # "disconnect" ... No automatic resynchronisation, simply disconnect.
    # "discard-younger-primary"
    #                  Auto sync from the node that was primary before
    #                  the split-brain situation happened.
    # "discard-older-primary"
    #                  Auto sync from the node that became primary
    #                  second during the split-brain situation.
    # "discard-least-changes"
    #                  Auto sync from the node that touched more
    #                  blocks during the split-brain situation.
    # "discard-node-NODENAME"
    #                  Auto sync _to_ the named node.
    after-sb-0pri disconnect;

    # If one of the nodes is already primary, then the auto-recovery
    # strategy is controlled by the "after-sb-1pri" option.
    # "disconnect" ... Always disconnect.
    # "consensus" .... Discard the version of the secondary if the outcome
    #                  of the "after-sb-0pri" algorithm would also destroy
    #                  the current secondary's data. Otherwise disconnect.
    # "violently-as0p" Always take the decision of the "after-sb-0pri"
    #                  algorithm, even if that causes an erratic change
    #                  of the primary's view of the data.
    #                  This is only useful if you use a one-node FS (i.e.
    #                  not OCFS2 or GFS) with the allow-two-primaries
    #                  flag, _AND_ you really know what you are doing.
    #                  This is DANGEROUS and MAY CRASH YOUR MACHINE if you
    #                  have an FS mounted on the primary node.
    # "discard-secondary"
    #                  Discard the version of the secondary.
    # "call-pri-lost-after-sb"
    #                  Always honour the outcome of the "after-sb-0pri"
    #                  algorithm. In case it decides that the current
    #                  secondary has the right data, it panics the
    #                  current primary.
    # "suspend-primary" ???
    after-sb-1pri disconnect;

    # In case both nodes are primary you control DRBD's strategy by
    # the "after-sb-2pri" option.
    # "disconnect" ... Go to StandAlone mode on both sides.
    # "violently-as0p" Always take the decision of the "after-sb-0pri".
    # "call-pri-lost-after-sb" ... Honour the outcome of the "after-sb-0pri"
    #                  algorithm and panic the other node.
    after-sb-2pri disconnect;

    # To resolve the cases where the outcome of the resync decision is
    # incompatible with the current role assignment in the cluster:
    # "disconnect" ... No automatic resynchronisation, simply disconnect.
    # "violently" .... Sync to the primary node is allowed, violating the
    #                  assumption that data on a block device is stable
    #                  for one of the nodes. DANGEROUS, DO NOT USE.
    # "call-pri-lost" Call the "pri-lost" helper program on one of the
    #                  machines. This program is expected to reboot the
    #                  machine (i.e. make it secondary).
    rr-conflict disconnect;

    # DRBD 0.7's behaviour is equivalent to
    # after-sb-0pri discard-younger-primary;
    # after-sb-1pri consensus;
    # after-sb-2pri disconnect;
  }

  syncer {
    # Limit the bandwidth used by the resynchronisation process.
    # Default unit is kByte/sec; optional suffixes K, M, G are allowed.
    #
    # Even though this is a network setting, the units are based
    # on _bytes_ (octets for our French friends), not bits.
    # We are storage guys.
    #
    # Note that on 100Mbit Ethernet you cannot expect more than
    # 12.5 MByte total transfer rate.
    # Consider using Gigabit Ethernet.
    # rate 100M;

    # Configures the size of the active set. Each extent is 4M;
    # 257 extents ~> 1GB active set size. In case your syncer
    # runs @ 10MB/sec, all resync after a primary's crash will last
    # 1GB / (10MB/sec) ~ 102 seconds ~ one minute and 42 seconds.
    # BTW, the hash algorithm works best if the number of al-extents
    # is prime. (To test the worst-case performance, use a power of 2.)
    al-extents 257;
  }
}

resource dbms-07-01 {
  protocol B;
  on hp-tm-02 {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p5;
    address    192.168.95.5:7788;
    meta-disk  /dev/cciss/c0d0p3[0];
  }
  on hp-tm-04 {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p5;
    address    192.168.95.6:7789;
    meta-disk  /dev/cciss/c0d0p3[1];
  }
}

resource dbms-07-02 {
  protocol B;
  on hp-tm-02 {
    device     /dev/drbd1;
    disk       /dev/cciss/c0d0p6;
    address    192.168.95.5:7789;
    meta-disk  /dev/cciss/c0d0p3[1];
  }
  on hp-tm-04 {
    device     /dev/drbd1;
    disk       /dev/cciss/c0d0p6;
    address    192.168.95.6:7788;
    meta-disk  /dev/cciss/c0d0p3[0];
  }
}
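The resync-time estimate in the syncer comments (1GB active set at 10MB/sec ~ 102 seconds) can be reproduced with simple shell arithmetic; the numbers below come straight from those comments (257 extents of 4MB each, a 10MB/sec syncer rate):

```shell
# Active set: 257 extents of 4 MB each
extents=257
extent_mb=4
rate_mb_per_s=10

active_set_mb=$((extents * extent_mb))        # 1028 MB, roughly 1GB
resync_s=$((active_set_mb / rate_mb_per_s))   # worst-case resync time
echo "worst-case resync after primary crash: ${resync_s}s"
```

So enlarging al-extents trades a longer worst-case resync after a primary crash for fewer activity-log updates during normal operation.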