Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Problem description:
I'm testing drbd in combination w/ heartbeat, it's a fresh new
installation, I also re-installed, drdb, recreated the file systems,
verified & efsck'd. I'm still seeing these (none-fatal, I have set
passing on) i/o read errors when the drbd device is made primary,
secondary and primary again, the file system got mounted and then I do a
simple "chown -R mysql:mysql /drbd0/mysql"
I' getting these errors in /var/log/messages:
Dec 31 10:44:04 ndb1-test kernel: block drbd0: role( Primary ->
Secondary )
Dec 31 10:44:04 ndb1-test kernel: block drbd0: role( Secondary ->
Primary )
Dec 31 10:44:04 ndb1-test kernel: kjournald starting. Commit interval 5
seconds
Dec 31 10:44:04 ndb1-test kernel: EXT3 FS on drbd0, internal journal
Dec 31 10:44:04 ndb1-test kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Dec 31 10:44:04 ndb1-test kernel: block drbd0: p read: error=-11
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed
sec=11238600s size=4096
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local IO failed in
__req_mod.Passing error on...
Dec 31 10:44:04 ndb1-test kernel: block drbd0: p read: error=-11
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed
sec=11933784s size=4096
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local IO failed in
__req_mod.Passing error on...
Dec 31 10:44:04 ndb1-test kernel: block drbd0: p read: error=-11
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed
sec=12363408s size=4096
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local IO failed in
__req_mod.Passing error on...
My local device's /sda, sdb->lvm are NOT failing (all other stuff works,
no scsi raid errors whatsoever, fsck says all is okay).
I isolated this reproduction scenario: (using heartbeat scripts) (log
above is created with these statements)
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd0 ext3 stop
/etc/ha.d/resource.d/drbddisk r0 stop
/etc/ha.d/resource.d/drbddisk r0 start
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd0 ext3 start
chown -R mysql:mysql /drbd0/mysql
The -11 error suggest that drbd is getting the OS error code 11:
Resource temporarily unavailable, and might be ignored in this case,
despite the error's, the ownership is changed to user/grp mysql.mysql
recursive in the directory.
I think this might be a defect? It would be a shame of the i/o failure
detection capability to turn it off (pass-on) as a workaround.
When I do not pass on, drbd declares my server Diskless which is not a
realistic scenario when running mysql under heavy load.
A workaround for MySQL is to remove the chown statement from the start
script, however that's also adding a risk as the statement was put there
with a purpose J
Context:
I'm using the conf below on a somewhat older version of RH Linux
ndb1-test.momac.net 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005
i686 i686 i386 GNU/Linux, Red Hat Enterprise Linux ES release 4 (Nahant
Update 2). I'm using lvm2 for my devices.
DRBD Version: 8.3.4 (api:88) (created version from source).
DRBDADM_BUILDTAG=GIT-hash:\ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\ build\
by\ root at ndb1-test.momac.net\,\ 2009-12-29\ 13:44:59
DRBDADM_API_VERSION=88
DRBD_KERNEL_VERSION_CODE=0x080304
DRBDADM_VERSION_CODE=0x080304
DRBDADM_VERSION=8.3.4
# drbd.conf example
#
# parameters you _need_ to change are the hostname, device, disk, #
meta-disk, address and port in the "on <hostname> {}" sections.
#
# you ought to know about the protocol, and the various timeouts.
#
# you probably want to set the rate in the syncer sections
#
# NOTE common pitfall:
# rate is given in units of _byte_ not bit #
#
# increase timeout and maybe ping-int in net{}, if you see # problems
with "connection lost/connection established"
# (or change your setup to reduce network latency; make sure full #
duplex behaves as such; check average roundtrip times while # network
is saturated; and so on ...) #
skip {
As you can see, you can also comment chunks of text
with a 'skip[optional nonsense]{ skipped text }' section.
This comes in handy, if you just want to comment out
some 'resource <some name> {...}' section:
just precede it with 'skip'.
The basic format of option assignment is
<option name><linear whitespace><value>;
It should be obvious from the examples below,
but if you really care to know the details:
<option name> :=
valid options in the respective scope
<value> := <num>|<string>|<choice>|...
depending on the set of allowed values
for the respective option.
<num> := [0-9]+, sometimes with an optional suffix of K,M,G
<string> := (<name>|\"([^\"\\\n]*|\\.)*\ <file:///\\\n]*|\.)*\> ")+
<name> := [/_.A-Za-z0-9-]+
}
#
# At most ONE global section is allowed.
# It must precede any resource section.
#
global {
# By default we load the module with a minor-count of 32. In case
you
# have more devices in your config, the module gets loaded with
# a minor-count that ensures that you have 10 minors spare.
# In case 10 spare minors are too little for you, you can set the
# minor-count exeplicit here. ( Note, in contrast to DRBD-0.7 an
# unused, spare minor has only a very little overhead of allocated
# memory (a single pointer to be exact). )
#
# minor-count 64;
# The user dialog counts and displays the seconds it waited so
# far. You might want to disable this if you have the console
# of your server connected to a serial terminal server with
# limited logging capacity.
# The Dialog will print the count each 'dialog-refresh' seconds,
# set it to 0 to disable redrawing completely. [ default = 1 ]
#
# dialog-refresh 5; # 5 seconds
# You might disable one of drbdadm's sanity check.
# disable-ip-verification;
# Participate in DRBD's online usage counter at
http://usage.drbd.org
# possilbe options: ask, yes, no. Default is ask. In case you do not
# know, set it to ask, and follow the on screen instructions later.
usage-count yes;
}
#
# The common section can have all the sections a resource can have but #
not the host section (started with the "on" keyword).
# The common section must precede all resources.
# All resources inherit the settings from the common section.
# Whereas settings in the resources have precedence over the common #
setting.
#
common {
syncer { rate 10M; }
}
#
# this need not be r#, you may use phony resource names, # like
"resource web" or "resource mail", too #
resource r0 {
# transfer protocol to use.
# C: write IO is reported as completed, if we know it has
# reached _both_ local and remote DISK.
# * for critical transactional data.
# B: write IO is reported as completed, if it has reached
# local DISK and remote buffer cache.
# * for most cases.
# A: write IO is reported as completed, if it has reached
# local DISK and local tcp send buffer. (see also sndbuf-size)
# * for high latency networks
#
#**********
# uhm, benchmarks have shown that C is actually better than B.
# this note shall disappear, when we are convinced that B is
# the right choice "for most cases".
# Until then, always use C unless you have a reason not to.
# --lge
#**********
#
protocol C;
handlers {
# what should be done in case the node is primary, degraded
# (=no connection) and has inconsistent data.
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
reboot -f";
# The node is currently primary, but lost the after split brain
# auto recovery procedure. As as consequence it should go away.
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
reboot -f";
# In case you have set the on-io-error option to
"call-local-io-error",
# this script will get executed in case of a local IO error. It is
# expected that this script will case a immediate failover in the
# cluster.
local-io-error "/usr/lib/drbd/notify-io-error.sh;
/usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger
; halt -f";
# Commands to run in case we need to downgrade the peer's disk
# state to "Outdated". Should be implemented by the superior
# communication possibilities of our cluster manager.
# The provided script uses ssh, and is for demonstration/development
# purposis.
# fence-peer "/usr/lib/drbd/outdate-peer.sh on amd 192.168.22.11
192.168.23.11 on alf 192.168.22.12 192.168.23.12";
#
# Update: Now there is a solution that relies on heartbeat's
# communication layers. You should really use this.
#fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
# For Pacemaker you might use:
# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
# The node is currently primary, but should become sync target
# after the negotiating phase. Alert someone about this incident.
#pri-lost "/usr/lib/drbd/notify-pri-lost.sh;
/usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
reboot -f";
# Notify someone in case DRBD split brained.
#split-brain ""/usr/lib/drbd/notify-split-brain.sh root";
# Notify someone in case an online verify run found the backing
devices out of sync.
#out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
#
# These two handlers can be used to snapshot sync-target devices
# before for the time of the resync.
# The provided scripts has these options:
# -p | --percent <reserve space in percent of the original volume.
Default: 10%>
# -a | --additional <snapshot space in KiB. Default: 10 MiB>
# -n | --disconnect-on-error
# By default the script tells DRBD to do the resync no matter
# if the taking the snapshot works or not.
# If you prefer to drop connection in case taking the snapshot
# failes use the --disconnect-on-error option.
# -v | --verbose
# -- <additional lvcreate options>
#before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh
-p 15 -- -c 16k";
#after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
}
startup {
# Wait for connection timeout.
# The init script blocks the boot process until the resources
# are connected. This is so when the cluster manager starts later,
# it does not see a resource with internal split-brain.
# In case you want to limit the wait time, do it here.
# Default is 0, which means unlimited. Unit is seconds.
#
# wfc-timeout 0;
# Wait for connection timeout if this node was a degraded cluster.
# In case a degraded cluster (= cluster with only one node left)
# is rebooted, this timeout value is used.
#
degr-wfc-timeout 120; # 2 minutes.
# Wait for connection timeout if the peer node is already outdated.
# (Do not set this to 0, since that means unlimited)
#
outdated-wfc-timeout 2; # 2 seconds.
# In case there was a split brain situation the devices will
# drop their network configuration instead of connecting. Since
# this means that the network is working, the cluster manager
# should be able to communicate as well. Therefore the default
# of DRBD's init script is to terminate in this case. To make
# it to continue waiting in this case set this option.
#
# wait-after-sb;
# In case you are using DRBD for GFS/OCFS2 you want that the
# startup script promotes it to primary. Nodenames are also
# possible instead of "both".
# become-primary-on both;
}
disk {
# if the lower level device reports io-error you have the choice of
# "pass_on" -> Report the io-error to the upper layers.
# Primary -> report it to the mounted file system.
# Secondary -> ignore it.
# "call-local-io-error"
# -> Call the script configured by the name
"local-io-error".
# "detach" -> The node drops its backing storage device, and
# continues in disk less mode.
#
#on-io-error detach;
on-io-error "pass_on";
# Controls the fencing policy, default is "dont-care". Before you
# set any policy you need to make sure that you have a working
# fence-peer handler. Possible values are:
# "dont-care" -> Never call the fence-peer handler. [ DEFAULT ]
# "resource-only" -> Call the fence-peer handler if we primary and
# loose the connection to the
secondary. As well
# whenn a unconnected secondary wants
to become
# primary.
# "resource-and-stonith"
# -> Calls the fence-peer handler and freezes local
# IO immediately after loss of connection. This
is
# necessary if your heartbeat can
STONITH the other
# node.
# fencing resource-only;
# In case you only want to use a fraction of the available space
# you might use the "size" option here.
#
# size 10G;
# In case you are sure that your storage subsystem has battery
# backed up RAM and you know from measurements that it really honors
# flush instructions by flushing data out from its non volatile
# write cache to disk, you have double security. You might then
# reduce this to single security by disabling disk flushes with
# this option. It might improve performance in this case.
# ONLY USE THIS OPTION IF YOU KNOW WHAT YOU ARE DOING.
# no-disk-flushes;
# no-md-flushes;
# In some special circumstances the device mapper stack manages to
# pass BIOs to DRBD that violate the constraints that are set forth
# by DRBD's merge_bvec() function and which have more than one bvec.
# A known example is:
# phys-disk -> DRBD -> LVM -> Xen -> missaligned partition (63) ->
DomU FS
# Then you might see "bio would need to, but cannot, be split:" in
# the Dom0's kernel log.
# The best workaround is to proper align the partition within
# the VM (E.g. start it at sector 1024). (Costs 480 KiByte of
storage)
# Unfortunately the default of most Linux partitioning tools is
# to start the first partition at an odd number (63). Therefore
# most distribution's install helpers for virtual linux machines
will
# end up with missaligned partitions.
# The second best workaround is to limit DRBD's max bvecs per BIO
# (= max-bio-bvecs) to 1. (Costs performance).
# max-bio-bvecs 1;
}
net {
# this is the size of the tcp socket send buffer
# increase it _carefully_ if you want to use protocol A over a
# high latency network with reasonable write throughput.
# defaults to 2*65535; you might try even 1M, but if your kernel or
# network driver chokes on that, you have been warned.
# sndbuf-size 512k;
# timeout 60; # 6 seconds (unit = 0.1 seconds)
# connect-int 10; # 10 seconds (unit = 1 second)
# ping-int 10; # 10 seconds (unit = 1 second)
# ping-timeout 5; # 500 ms (unit = 0.1 seconds)
# Maximal number of requests (4K) to be allocated by DRBD.
# The minimum is hardcoded to 32 (=128 kByte).
# For high performance installations it might help if you
# increase that number. These buffers are used to hold
# datablocks while they are written to disk.
#
# max-buffers 2048;
# When the number of outstanding requests on a standby (secondary)
# node exceeds bdev-threshold, we start to kick the backing device
# to start its request processing. This is an advanced tuning
# parameter to get more performance out of capable storage
controlers.
# Some controlers like to be kicked often, other controlers
# deliver better performance when they are kicked less frequently.
# Set it to the value of max-buffers to get the least possible
# number of run_task_queue_disk() / q->unplug_fn(q) calls.
#
# unplug-watermark 128;
# The highest number of data blocks between two write barriers.
# If you set this < 10 you might decrease your performance.
# max-epoch-size 2048;
# if some block send times out this many times, the peer is
# considered dead, even if it still answers ping requests.
# ko-count 4;
# If you want to use OCFS2/openGFS on top of DRBD enable
# this optione, and only enable it if you are going to use
# one of these filesystems. Do not enable it for ext2,
# ext3,reiserFS,XFS,JFS etc...
# allow-two-primaries;
# This enables peer authentication. Without this everybody
# on the network could connect to one of your DRBD nodes with
# a program that emulates DRBD's protocoll and could suck off
# all your data.
# Specify one of the kernel's digest algorithms, e.g.:
# md5, sha1, sha256, sha512, wp256, wp384, wp512, michael_mic ...
# an a shared secret.
# Authentication is only done once after the TCP connection
# is establised, there are no disadvantages from using
authentication,
# therefore I suggest to enable it in any case.
# cram-hmac-alg "sha1";
# shared-secret "FooFunFactory";
# In case the nodes of your cluster nodes see each other again,
after
# an split brain situation in which both nodes where primary
# at the same time, you have two diverged versions of your data.
#
# In case both nodes are secondary you can control DRBD's
# auto recovery strategy by the "after-sb-0pri" options. The
# default is to disconnect.
# "disconnect" ... No automatic resynchronisation, simply
disconnect.
# "discard-younger-primary"
# Auto sync from the node that was primary
before
# the split brain situation happened.
# "discard-older-primary"
# Auto sync from the node that became primary
# as second during the split brain situation.
# "discard-least-changes"
# Auto sync from the node that touched more
# blocks during the split brain situation.
# "discard-node-NODENAME"
# Auto sync _to_ the named node.
after-sb-0pri discard-younger-primary;
# In one of the nodes is already primary, then the auto-recovery
# strategie is controled by the "after-sb-1pri" options.
# "disconnect" ... always disconnect
# "consensus" ... discard the version of the secondary if the
outcome
# of the "after-sb-0pri" algorithm would also
destroy
# the current secondary's data. Otherwise
disconnect.
# "violently-as0p" Always take the decission of the
"after-sb-0pri"
# algorithm. Even if that causes case an erratic
change
# of the primarie's view of the data.
# This is only usefull if you use an 1node FS
(i.e.
# not OCFS2 or GFS) with the
allow-two-primaries
# flag, _AND_ you really know what you
are doing.
# This is DANGEROUS and MAY CRASH YOUR
MACHINE if you
# have a FS mounted on the primary node.
# "discard-secondary"
# discard the version of the secondary.
# "call-pri-lost-after-sb" Always honour the outcome of the
"after-sb-0pri"
# algorithm. In case it decides the the current
# secondary has the right data, it panics the
# current primary.
# "suspend-primary" ???
after-sb-1pri consensus;
# In case both nodes are primary you control DRBD's strategy by
# the "after-sb-2pri" option.
# "disconnect" ... Go to StandAlone mode on both sides.
# "violently-as0p" Always take the decission of the
"after-sb-0pri".
# "call-pri-lost-after-sb" ... Honor the outcome of the
"after-sb-0pri"
# algorithm and panic the other node.
after-sb-2pri disconnect;
# To solve the cases when the outcome of the resync descissions is
# incompatible to the current role asignment in the cluster.
# "disconnect" ... No automatic resynchronisation, simply
disconnect.
# "violently" .... Sync to the primary node is allowed, violating
the
# assumption that data on a block device is
stable
# for one of the nodes. DANGEROUS, DO NOT
USE.
# "call-pri-lost" Call the "pri-lost" helper program on one of
the
# machines. This program is expected to
reboot the
# machine. (I.e. make it secondary.)
rr-conflict disconnect;
# DRBD-0.7's behaviour is equivalent to
# after-sb-0pri discard-younger-primary;
# after-sb-1pri consensus;
# after-sb-2pri disconnect;
# DRBD can ensure the data integrity of the user's data on the
network
# by comparing hash values.
# Note: Normally this is ensured by the 16 bit checksums in the
headers
# of TCP/IP packets. Unforunately it turned out that GBit NICs with
# various offloading engines might produce valid checksums for
corrupted
# data. Use this option during your pre-production tests, usually
you
# want to turn it off for production to reduce CPU overhead.
# Note2: If data blocks that gets written to disk are changed while
the
# transfer goes on cause false positives. Known block device users
which
# do so are the swap code and ReiserFS
# data-integrity-alg "md5";
# DRBD usually uses the TCP socket option TCP_CORK to hint to the
network
# stack when it can expect more data, and when it should flush out
what it
# has in its send queue. It turned out that there is at lease one
network
# stack that performs worse when one uses this hinting method.
Therefore
# we introducted this option, which disable the setting and clearing
of
# the TCP_CORK socket option by DRBD.
# no-tcp-cork;
}
syncer {
# Limit the bandwith used by the resynchronisation process.
# default unit is kByte/sec; optional suffixes K,M,G are allowed.
#
# Even though this is a network setting, the units are based
# on _byte_ (octet for our french friends) not bit.
# We are storage guys.
#
# Note that on 100Mbit ethernet, you cannot expect more than
# 12.5 MByte total transfer rate.
# Consider using GigaBit Ethernet.
#
# gigabit max around 115M;
rate 60M;
# Normally all devices are resynchronized parallel.
# To achieve better resynchronisation performance you should resync
# DRBD resources which have their backing storage on one physical
# disk sequentially. The express this use the "after" keyword.
#after "r2";
# Configures the size of the active set. Each extent is 4M,
# 257 Extents ~> 1GB active set size. In case your syncer
# runs @ 10MB/sec, all resync after a primary's crash will last
# 1GB / ( 10MB/sec ) ~ 102 seconds ~ One Minute and 42 Seconds.
# BTW, the hash algorithm works best if the number of al-extents
# is prime. (To test the worst case performace use a power of 2)
al-extents 257;
# Sets the CPU affinity mask of DRBD's threads. Might be of interest
# for advanced performance tuning.
# cpu-mask 15;
verify-alg md5;
}
on ndb1-test.momac.net {
device /dev/drbd0;
disk /dev/vg01/mysqldrbd0;
address 192.168.100.21:7788;
flexible-meta-disk internal;
# meta-disk is either 'internal' or '/dev/ice/name [idx]'
#
# You can use a single block device to store meta-data
# of multiple DRBD's.
# E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1];
# for two different resources. In this case the meta-disk
# would need to be at least 256 MB in size.
#
# 'internal' means, that the last 128 MB of the lower device
# are used to store the meta-data.
# You must not give an index with 'internal'.
}
on ndb2-test.momac.net {
device /dev/drbd0;
disk /dev/VG00/mysqldrbd0;
address 192.168.100.22:7788;
meta-disk internal;
}
}
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20091231/0f7fbc11/attachment.htm>