[DRBD-user] problem - DRBD Version: 8.3.4 - block drbd0: p read: error=-11 - block drbd0: Local READ failed - while doing a chown -R mysql:mysql

Thu Dec 31 11:37:05 CET 2009

Problem description:

I'm testing DRBD in combination with heartbeat on a fresh installation.
I have also reinstalled DRBD, recreated the file systems, and verified
and fsck'd them. I still see these I/O read errors (non-fatal, because I
have set on-io-error to pass_on) after the DRBD device is made primary,
then secondary, then primary again, the file system is mounted, and I
run a simple "chown -R mysql:mysql /drbd0/mysql".

I'm getting these errors in /var/log/messages:

Dec 31 10:44:04 ndb1-test kernel: block drbd0: role( Primary -> Secondary )
Dec 31 10:44:04 ndb1-test kernel: block drbd0: role( Secondary -> Primary )
Dec 31 10:44:04 ndb1-test kernel: kjournald starting.  Commit interval 5 seconds
Dec 31 10:44:04 ndb1-test kernel: EXT3 FS on drbd0, internal journal
Dec 31 10:44:04 ndb1-test kernel: EXT3-fs: mounted filesystem with ordered data mode.
Dec 31 10:44:04 ndb1-test kernel: block drbd0: p read: error=-11
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed sec=11238600s size=4096
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local IO failed in __req_mod.Passing error on...
Dec 31 10:44:04 ndb1-test kernel: block drbd0: p read: error=-11
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed sec=11933784s size=4096
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local IO failed in __req_mod.Passing error on...
Dec 31 10:44:04 ndb1-test kernel: block drbd0: p read: error=-11
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed sec=12363408s size=4096
Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local IO failed in __req_mod.Passing error on...
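To see where on the device the failed reads fall, the "Local READ failed" messages can be pulled out of the log and the sector numbers converted to byte offsets. The following is only a rough sketch: it assumes each kernel message sits on a single line (as it does in /var/log/messages) and that "sec=" counts 512-byte sectors. The function name failed_reads and the constant SECTOR_BYTES are made up for this illustration.

```python
import re

# Matches messages such as:
#   ... kernel: block drbd0: Local READ failed sec=11238600s size=4096
READ_FAILED = re.compile(
    r"block (?P<dev>drbd\d+): Local READ failed "
    r"sec=(?P<sec>\d+)s size=(?P<size>\d+)"
)

SECTOR_BYTES = 512  # assumption: "sec=" is counted in 512-byte sectors


def failed_reads(lines):
    """Yield (device, sector, size, byte_offset) for each failed local read."""
    for line in lines:
        match = READ_FAILED.search(line)
        if match:
            sector = int(match.group("sec"))
            size = int(match.group("size"))
            yield match.group("dev"), sector, size, sector * SECTOR_BYTES


if __name__ == "__main__":
    sample = [
        "Dec 31 10:44:04 ndb1-test kernel: block drbd0: p read: error=-11",
        "Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed sec=11238600s size=4096",
        "Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed sec=11933784s size=4096",
        "Dec 31 10:44:04 ndb1-test kernel: block drbd0: Local READ failed sec=12363408s size=4096",
    ]
    for dev, sector, size, offset in failed_reads(sample):
        print(f"{dev}: sector {sector} (byte offset {offset}), {size} bytes")
```

Under that 512-byte assumption the three failed reads in the log above lie between roughly 5.7 and 6.4 GB into the device.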

My local devices (sda and sdb, under LVM) are NOT failing: everything
else works, there are no SCSI RAID errors whatsoever, and fsck says all
is okay.

I isolated this reproduction scenario using the heartbeat scripts (the
log above was produced by these commands):

/etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd0 ext3 stop
/etc/ha.d/resource.d/drbddisk r0 stop
/etc/ha.d/resource.d/drbddisk r0 start
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /drbd0 ext3 start
chown -R mysql:mysql /drbd0/mysql
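For repeated test runs, the sequence above could be wrapped in a small script. This is a minimal sketch and not part of heartbeat or DRBD: the helper names reproduction_commands and run_all and the "--run" switch are invented here, while the paths and arguments are the ones from the commands above. Without "--run" it only prints the commands.

```python
import subprocess
import sys

HA_DIR = "/etc/ha.d/resource.d"  # heartbeat resource scripts, as used above


def reproduction_commands(resource="r0", device="/dev/drbd0",
                          mountpoint="/drbd0", fstype="ext3",
                          owner="mysql:mysql", target="/drbd0/mysql"):
    """Return the reproduction sequence as a list of argv lists."""
    return [
        [f"{HA_DIR}/Filesystem", device, mountpoint, fstype, "stop"],
        [f"{HA_DIR}/drbddisk", resource, "stop"],
        [f"{HA_DIR}/drbddisk", resource, "start"],
        [f"{HA_DIR}/Filesystem", device, mountpoint, fstype, "start"],
        ["chown", "-R", owner, target],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    commands = reproduction_commands()
    if "--run" in sys.argv[1:]:
        # Needs root on a node that has heartbeat and DRBD set up.
        run_all(commands)
    else:
        # Default is a dry run: only show what would be executed.
        for argv in commands:
            print(" ".join(argv))
```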

The -11 suggests that DRBD is getting OS error code 11 (EAGAIN,
"Resource temporarily unavailable"), which might be safe to ignore in
this case: despite the errors, the ownership is changed recursively to
mysql:mysql throughout the directory.
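The mapping from the number to the name can be checked quickly with Python's standard errno and os modules. This is just a sketch with a made-up helper name (describe_kernel_error); note that Python reports the errno table of the machine it runs on, so it should be run on the Linux host in question.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel-style error code to (symbolic name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # error=-11 is the value from the "p read" log lines
    name, message = describe_kernel_error(-11)
    print(name, "-", message)
```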

I think this might be a defect. It would be a shame to give up DRBD's
I/O failure detection by switching it off (pass_on) as a workaround.

When I do not pass the error on, DRBD declares my server Diskless, which
is not a workable scenario when running MySQL under heavy load.

A workaround for MySQL is to remove the chown statement from the start
script, but that adds a risk of its own, as the statement was put there
for a purpose :-)

Context:

I'm using the configuration below on a somewhat older version of Red Hat
Linux: ndb1-test.momac.net 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT
2005 i686 i686 i386 GNU/Linux, Red Hat Enterprise Linux ES release 4
(Nahant Update 2). I'm using LVM2 for my devices.

DRBD Version: 8.3.4 (api:88), built from source.

DRBDADM_BUILDTAG=GIT-hash:\ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\ build\ by\ root@ndb1-test.momac.net\,\ 2009-12-29\ 13:44:59
DRBDADM_API_VERSION=88
DRBD_KERNEL_VERSION_CODE=0x080304
DRBDADM_VERSION_CODE=0x080304
DRBDADM_VERSION=8.3.4

# drbd.conf example

#

# parameters you _need_ to change are the hostname, device, disk,

# meta-disk, address and port in the "on <hostname> {}" sections.

#

# you ought to know about the protocol, and the various timeouts.

#

# you probably want to set the rate in the syncer sections

#

# NOTE common pitfall:

# rate is given in units of _byte_ not bit

#

#

# increase timeout and maybe ping-int in net{}, if you see

# problems with "connection lost/connection established"

# (or change your setup to reduce network latency; make sure full

#  duplex behaves as such; check average roundtrip times while

#  network is saturated; and so on ...)

#

skip {

  As you can see, you can also comment chunks of text

  with a 'skip[optional nonsense]{ skipped text }' section.

  This comes in handy, if you just want to comment out

  some 'resource <some name> {...}' section:

  just precede it with 'skip'.

  The basic format of option assignment is

  <option name><linear whitespace><value>;

  It should be obvious from the examples below,

  but if you really care to know the details:

  <option name> :=

        valid options in the respective scope

  <value>  := <num>|<string>|<choice>|...

              depending on the set of allowed values

              for the respective option.

  <num>    := [0-9]+, sometimes with an optional suffix of K,M,G

  <string> := (<name>|\"([^\"\\\n]*|\\.)*\")+

  <name>   := [/_.A-Za-z0-9-]+

}

#

# At most ONE global section is allowed.

# It must precede any resource section.

#

global {

    # By default we load the module with a minor-count of 32. In case you

    # have more devices in your config, the module gets loaded with

    # a minor-count that ensures that you have 10 minors spare.

    # In case 10 spare minors are too little for you, you can set the

    # minor-count explicitly here. ( Note, in contrast to DRBD-0.7 an

    # unused, spare minor has only a very little overhead of allocated

    # memory (a single pointer to be exact). )

    #

    # minor-count 64;

    # The user dialog counts and displays the seconds it waited so

    # far. You might want to disable this if you have the console

    # of your server connected to a serial terminal server with

    # limited logging capacity.

    # The Dialog will print the count each 'dialog-refresh' seconds,

    # set it to 0 to disable redrawing completely. [ default = 1 ]

    #

    # dialog-refresh 5; # 5 seconds

    # You might disable one of drbdadm's sanity check.

    # disable-ip-verification;

    # Participate in DRBD's online usage counter at http://usage.drbd.org

    # possible options: ask, yes, no. Default is ask. In case you do not

    # know, set it to ask, and follow the on screen instructions later.

    usage-count yes;

}

#

# The common section can have all the sections a resource can have but

# not the host section (started with the "on" keyword).

# The common section must precede all resources.

# All resources inherit the settings from the common section.

# Whereas settings in the resources have precedence over the common

# setting.

#

common {

  syncer { rate 10M; }

}

#

# this need not be r#, you may use phony resource names,

# like "resource web" or "resource mail", too

#

resource r0 {

  # transfer protocol to use.

  # C: write IO is reported as completed, if we know it has

  #    reached _both_ local and remote DISK.

  #    * for critical transactional data.

  # B: write IO is reported as completed, if it has reached

  #    local DISK and remote buffer cache.

  #    * for most cases.

  # A: write IO is reported as completed, if it has reached

  #    local DISK and local tcp send buffer. (see also sndbuf-size)

  #    * for high latency networks

  #

  #**********

  # uhm, benchmarks have shown that C is actually better than B.

  # this note shall disappear, when we are convinced that B is

  # the right choice "for most cases".

  # Until then, always use C unless you have a reason not to.

  #        --lge

  #**********

  #

  protocol C;

  handlers {

    # what should be done in case the node is primary, degraded

    # (=no connection) and has inconsistent data.

    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";

    # The node is currently primary, but lost the after split brain

    # auto recovery procedure. As a consequence it should go away.

    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";

    # In case you have set the on-io-error option to "call-local-io-error",

    # this script will get executed in case of a local IO error. It is

    # expected that this script will cause an immediate failover in the

    # cluster.

    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";

    # Commands to run in case we need to downgrade the peer's disk

    # state to "Outdated". Should be implemented by the superior

    # communication possibilities of our cluster manager.

    # The provided script uses ssh, and is for demonstration/development

    # purposes.

    # fence-peer "/usr/lib/drbd/outdate-peer.sh on amd 192.168.22.11 192.168.23.11 on alf 192.168.22.12 192.168.23.12";

    #

    # Update: Now there is a solution that relies on heartbeat's

    # communication layers. You should really use this.

    #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";

    # For Pacemaker you might use:

    # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";

    # The node is currently primary, but should become sync target

    # after the negotiating phase. Alert someone about this incident.

    #pri-lost "/usr/lib/drbd/notify-pri-lost.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";

    # Notify someone in case DRBD split brained.

    #split-brain "/usr/lib/drbd/notify-split-brain.sh root";

    # Notify someone in case an online verify run found the backing

    # devices out of sync.

    #out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";

    #

    # These two handlers can be used to snapshot sync-target devices

    # for the duration of the resync.

    # The provided scripts have these options:

    # -p | --percent <reserve space in percent of the original volume. Default: 10%>

    # -a | --additional <snapshot space in KiB. Default: 10 MiB>

    # -n | --disconnect-on-error

    #    By default the script tells DRBD to do the resync no matter

    #    if taking the snapshot works or not.

    #    If you prefer to drop the connection in case taking the snapshot

    #    fails, use the --disconnect-on-error option.

    # -v | --verbose

    # -- <additional lvcreate options>

    #before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";

    #after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;

  }

  startup {

    # Wait for connection timeout.

    # The init script blocks the boot process until the resources

    # are connected. This is so when the cluster manager starts later,

    # it does not see a resource with internal split-brain.

    # In case you want to limit the wait time, do it here.

    # Default is 0, which means unlimited. Unit is seconds.

    #

    # wfc-timeout  0;

    # Wait for connection timeout if this node was a degraded cluster.

    # In case a degraded cluster (= cluster with only one node left)

    # is rebooted, this timeout value is used.

    #

    degr-wfc-timeout 120;    # 2 minutes.

    # Wait for connection timeout if the peer node is already outdated.

    # (Do not set this to 0, since that means unlimited)

    #

    outdated-wfc-timeout 2;  # 2 seconds.

    # In case there was a split brain situation the devices will

    # drop their network configuration instead of connecting. Since

    # this means that the network is working, the cluster manager

    # should be able to communicate as well. Therefore the default

    # of DRBD's init script is to terminate in this case. To make

    # it to continue waiting in this case set this option.

    # 

    # wait-after-sb;

    # In case you are using DRBD for GFS/OCFS2 you want that the

    # startup script promotes it to primary. Nodenames are also

    # possible instead of "both".

    # become-primary-on both;

  }

  disk {

    # if the lower level device reports io-error you have the choice of

    #  "pass_on"  ->  Report the io-error to the upper layers.

    #                 Primary   -> report it to the mounted file system.

    #                 Secondary -> ignore it.

    #  "call-local-io-error"

    #                ->  Call the script configured by the name "local-io-error".

    #  "detach"   ->  The node drops its backing storage device, and

    #                 continues in disk less mode.

    #

    #on-io-error   detach;

    on-io-error   "pass_on";

    # Controls the fencing policy, default is "dont-care". Before you

    # set any policy you need to make sure that you have a working

    # fence-peer handler. Possible values are:

    #  "dont-care"     -> Never call the fence-peer handler. [ DEFAULT ]

    #  "resource-only" -> Call the fence-peer handler if we are primary and

    #                     lose the connection to the secondary. As well

    #                     when an unconnected secondary wants to become

    #                     primary.

    #  "resource-and-stonith"

    #                  -> Calls the fence-peer handler and freezes local

    #                     IO immediately after loss of connection. This is

    #                     necessary if your heartbeat can STONITH the other

    #                     node.

    # fencing resource-only;

    # In case you only want to use a fraction of the available space

    # you might use the "size" option here.

    #

    # size 10G;

    # In case you are sure that your storage subsystem has battery

    # backed up RAM and you know from measurements that it really honors

    # flush instructions by flushing data out from its non volatile

    # write cache to disk, you have double security. You might then

    # reduce this to single security by disabling disk flushes with

    # this option. It might improve performance in this case.

    # ONLY USE THIS OPTION IF YOU KNOW WHAT YOU ARE DOING.

    # no-disk-flushes;

    # no-md-flushes;

    # In some special circumstances the device mapper stack manages to

    # pass BIOs to DRBD that violate the constraints that are set forth

    # by DRBD's merge_bvec() function and which have more than one bvec.

    # A known example is:

    # phys-disk -> DRBD -> LVM -> Xen -> misaligned partition (63) -> DomU FS

    # Then you might see "bio would need to, but cannot, be split:" in

    # the Dom0's kernel log.

    # The best workaround is to properly align the partition within

    # the VM (E.g. start it at sector 1024). (Costs 480 KiByte of storage)

    # Unfortunately the default of most Linux partitioning tools is

    # to start the first partition at an odd number (63). Therefore

    # most distribution's install helpers for virtual linux machines will

    # end up with misaligned partitions.

    # The second best workaround is to limit DRBD's max bvecs per BIO

    # (= max-bio-bvecs) to 1. (Costs performance).

    # max-bio-bvecs 1;

  }

  net {

    # this is the size of the tcp socket send buffer

    # increase it _carefully_ if you want to use protocol A over a

    # high latency network with reasonable write throughput.

    # defaults to 2*65535; you might try even 1M, but if your kernel or

    # network driver chokes on that, you have been warned.

    # sndbuf-size 512k;

    # timeout       60;    #  6 seconds  (unit = 0.1 seconds)

    # connect-int   10;    # 10 seconds  (unit = 1 second)

    # ping-int      10;    # 10 seconds  (unit = 1 second)

    # ping-timeout   5;    # 500 ms (unit = 0.1 seconds)

    # Maximal number of requests (4K) to be allocated by DRBD.

    # The minimum is hardcoded to 32 (=128 kByte).

    # For high performance installations it might help if you

    # increase that number. These buffers are used to hold

    # datablocks while they are written to disk.

    #

    # max-buffers     2048;

    # When the number of outstanding requests on a standby (secondary)

    # node exceeds bdev-threshold, we start to kick the backing device

    # to start its request processing. This is an advanced tuning

    # parameter to get more performance out of capable storage controllers.

    # Some controllers like to be kicked often, other controllers

    # deliver better performance when they are kicked less frequently.

    # Set it to the value of max-buffers to get the least possible

    # number of run_task_queue_disk() / q->unplug_fn(q) calls.

    #

    # unplug-watermark   128;

    # The highest number of data blocks between two write barriers.

    # If you set this < 10 you might decrease your performance.

    # max-epoch-size  2048;

    # if some block send times out this many times, the peer is

    # considered dead, even if it still answers ping requests.

    # ko-count 4;

    # If you want to use OCFS2/openGFS on top of DRBD enable

    # this option, and only enable it if you are going to use

    # one of these filesystems. Do not enable it for ext2,

    # ext3,reiserFS,XFS,JFS etc...

    # allow-two-primaries;

    # This enables peer authentication. Without this everybody

    # on the network could connect to one of your DRBD nodes with

    # a program that emulates DRBD's protocol and could suck off

    # all your data.

    # Specify one of the kernel's digest algorithms, e.g.:

    # md5, sha1, sha256, sha512, wp256, wp384, wp512, michael_mic ...

    # and a shared secret.

    # Authentication is only done once after the TCP connection

    # is established, there are no disadvantages from using authentication,

    # therefore I suggest to enable it in any case.

    # cram-hmac-alg "sha1";

    # shared-secret "FooFunFactory";

    # In case the nodes of your cluster see each other again, after

    # a split brain situation in which both nodes were primary

    # at the same time, you have two diverged versions of your data.

    #

    # In case both nodes are secondary you can control DRBD's

    # auto recovery strategy by the "after-sb-0pri" options. The

    # default is to disconnect.

    #    "disconnect" ... No automatic resynchronisation, simply disconnect.

    #    "discard-younger-primary"

    #                     Auto sync from the node that was primary before

    #                     the split brain situation happened.

    #    "discard-older-primary"

    #                     Auto sync from the node that became primary

    #                     as second during the split brain situation.

    #    "discard-least-changes"

    #                     Auto sync from the node that touched more

    #                     blocks during the split brain situation.

    #    "discard-node-NODENAME"

    #                     Auto sync _to_ the named node.

    after-sb-0pri discard-younger-primary;

    # If one of the nodes is already primary, then the auto-recovery

    # strategy is controlled by the "after-sb-1pri" options.

    #    "disconnect" ... always disconnect

    #    "consensus"  ... discard the version of the secondary if the outcome

    #                     of the "after-sb-0pri" algorithm would also destroy

    #                     the current secondary's data. Otherwise disconnect.

    #    "violently-as0p" Always take the decision of the "after-sb-0pri"

    #                     algorithm. Even if that causes an erratic change

    #                     of the primary's view of the data.

    #                     This is only useful if you use a 1node FS (i.e.

    #                     not OCFS2 or GFS) with the allow-two-primaries

    #                     flag, _AND_ you really know what you are doing.

    #                     This is DANGEROUS and MAY CRASH YOUR MACHINE if you

    #                     have a FS mounted on the primary node.

    #    "discard-secondary"

    #                     discard the version of the secondary.

    #    "call-pri-lost-after-sb"  Always honour the outcome of the "after-sb-0pri"

    #                     algorithm. In case it decides the current

    #                     secondary has the right data, it panics the

    #                     current primary.

    #    "suspend-primary" ???

    after-sb-1pri consensus;

    # In case both nodes are primary you control DRBD's strategy by

    # the "after-sb-2pri" option.

    #    "disconnect" ... Go to StandAlone mode on both sides.

    #    "violently-as0p" Always take the decision of the "after-sb-0pri".

    #    "call-pri-lost-after-sb" ... Honor the outcome of the "after-sb-0pri"

    #                     algorithm and panic the other node.

    after-sb-2pri disconnect;

    # To solve the cases when the outcome of the resync decisions is

    # incompatible with the current role assignment in the cluster.

    #    "disconnect" ... No automatic resynchronisation, simply disconnect.

    #    "violently" .... Sync to the primary node is allowed, violating the

    #                     assumption that data on a block device is stable

    #                     for one of the nodes. DANGEROUS, DO NOT USE.

    #    "call-pri-lost"  Call the "pri-lost" helper program on one of the

    #                     machines. This program is expected to reboot the

    #                     machine. (I.e. make it secondary.)

    rr-conflict disconnect;

    # DRBD-0.7's behaviour is equivalent to

    #   after-sb-0pri discard-younger-primary;

    #   after-sb-1pri consensus;

    #   after-sb-2pri disconnect;

    # DRBD can ensure the data integrity of the user's data on the network

    # by comparing hash values.

    # Note: Normally this is ensured by the 16 bit checksums in the headers

    # of TCP/IP packets. Unfortunately it turned out that GBit NICs with

    # various offloading engines might produce valid checksums for corrupted

    # data. Use this option during your pre-production tests, usually you

    # want to turn it off for production to reduce CPU overhead.

    # Note2: If data blocks that get written to disk are changed while the

    # transfer goes on, this causes false positives. Known block device users

    # which do so are the swap code and ReiserFS

    # data-integrity-alg "md5";

    # DRBD usually uses the TCP socket option TCP_CORK to hint to the network

    # stack when it can expect more data, and when it should flush out what it

    # has in its send queue. It turned out that there is at least one network

    # stack that performs worse when one uses this hinting method. Therefore

    # we introduced this option, which disables the setting and clearing of

    # the TCP_CORK socket option by DRBD.

    # no-tcp-cork;

  }

  syncer {

    # Limit the bandwidth used by the resynchronisation process.

    # default unit is kByte/sec; optional suffixes K,M,G are allowed.

    #

    # Even though this is a network setting, the units are based

    # on _byte_ (octet for our french friends) not bit.

    # We are storage guys.

    #

    # Note that on 100Mbit ethernet, you cannot expect more than

    # 12.5 MByte total transfer rate.

    # Consider using GigaBit Ethernet.

    #

    # gigabit max around 115M;

    rate 60M;

    # Normally all devices are resynchronized parallel.

    # To achieve better resynchronisation performance you should resync

    # DRBD resources which have their backing storage on one physical

    # disk sequentially. To express this use the "after" keyword.

    #after "r2";

    # Configures the size of the active set. Each extent is 4M,

    # 257 Extents ~> 1GB active set size. In case your syncer

    # runs @ 10MB/sec, all resync after a primary's crash will last

    # 1GB / ( 10MB/sec ) ~ 102 seconds ~ One Minute and 42 Seconds.

    # BTW, the hash algorithm works best if the number of al-extents

    # is prime. (To test the worst case performance use a power of 2)

    al-extents 257;

    # Sets the CPU affinity mask of DRBD's threads. Might be of interest

    # for advanced performance tuning.

    # cpu-mask 15;

    verify-alg md5;

  }

  on ndb1-test.momac.net {

    device     /dev/drbd0;

    disk       /dev/vg01/mysqldrbd0;

    address    192.168.100.21:7788;

    flexible-meta-disk  internal;

    # meta-disk is either 'internal' or '/dev/ice/name [idx]'

    #

    # You can use a single block device to store meta-data

    # of multiple DRBD's.

    # E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1];

    # for two different resources. In this case the meta-disk

    # would need to be at least 256 MB in size.

    #

    # 'internal' means, that the last 128 MB of the lower device

    # are used to store the meta-data.

    # You must not give an index with 'internal'.

  }

  on ndb2-test.momac.net {

    device    /dev/drbd0;

    disk      /dev/VG00/mysqldrbd0;

    address   192.168.100.22:7788;

    meta-disk internal;

  }

}
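
As a side note on the syncer section above: the al-extents comment estimates the resync time after a primary crash from the size of the active set. The arithmetic can be sketched as below, assuming 4 MiB per extent as that comment states and treating the rate as MiB per second. The function names are invented for this illustration; the second rate (60) is the "rate 60M" actually set in resource r0.

```python
EXTENT_MIB = 4  # per the drbd.conf comment: "Each extent is 4M"


def active_set_mib(al_extents):
    """Size of the activity-log active set in MiB."""
    return al_extents * EXTENT_MIB


def resync_seconds(al_extents, rate_mib_per_s):
    """Rough time to resync the whole active set at the given rate."""
    return active_set_mib(al_extents) / rate_mib_per_s


if __name__ == "__main__":
    for rate in (10, 60):
        print(f"al-extents 257 at {rate} MiB/s: "
              f"{active_set_mib(257)} MiB, "
              f"about {resync_seconds(257, rate):.1f} s")
```

With al-extents 257 this gives 1028 MiB, so about 102.8 s at 10 MiB/s (matching the "~ 102 seconds" in the comment) and about 17.1 s at 60 MiB/s.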
