[DRBD-user] DRBD Split-Brain, Configuration Issues

Wed Jan 30 01:53:00 CET 2008

Hello,

We are attempting to configure two servers with DRBD to host a drive
array for web hosting.  We've gotten DRBD configured between the two
and have configured Heartbeat to monitor it; however, failover isn't
working cleanly and we keep running into split-brain.

Basically, once it's running, if we remove the primary node, it will
switch over properly to the secondary; however, once the primary comes
back up, it goes split-brain on us and disconnects, and we have no
choice but to manually re-sync the devices.

We tried setting up dopd per the instructions so that the stale node
would be marked as outdated; however, this didn't fix the split-brain
issue and in fact made things worse - when we removed the primary, the
secondary was marked outdated, went offline, and wouldn't come back up.
If I'm not mistaken, this is actually the intended behavior of dopd,
but I'd like to make sure we're not being stupid.
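
As a sketch of what enabling dopd involves, assuming the commented-out
lines in the configs below are the relevant ones (the paths are the ones
from those files; everything else in the resource is omitted here):

```
# drbd.conf side (sketch, other sections of the resource omitted):
# enable resource-level fencing and point the outdate-peer handler
# at the Heartbeat helper.
resource drbd0 {
  disk {
    fencing resource-only;
  }
  handlers {
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }
}

# ha.cf side, shown as comments because it is a different file:
#   respawn hacluster /usr/lib/heartbeat/dopd
#   apiauth dopd gid=haclient uid=hacluster
```

With only some of these pieces active (for example the handler without
the fencing policy, or without dopd running on both nodes), the outdate
request may never reach the peer, so the combination matters more than
any single line.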

So, a few questions: 

* How can we fix the split-brain issue?  Do we just have to modify
after-sb-0pri, after-sb-1pri and after-sb-2pri?  If so, what would be
recommended for a hosting environment?
* Should we have dopd running?
* Does anything else look totally out of place or incorrectly done?
We're new at this and we'd like to get some pointers from the list,
since you know more than we do :)
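
To make the first question concrete, a net section with automatic
recovery policies might look like the sketch below.  The policy names
are taken from the DRBD 8 drbd.conf man page; the particular choice is
only an assumption for illustration, and whether it suits a hosting
workload is exactly what we're asking.

```
# Sketch only: candidate automatic split-brain recovery policies.
# Any automatic policy discards the changes made on one of the nodes.
net {
  # Neither node is Primary when split-brain is detected: if one node
  # made no changes at all, apply the other node's changes.
  after-sb-0pri discard-zero-changes;

  # Exactly one node is Primary: discard the Secondary's changes.
  after-sb-1pri discard-secondary;

  # Both nodes are Primary: do not resolve automatically.
  after-sb-2pri disconnect;
}
```

Our current config uses "disconnect" for all three, which is why every
split-brain ends in a manual re-sync.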

Thanks a lot!
Mike Sweetser

/etc/ha.d/ha.cf:
#
#       There are lots of options in this file.  All you have to have
#       is a set of nodes listed {"node ...} one of {serial, bcast,
#       mcast, or ucast}, and a value for "auto_failback".
#
#       ATTENTION: As the configuration file is read line by line,
#                  THE ORDER OF DIRECTIVES MATTERS!
#
#       In particular, make sure that the udpport, serial baud rate
#       etc. are set before the heartbeat media are defined!
#       debug and log file directives go into effect when they
#       are encountered.
#
#       All will be fine if you keep them ordered as in this example.
#
#
#       Note on logging:
#       If any of debugfile, logfile and logfacility are defined then
#       they will be used. If debugfile and/or logfile are not defined
#       and logfacility is defined then the respective logging and debug
#       messages will be logged to syslog. If logfacility is not defined
#       then debugfile and logfile will be used to log messages. If
#       logfacility is not defined and debugfile and/or logfile are not
#       defined then defaults will be used for debugfile and logfile as
#       required and messages will be sent there.
#
# Added CRM directive according to http://www.drbd.org/drbd8-howto2.html
#
crm yes
#
#
#       File to write debug messages to
debugfile /var/log/ha-debug
#
#
#       File to write other messages to
#
logfile /var/log/ha-log
#
#
#       Facility to use for syslog()/logger 
#
logfacility     local0
#
#
#       A note on specifying "how long" times below...
#
#       The default time unit is seconds
#               10 means ten seconds
#
#       You can also specify them in milliseconds
#               1500ms means 1.5 seconds
#
#
#       keepalive: how long between heartbeats?
#
keepalive 2
#
#       deadtime: how long-to-declare-host-dead?
#
#               If you set this too low you will get the problematic
#               split-brain (or cluster partition) problem.
#               See the FAQ for how to use warntime to tune deadtime.
#
deadtime 30
#
#       warntime: how long before issuing "late heartbeat" warning?
#       See the FAQ for how to use warntime to tune deadtime.
#
warntime 10
#
#
#       Very first dead time (initdead)
#
#       On some machines/OSes, etc. the network takes a while to come up
#       and start working right after you've been rebooted.  As a result
#       we have a separate dead time for when things first come up.
#       It should be at least twice the normal dead time.
#
initdead 120
#
#
#       What UDP port to use for bcast/ucast communication?
#
udpport 694
#
#       Baud rate for serial ports...
#
#baud   19200
#
#       serial  serialportname ...
#serial /dev/ttyS0      # Linux
#serial /dev/cuaa0      # FreeBSD
#serial /dev/cuad0      # FreeBSD 6.x
#serial /dev/cua/a      # Solaris
#
#
#       What interfaces to broadcast heartbeats over?
#
#bcast  eth0            # Linux
#bcast  eth1    # Linux
#bcast  le0             # Solaris
#bcast  le1 le2         # Solaris
#
#       Set up a multicast heartbeat medium
#       mcast [dev] [mcast group] [port] [ttl] [loop]
#
#       [dev]           device to send/rcv heartbeats on
#       [mcast group]   multicast group to join (class D multicast
#                       address 224.0.0.0 - 239.255.255.255)
#       [port]          udp port to sendto/rcvfrom (set this value to
#                       the same value as "udpport" above)
#       [ttl]           the ttl value for outbound heartbeats.  this
#                       affects how far the multicast packet will
#                       propagate.  (0-255)
#                       Must be greater than zero.
#       [loop]          toggles loopback for outbound multicast
#                       heartbeats.  if enabled, an outbound packet
#                       will be looped back and received by the
#                       interface it was sent on. (0 or 1)
#                       Set this value to zero.
#
#
#mcast eth0 225.0.0.1 694 1 0
#
#       Set up a unicast / udp heartbeat medium
#       ucast [dev] [peer-ip-addr]
#
#       [dev]           device to send/rcv heartbeats on
#       [peer-ip-addr]  IP address of peer to send packets to
#
ucast eth1 172.16.2.1
#
#
#       About boolean values...
#
#       Any of the following case-insensitive values will work for true:
#               true, on, yes, y, 1
#       Any of the following case-insensitive values will work for false:
#               false, off, no, n, 0
#
#
#
#       auto_failback:  determines whether a resource will
#       automatically fail back to its "primary" node, or remain
#       on whatever node is serving it until that node fails, or
#       an administrator intervenes.
#
#       The possible values for auto_failback are:
#               on      - enable automatic failbacks
#               off     - disable automatic failbacks
#               legacy  - enable automatic failbacks in systems
#                       where all nodes do not yet support
#                       the auto_failback option.
#
#       auto_failback "on" and "off" are backwards compatible with the
#               old "nice_failback on" setting.
#
#       See the FAQ for information on how to convert
#               from "legacy" to "on" without a flash cut.
#               (i.e., using a "rolling upgrade" process)
#
#       The default value for auto_failback is "legacy", which
#       will issue a warning at startup.  So, make sure you put
#       an auto_failback directive in your ha.cf file.
#       (note: auto_failback can be any boolean or "legacy")
#
auto_failback off
#
#
#       Basic STONITH support
#       Using this directive assumes that there is one stonith 
#       device in the cluster.  Parameters to this device are 
#       read from a configuration file. The format of this line is:
#
#         stonith <stonith_type> <configfile>
#
#       NOTE: it is up to you to maintain this file on each node in the
#       cluster!
#
#stonith baytech /etc/ha.d/conf/stonith.baytech
#
#       STONITH support
#       You can configure multiple stonith devices using this directive.
#       The format of the line is:
#         stonith_host <hostfrom> <stonith_type> <params...>
#         <hostfrom> is the machine the stonith device is attached
#              to or * to mean it is accessible from any host. 
#         <stonith_type> is the type of stonith device (a list of
#              supported drivers is in /usr/lib/stonith.)
#         <params...> are driver specific parameters.  To see the
#              format for a particular device, run:
#           stonith -l -t <stonith_type> 
#
#
#       Note that if you put your stonith device access information in
#       here, and you make this file publicly readable, you're asking
#       for a denial of service attack ;-)
#
#       To get a list of supported stonith devices, run
#               stonith -L
#       For detailed information on which stonith devices are supported
#       and their detailed configuration options, run this command:
#               stonith -h
#
#stonith_host *     baytech 10.0.0.3 mylogin mysecretpassword
#stonith_host ken3  rps10 /dev/ttyS1 kathy 0 
#stonith_host kathy rps10 /dev/ttyS1 ken3 0 
#
#       Watchdog is the watchdog timer.  If our own heart doesn't beat
#       for a minute, then our machine will reboot.
#       NOTE: If you are using the software watchdog, you very likely
#       wish to load the module with the parameter "nowayout=0" or
#       compile it without CONFIG_WATCHDOG_NOWAYOUT set. Otherwise even
#       an orderly shutdown of heartbeat will trigger a reboot, which is
#       very likely NOT what you want.
#
watchdog /dev/watchdog
#       
#       Tell what machines are in the cluster
#       node    nodename ...    -- must match uname -n
node    NODE1
node    NODE2
#
#       Less common options...
#
#       Treats 10.10.10.254 as a pseudo-cluster-member
#       Used together with ipfail below...
#       note: don't use a cluster node as ping node
#
ping 10.213.0.1
#
#       Treats 10.10.10.254 and 10.10.10.253 as a pseudo-cluster-member
#       called group1. If either 10.10.10.254 or 10.10.10.253 are up
#       then group1 is up
#       Used together with ipfail below...
#
#ping_group group1 10.10.10.254 10.10.10.253
#
#       HBA ping directive for Fiber Channel
#       Treats fc-card-name as pseudo-cluster-member
#       used with ipfail below ...
#
#       You can obtain HBAAPI from http://hbaapi.sourceforge.net.  You
#       need to get the library specific to your HBA directly from the
#       vendor.  To install HBAAPI stuff, all you need to do is to
#       compile the common part you obtained from the sourceforge. This
#       will produce libHBAAPI.so which you need to copy to /usr/lib.
#       You also need to copy hbaapi.h to /usr/include.
#
#       The fc-card-name is the name obtained from the hbaapitest
#       program that is part of the hbaapi package. Running hbaapitest
#       will produce a verbose output. One of the first lines is
#       similar to:
#               Apapter number 0 is named: qlogic-qla2200-0
#       Here fc-card-name is qlogic-qla2200-0.
#
#hbaping fc-card-name
#
#
#       Processes started and stopped with heartbeat.  Restarted unless
#               they exit with rc=100
#
#respawn userid hacluster
respawn hacluster /usr/lib/heartbeat/ipfail
#
#       Access control for client api
#               default is no access
#
#apiauth client-name gid=gidlist uid=uidlist
#apiauth ipfail gid=haclient uid=hacluster

###########################
#
#       Unusual options.
#
###########################
#
#       hopfudge maximum hop count minus number of nodes in config
#hopfudge 1
#
#       deadping - dead time for ping nodes
#deadping 30
#
#       hbgenmethod - Heartbeat generation number creation method
#               Normally these are stored on disk and incremented as
#               needed.
#hbgenmethod time
#
#       realtime - enable/disable realtime execution (high priority,
#               etc.)
#               defaults to on
#realtime off
#
#       debug - set debug level
#               defaults to zero
debug 1
#
#       API Authentication - replaces the fifo-permissions-based system
#       of the past
#
#
#       You can put a uid list and/or a gid list.
#       If you put both, then a process is authorized if it qualifies
#       under either the uid list, or under the gid list.
#
#       The groupname "default" has special meaning.  If it is
#       specified, then this will be used for authorizing groupless
#       clients, and any client groups not otherwise specified.
#
#       There is a subtle exception to this.  "default" will never be
#       used in the following cases (actual default auth directives
#       noted in brackets)
#                 ipfail        (uid=HA_CCMUSER)
#                 ccm           (uid=HA_CCMUSER)
#                 ping          (gid=HA_APIGROUP)
#                 cl_status     (gid=HA_APIGROUP)
#
#       This is done to avoid creating a gaping security hole and
#       matches the most likely desired configuration.
#
#apiauth ipfail uid=hacluster
#apiauth ccm uid=hacluster
#apiauth cms uid=hacluster
#apiauth ping gid=haclient uid=alanr,root
#apiauth default gid=haclient

#       message format on the wire, it can be classic or netstring,
#       default: classic
#msgfmt  classic/netstring

#       Do we use logging daemon?
#       If logging daemon is used, logfile/debugfile/logfacility in
#       this file are not meaningful any longer. You should check the
#       config file for logging daemon (the default is /etc/logd.cf)
#       more information can be found in
#       http://www.linux-ha.org/ha_2ecf_2fUseLogdDirective
#       Setting use_logd to "yes" is recommended
#
# use_logd yes/no
#
#       the interval we reconnect to logging daemon if the previous
#       connection failed
#       default: 60 seconds
#conn_logd_time 60
#
#
#       Configure compression module
#       It could be zlib or bz2, depending on whether you have the
#       corresponding library in the system.
#compression    bz2
#
#       Configure compression threshold
#       This value determines the threshold to compress a message,
#       e.g. if the threshold is 1, then any message with size greater
#       than 1 KB will be compressed, the default is 2 (KB)
#compression_threshold 2

#respawn hacluster /usr/lib/heartbeat/dopd
#apiauth dopd gid=haclient uid=hacluster
----
/etc/drbd.conf:
global {
    # By default we load the module with a minor-count of 32. In case
    # you have more devices in your config, the module gets loaded with
    # a minor-count that ensures that you have 10 minors spare.
    # In case 10 spare minors are too little for you, you can set the
    # minor-count explicitly here. ( Note, in contrast to DRBD-0.7 an
    # unused, spare minor has only a very little overhead of allocated
    # memory (a single pointer to be exact). )
    #
    # minor-count 64;

    # The user dialog counts and displays the seconds it waited so
    # far. You might want to disable this if you have the console
    # of your server connected to a serial terminal server with
    # limited logging capacity.
    # The Dialog will print the count each 'dialog-refresh' seconds,
    # set it to 0 to disable redrawing completely. [ default = 1 ]
    #
    # dialog-refresh 5; # 5 seconds

    # You might disable one of drbdadm's sanity checks.
    # disable-ip-verification;

    # Participate in DRBD's online usage counter at
    # http://usage.drbd.org
    # possible options: ask, yes, no. Default is ask. In case you do not
    # know, set it to ask, and follow the on screen instructions later.
    usage-count yes;
}

#
# The common section can have all the sections a resource can have but
# not the host section (started with the "on" keyword).
# The common section must precede all resources.
# All resources inherit the settings from the common section.
# Whereas settings in the resources have precedence over the common
# setting.
#

common {
  handlers {
#    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
  syncer { rate 30M; }
}

#
# this need not be r#, you may use phony resource names,
# like "resource web" or "resource mail", too
#

resource drbd0 {
  protocol C;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
  }

  startup {
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
#    fencing   resource-only;
    on-io-error   detach;
  }

  net {
    timeout       120;   
    connect-int   20;   
    ping-int      20;  
    ping-timeout   5; 
    max-buffers     2048;
    unplug-watermark   128;
    max-epoch-size  2048;
    ko-count 4;
    cram-hmac-alg "sha1";
    shared-secret "adhostdrbd";
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }

  syncer {
    rate 30M;
    al-extents 257;
  }

  on NODE2 {
    device     /dev/drbd0;
    disk       /dev/sda5;
    address    10.213.2.12:7788;
    flexible-meta-disk  internal;

  }

  on NODE1 {
    device    /dev/drbd0;
    disk      /dev/sda5;
    address   10.213.2.11:7788;
    meta-disk internal;
  }
}
----
/var/lib/heartbeat/crm/cib.xml:
 <cib epoch="4" admin_epoch="2" have_quorum="true" ignore_dtd="false"
num_peers="2" cib_feature_revision="2.0" generated="true"
num_updates="1" cib-last-written="Tue Jan 29 15:45:03 2008"
ccm_transition="2" dc_uuid="bf75c700-e99c-498f-8cdb-98cd53d1cdb4">
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes>
           <nvpair id="cib-bootstrap-options-dc-version"
name="dc-version" value="2.1.3-node: 552305612591183b1628baa5bc6e903e0f1e26a3"/>
         </attributes>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node id="bf75c700-e99c-498f-8cdb-98cd53d1cdb4" uname="NODE2"
type="normal"/>
       <node id="90cda71e-dee8-4406-87d4-10c4514ed573" uname="NODE1"
type="normal"/>
     </nodes>
     <resources>
       <primitive id="ip_resource" class="ocf" type="IPaddr"
provider="heartbeat">
         <instance_attributes id="ma-ip">
           <attributes>
             <nvpair id="CLUSTER" name="ip" value="10.213.2.1"/>
           </attributes>
         </instance_attributes>
       </primitive>
       <master_slave id="ms-drbd0">
         <meta_attributes id="ma-ms-drbd0">
           <attributes>
             <nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
             <nvpair id="ma-ms-drbd0-2" name="clone_node_max"
value="1"/>
             <nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
             <nvpair id="ma-ms-drbd0-4" name="master_node_max"
value="1"/>
             <nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
             <nvpair id="ma-ms-drbd0-6" name="globally_unique"
value="false"/>
             <nvpair id="ma-ms-drbd0-7" name="target_role"
value="started"/>
           </attributes>
         </meta_attributes>
         <primitive id="drbd0" class="ocf" provider="heartbeat"
type="drbd">
           <instance_attributes id="ia-drbd0">
             <attributes>
               <nvpair id="ia-drbd0-1" name="drbd_resource"
value="drbd0"/>
             </attributes>
           </instance_attributes>
         </primitive>
       </master_slave>
       <primitive class="ocf" provider="heartbeat" type="Filesystem"
id="fs0">
         <meta_attributes id="ma-fs0">
           <attributes>
             <nvpair name="target_role" id="ma-fs0-1" value="started"/>
           </attributes>
         </meta_attributes>
         <instance_attributes id="ia-fs0">
           <attributes>
             <nvpair id="ia-fs0-1" name="fstype" value="ext3"/>
             <nvpair id="ia-fs0-2" name="directory"
value="/replicated"/>
             <nvpair id="ia-fs0-3" name="device" value="/dev/drbd0"/>
           </attributes>
         </instance_attributes>
       </primitive>
     </resources>
     <constraints>
       <rsc_location id="drbd0-placement-1" rsc="ms-drbd0">
         <rule id="drbd0-rule-1" score="-INFINITY">
           <expression id="exp-01" value="NODE1" attribute="#uname"
operation="ne"/>
           <expression id="exp-02" value="NODE2" attribute="#uname"
operation="ne"/>
         </rule>
       </rsc_location>
       <rsc_order id="drbd0_before_fs0" from="fs0" action="start"
to="ms-drbd0" to_action="promote"/>
       <rsc_colocation id="fs0_on_drbd0" to="ms-drbd0" to_role="master"
from="fs0" score="infinity"/>
       <rsc_location id="cli-standby-ip_resource" rsc="ip_resource">
         <rule id="cli-standby-rule-ip_resource" score="-INFINITY">
           <expression id="cli-standby-expr-ip_resource"
attribute="#uname" operation="eq" value="NODE2" type="string"/>
         </rule>
       </rsc_location>
     </constraints>
   </configuration>
 </cib>