Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Greetings,

I've just started playing with pacemaker/corosync on a two-node setup. Prior to doing anything with pacemaker & corosync, the DRBD setup was working fine. At this point I'm just experimenting and trying to get a good feel for how things work. Eventually I'd like to start using this in a production environment.

I'm running Fedora 16 x86_64 with drbd-8.3.11, pacemaker-1.1.7 & corosync-1.4.3.

I've verified that pacemaker does the right thing when initially configured. Specifically:

* the floating static IP is brought up on the master
* DRBD is brought up correctly, with a master & slave in sync
* the local DRBD-backed filesystem is mounted correctly on the master

Here's the drbd resource configuration:

#########
resource r0 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;

    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }

    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
        data-integrity-alg md5;
        sndbuf-size 0;
    }

    syncer {
        rate 30M;
        verify-alg sha1;
        csums-alg md5;
    }

    on farm-ljf0 {
        address 10.31.99.165:7789;
    }

    on farm-ljf1 {
        address 10.31.99.166:7789;
    }
}
#########

farm-ljf1 used to be the master for all resources. I stopped corosync on that node, intending to fail everything over to farm-ljf0. Since I did that, here's how things look:

##########
[root@farm-ljf0 ~]# crm status
============
Last updated: Mon Oct  1 13:06:07 2012
Last change: Mon Oct  1 12:17:16 2012 via cibadmin on farm-ljf1
Stack: openais
Current DC: farm-ljf0 - partition WITHOUT quorum
Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ farm-ljf0 ]
OFFLINE: [ farm-ljf1 ]

 Master/Slave Set: FS0_Clone [FS0]
     Masters: [ farm-ljf0 ]
     Stopped: [ FS0:1 ]

Failed actions:
    FS0_drbd_start_0 (node=farm-ljf0, call=53, rc=1, status=complete): unknown error
##########

I looked in /var/log/cluster/corosync.log from the time of the attempted failover and spotted the following:

#########
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: rsc:FS0_drbd:53: start
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) blockdev:
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) cannot open /dev/drbd0
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) :
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) Wrong medium type
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: block device /dev/drbd0 is write-protected, mounting read-only
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: Wrong medium type
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: process_lrm_event: LRM operation FS0_drbd_start_0 (call=53, rc=1, cib-update=532, confirmed=true) unknown error
Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: status_from_rc: Action 40 (FS0_drbd_start_0) on farm-ljf0 failed (target: 0 vs. rc: 1): Error
Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: update_failcount: Updating failcount for FS0_drbd on farm-ljf0 after failed start: rc=1 (update=INFINITY, time=1349121378)
Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=FS0_drbd_last_failure_0, magic=0:1;40:287:0:655c1af8-d2e8-4dfa-b084-4d4d36be8ade, cib=0.34.33) : Event failed
#########

To my eyes, it looks like the attempt to mount the DRBD-backed storage failed. I don't understand why, since after the failover I can manually mount it using the exact same parameters as in the configuration (which worked fine on the master). Perhaps there's some weird race condition where it tries to mount before DRBD has finished failing over?
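For reference, here's roughly what I mean by mounting it manually (a sketch of the commands rather than a capture of the actual session; r0, /dev/drbd0 and /mnt/sdb1 are the names from my configuration above):

#########
# on farm-ljf0, after the failed FS0_drbd start
cat /proc/drbd                       # connection state and roles
drbdadm role r0                      # I believe this has to report Primary before a read-write mount can succeed
drbdadm primary r0                   # promote by hand if it is still Secondary
mount -t xfs /dev/drbd0 /mnt/sdb1    # same device/directory/fstype as the FS0_drbd primitive
umount /mnt/sdb1                     # undo the manual test before letting the cluster retry
#########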
None of that explains why the failover IP didn't come up on the (old) slave. I don't see any errors or failures in the log with respect to ClusterIP. All I see is:

#########
Oct 01 12:56:17 farm-ljf0 pengine: [926]: notice: LogActions: Move ClusterIP (Started farm-ljf1 -> farm-ljf0)
Oct 01 12:56:17 farm-ljf0 crmd: [927]: info: te_rsc_command: Initiating action 41: stop ClusterIP_stop_0 on farm-ljf1
#########

It looks like it never even tries to bring it up on the (old) slave.

Anyway, here's the configuration I was using when all of the above transpired:

##########
[root@farm-ljf0 ~]# crm configure show
node farm-ljf0 \
        attributes standby="off"
node farm-ljf1
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.31.97.100" cidr_netmask="22" nic="eth1" \
        op monitor interval="10s" \
        meta target-role="Started"
primitive FS0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10s" role="Master" \
        op monitor interval="30s" role="Slave"
primitive FS0_drbd ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/mnt/sdb1" fstype="xfs"
group g_services FS0_drbd ClusterIP
ms FS0_Clone FS0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location cli-prefer-ClusterIP ClusterIP \
        rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq farm-ljf1
colocation fs0_on_drbd inf: g_services FS0_Clone:Master
order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
##########

thanks!
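P.S. For what it's worth, once the mount question is sorted out I expect to clear the failure record and let the cluster retry with something like the following (again just a sketch; I'm assuming "crm resource cleanup" also resets the INFINITY failcount that shows up in the log above):

#########
crm resource cleanup FS0_drbd    # clear the failed start recorded on farm-ljf0
crm_mon -1                       # one-shot status check to see what the cluster does next
#########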