Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
For consistency I am posting this reply to both lists the question was asked on. My apologies to those who are subscribed to both!
----- Original Message -----
> From: "Lonni J Friedman" <netllama at gmail.com>
> To: drbd-user at lists.linbit.com
> Sent: Monday, October 1, 2012 4:36:40 PM
> Subject: [DRBD-user] two node (master/slave) failover not working
>
> Greetings,
> I've just started playing with pacemaker/corosync on a two node
> setup.
> Prior to doing anything with pacemaker & corosync, the DRBD setup was
> working fine. At this point I'm just experimenting, and trying to get
> a good feel for how things work. Eventually I'd like to start using
> this in a production environment. I'm running Fedora16-x86_64 with
> drbd-8.3.11, pacemaker-1.1.7 & corosync-1.4.3. I've verified that
> pacemaker is doing the right thing when initially configured.
> Specifically:
> * the floating static IP is brought up on the master
> * DRBD is brought up correctly with a master & slave in sync
> * the local DRBD backed storage filesystem mount point is mounted
> correctly on the master
>
> Here's the drbd resource configuration:
> #########
> resource r0 {
>     device /dev/drbd0;
>     disk /dev/sdb1;
>     meta-disk internal;
>     handlers {
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     }
>     net {
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri consensus;
>         after-sb-2pri disconnect;
>         data-integrity-alg md5;
>         sndbuf-size 0;
>     }
>     syncer {
>         rate 30M;
>         verify-alg sha1;
>         csums-alg md5;
>     }
>     on farm-ljf0 {
>         address 10.31.99.165:7789;
>     }
>     on farm-ljf1 {
>         address 10.31.99.166:7789;
>     }
> }
> #########
>
>
> farm-ljf1 used to be the master for all resources. I stopped corosync,
> intending to fail everything over to farm-ljf0. Since I did that,
> here's how things look:
> ##########
> [root at farm-ljf0 ~]# crm status
> ============
> Last updated: Mon Oct 1 13:06:07 2012
> Last change: Mon Oct 1 12:17:16 2012 via cibadmin on farm-ljf1
> Stack: openais
> Current DC: farm-ljf0 - partition WITHOUT quorum
> Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ farm-ljf0 ]
> OFFLINE: [ farm-ljf1 ]
>
> Master/Slave Set: FS0_Clone [FS0]
> Masters: [ farm-ljf0 ]
> Stopped: [ FS0:1 ]
>
> Failed actions:
>     FS0_drbd_start_0 (node=farm-ljf0, call=53, rc=1, status=complete): unknown error
> ##########
>
> I looked in /var/log/cluster/corosync.log from the time when I
> attempted the failover, and spotted the following:
> #########
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: rsc:FS0_drbd:53: start
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) blockdev:
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) cannot open /dev/drbd0
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) :
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) Wrong medium type
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: block device /dev/drbd0 is write-protected, mounting read-only
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: Wrong medium type
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: process_lrm_event: LRM operation FS0_drbd_start_0 (call=53, rc=1, cib-update=532, confirmed=true) unknown error
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: status_from_rc: Action 40 (FS0_drbd_start_0) on farm-ljf0 failed (target: 0 vs. rc: 1): Error
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: update_failcount: Updating failcount for FS0_drbd on farm-ljf0 after failed start: rc=1 (update=INFINITY, time=1349121378)
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=FS0_drbd_last_failure_0, magic=0:1;40:287:0:655c1af8-d2e8-4dfa-b084-4d4d36be8ade, cib=0.34.33) : Event failed
> #########
>
> To my eyes, it looks like the attempt to mount the drbd backed storage
> failed. I don't understand why, as I can manually mount it using the
> exact same parameters in the configuration (which worked fine on the
> master) after the failover. Perhaps there's some weird race condition
> occurring where it tries to mount before the drbd server has failed
> over?
>
> None of that explains why the failover IP didn't come up on the (old)
> slave. I don't see any errors or failures in the log with respect to
> ClusterIP. All I see is:
> #########
> Oct 01 12:56:17 farm-ljf0 pengine: [926]: notice: LogActions: Move ClusterIP (Started farm-ljf1 -> farm-ljf0)
> Oct 01 12:56:17 farm-ljf0 crmd: [927]: info: te_rsc_command: Initiating action 41: stop ClusterIP_stop_0 on farm-ljf1
> #########
>
> It looks like it never even tries to bring it up on the (old) slave.
>
> Anyway, here's the configuration that I was using when all of the
> above transpired:
> ##########
> [root at farm-ljf0 ~]# crm configure show
> node farm-ljf0 \
>     attributes standby="off"
> node farm-ljf1
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>     params ip="10.31.97.100" cidr_netmask="22" nic="eth1" \
>     op monitor interval="10s" \
>     meta target-role="Started"
> primitive FS0 ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="10s" role="Master" \
>     op monitor interval="30s" role="Slave"
> primitive FS0_drbd ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/mnt/sdb1" fstype="xfs"
> group g_services FS0_drbd ClusterIP
> ms FS0_Clone FS0 \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> location cli-prefer-ClusterIP ClusterIP \
>     rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq farm-ljf1
This location constraint effectively pins ClusterIP to the node named farm-ljf1, because the rule carries a score of infinity. If you only want farm-ljf1 to be preferred, set the score to something like 100: instead.
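If that's what you want, a finite score on the same rule would look roughly like this (untested sketch, reusing your existing rule id):
location cli-prefer-ClusterIP ClusterIP \
    rule $id="cli-prefer-rule-ClusterIP" 100: #uname eq farm-ljf1
Also, cli-prefer-* constraints are typically what "crm resource move/migrate" leaves behind; if that's how this one got created, "crm resource unmove ClusterIP" should remove it again.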
> colocation fs0_on_drbd inf: g_services FS0_Clone:Master
> order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services
When you specify an action for one resource in an order statement, it is inherited by the remaining resources unless they explicitly define their own - so this effectively becomes:
order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services:promote
You can't promote the resources that are part of the g_services group (promote is not a supported action for them). Change this to:
order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services:start
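Once the constraint is fixed you'll probably also have to clear the failed start on farm-ljf0 (your log shows the failcount was pushed to INFINITY), otherwise pacemaker won't retry FS0_drbd there. Something along these lines should do it (sketch, using the crm tools you already have):
crm resource cleanup FS0_drbd
crm_mon -1
crm_mon -1 just prints the status once so you can confirm the filesystem and ClusterIP come up on farm-ljf0.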
HTH
Jake
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore"
> ##########
>
>
> thanks!
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>