Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
For consistency I am posting this reply to both lists the question was asked
on.  My apologies to those that are subscribed to both!

----- Original Message -----
> From: "Lonni J Friedman" <netllama at gmail.com>
> To: drbd-user at lists.linbit.com
> Sent: Monday, October 1, 2012 4:36:40 PM
> Subject: [DRBD-user] two node (master/slave) failover not working
>
> Greetings,
> I've just started playing with pacemaker/corosync on a two node setup.
> Prior to doing anything with pacemaker & corosync, the DRBD setup was
> working fine.  At this point I'm just experimenting, and trying to get
> a good feel of how things work.  Eventually I'd like to start using this
> in a production environment.  I'm running Fedora16-x86_64 with
> drbd-8.3.11, pacemaker-1.1.7 & corosync-1.4.3.  I've verified that
> pacemaker is doing the right thing when initially configured.
> Specifically:
> * the floating static IP is brought up on the master
> * DRBD is brought up correctly with a master & slave in sync
> * the local DRBD backed storage filesystem mount point is mounted
>   correctly on the master
>
> Here's the drbd resource configuration:
> #########
> resource r0 {
>     device      /dev/drbd0;
>     disk        /dev/sdb1;
>     meta-disk   internal;
>     handlers {
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     }
>     net {
>         after-sb-0pri       discard-zero-changes;
>         after-sb-1pri       consensus;
>         after-sb-2pri       disconnect;
>         data-integrity-alg  md5;
>         sndbuf-size         0;
>     }
>     syncer {
>         rate        30M;
>         verify-alg  sha1;
>         csums-alg   md5;
>     }
>     on farm-ljf0 {
>         address 10.31.99.165:7789;
>     }
>     on farm-ljf1 {
>         address 10.31.99.166:7789;
>     }
> }
> #########
>
>
> farm-ljf1 used to be the master for all resources.  I stopped corosync,
> intending to failover everything to farm-ljf0.  Since I did that, here's
> how things look:
> ##########
> [root at farm-ljf0 ~]# crm status
> ============
> Last updated: Mon Oct 1 13:06:07 2012
> Last change: Mon Oct 1 12:17:16 2012 via cibadmin on farm-ljf1
> Stack: openais
> Current DC: farm-ljf0 - partition WITHOUT quorum
> Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ farm-ljf0 ]
> OFFLINE: [ farm-ljf1 ]
>
>  Master/Slave Set: FS0_Clone [FS0]
>      Masters: [ farm-ljf0 ]
>      Stopped: [ FS0:1 ]
>
> Failed actions:
>     FS0_drbd_start_0 (node=farm-ljf0, call=53, rc=1, status=complete):
>     unknown error
> ##########
>
> I looked in /var/log/cluster/corosync.log from the time when I attempted
> the failover, and spotted the following:
> #########
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: rsc:FS0_drbd:53: start
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) blockdev:
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) cannot open /dev/drbd0
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) :
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) Wrong medium type
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: block device /dev/drbd0 is write-protected, mounting read-only
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: Wrong medium type
> Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: process_lrm_event: LRM operation FS0_drbd_start_0 (call=53, rc=1, cib-update=532, confirmed=true) unknown error
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: status_from_rc: Action 40 (FS0_drbd_start_0) on farm-ljf0 failed (target: 0 vs. rc: 1): Error
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: update_failcount: Updating failcount for FS0_drbd on farm-ljf0 after failed start: rc=1 (update=INFINITY, time=1349121378)
> Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=FS0_drbd_last_failure_0, magic=0:1;40:287:0:655c1af8-d2e8-4dfa-b084-4d4d36be8ade, cib=0.34.33) : Event failed
> #########
>
> To my eyes, it looks like the attempt to mount the drbd backed storage
> failed.  I don't understand why, as I can manually mount it using the
> exact same parameters in the configuration (which worked fine on the
> master) after the failover.  Perhaps there's some weird race condition
> occurring where it tries to mount before the drbd server has failed over?
>
> None of that explains why the failover IP didn't come up on the (old)
> slave.  I don't see any errors or failures in the log with respect to
> ClusterIP.  All I see is:
> #########
> Oct 01 12:56:17 farm-ljf0 pengine: [926]: notice: LogActions: Move ClusterIP (Started farm-ljf1 -> farm-ljf0)
> Oct 01 12:56:17 farm-ljf0 crmd: [927]: info: te_rsc_command: Initiating action 41: stop ClusterIP_stop_0 on farm-ljf1
> #########
>
> It looks like it never even tries to bring it up on the (old) slave.
>
> Anyway, here's the configuration that I was using when all of the above
> transpired:
> ##########
> [root at farm-ljf0 ~]# crm configure show
> node farm-ljf0 \
>     attributes standby="off"
> node farm-ljf1
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>     params ip="10.31.97.100" cidr_netmask="22" nic="eth1" \
>     op monitor interval="10s" \
>     meta target-role="Started"
> primitive FS0 ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="10s" role="Master" \
>     op monitor interval="30s" role="Slave"
> primitive FS0_drbd ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/mnt/sdb1" fstype="xfs"
> group g_services FS0_drbd ClusterIP
> ms FS0_Clone FS0 \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> location cli-prefer-ClusterIP ClusterIP \
>     rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq farm-ljf1

This location constraint prevents ClusterIP from running on any node that
isn't named farm-ljf1, because it has a score of infinity.  If you only want
to express a preference for farm-ljf1, lower the score to something finite,
such as 100:.

> colocation fs0_on_drbd inf: g_services FS0_Clone:Master
> order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services

When you specify an action for one resource in an order statement, it is
inherited by the remaining resources unless they define their own action
explicitly, so this effectively becomes:

    order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services:promote

The resources in the g_services group can't be promoted (promote isn't a
supported action for them), so change this to:

    order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services:start

(Both fixes are sketched together below, after the quoted message.)

HTH

Jake

> property $id="cib-bootstrap-options" \
>     dc-version="1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore"
> ##########
>
>
> thanks!
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>
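
Putting both suggestions together, the two constraints might end up looking
roughly like this (a sketch only, reusing the resource and node names from
the configuration quoted above; the 100 score is just one example of a
finite preference, and you would edit the existing entries in place, e.g.
with "crm configure edit", rather than adding duplicates):

#########
location cli-prefer-ClusterIP ClusterIP \
    rule $id="cli-prefer-rule-ClusterIP" 100: #uname eq farm-ljf1
order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services:start
#########

After correcting the order constraint you will probably also need to clear
the INFINITY failcount left behind by the failed start (e.g. with
"crm resource cleanup FS0_drbd") before pacemaker will try to start the
filesystem on farm-ljf0 again.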