Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Greetings,

I've just started playing with pacemaker/corosync on a two-node setup. Prior to doing anything with pacemaker & corosync, the DRBD setup was working fine. At this point I'm just experimenting and trying to get a good feel for how things work. Eventually I'd like to start using this in a production environment.

I'm running Fedora 16 x86_64 with drbd-8.3.11, pacemaker-1.1.7 & corosync-1.4.3.

I've verified that pacemaker does the right thing when initially configured. Specifically:

* the floating static IP is brought up on the master
* DRBD is brought up correctly, with a master & slave in sync
* the local DRBD-backed filesystem is mounted correctly on the master

Here's the drbd resource configuration:

#########
resource r0 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;

    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }

    net {
        after-sb-0pri discard-zero-changes;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
        data-integrity-alg md5;
        sndbuf-size 0;
    }

    syncer {
        rate 30M;
        verify-alg sha1;
        csums-alg md5;
    }

    on farm-ljf0 {
        address 10.31.99.165:7789;
    }

    on farm-ljf1 {
        address 10.31.99.166:7789;
    }
}
#########

farm-ljf1 used to be the master for all resources. I stopped corosync on that node, intending to fail everything over to farm-ljf0. Since I did that, here's how things look:

##########
[root@farm-ljf0 ~]# crm status
============
Last updated: Mon Oct  1 13:06:07 2012
Last change: Mon Oct  1 12:17:16 2012 via cibadmin on farm-ljf1
Stack: openais
Current DC: farm-ljf0 - partition WITHOUT quorum
Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ farm-ljf0 ]
OFFLINE: [ farm-ljf1 ]

 Master/Slave Set: FS0_Clone [FS0]
     Masters: [ farm-ljf0 ]
     Stopped: [ FS0:1 ]

Failed actions:
    FS0_drbd_start_0 (node=farm-ljf0, call=53, rc=1, status=complete): unknown error
##########

I looked in /var/log/cluster/corosync.log from the time of the attempted failover and spotted the following:

#########
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: rsc:FS0_drbd:53: start
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) blockdev:
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) cannot open /dev/drbd0
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) :
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) Wrong medium type
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: block device /dev/drbd0 is write-protected, mounting read-only
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr) mount: Wrong medium type
Oct 01 12:56:18 farm-ljf0 lrmd: [924]: info: RA output: (FS0_drbd:start:stderr)
Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: process_lrm_event: LRM operation FS0_drbd_start_0 (call=53, rc=1, cib-update=532, confirmed=true) unknown error
Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: status_from_rc: Action 40 (FS0_drbd_start_0) on farm-ljf0 failed (target: 0 vs. rc: 1): Error
Oct 01 12:56:18 farm-ljf0 crmd: [927]: WARN: update_failcount: Updating failcount for FS0_drbd on farm-ljf0 after failed start: rc=1 (update=INFINITY, time=1349121378)
Oct 01 12:56:18 farm-ljf0 crmd: [927]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=FS0_drbd_last_failure_0, magic=0:1;40:287:0:655c1af8-d2e8-4dfa-b084-4d4d36be8ade, cib=0.34.33) : Event failed
#########

To my eyes, it looks like the attempt to mount the DRBD-backed storage failed. I don't understand why, since after the failover I can manually mount it using the exact same parameters as in the configuration (which worked fine on the master). Perhaps there's some weird race condition where it tries to mount before DRBD has finished failing over?
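For reference, here's roughly what I mean by mounting it manually (a sketch of the commands rather than a capture of the actual session; r0, /dev/drbd0 and /mnt/sdb1 are the names from my configuration above):

#########
# on farm-ljf0, after the failed FS0_drbd start
cat /proc/drbd                       # connection state and roles
drbdadm role r0                      # I believe this has to report Primary before a read-write mount can succeed
drbdadm primary r0                   # promote by hand if it is still Secondary
mount -t xfs /dev/drbd0 /mnt/sdb1    # same device/directory/fstype as the FS0_drbd primitive
umount /mnt/sdb1                     # undo the manual test before letting the cluster retry
#########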
None of that explains why the failover IP didn't come up on the (old) slave. I don't see any errors or failures in the log with respect to ClusterIP. All I see is:

#########
Oct 01 12:56:17 farm-ljf0 pengine: [926]: notice: LogActions: Move ClusterIP (Started farm-ljf1 -> farm-ljf0)
Oct 01 12:56:17 farm-ljf0 crmd: [927]: info: te_rsc_command: Initiating action 41: stop ClusterIP_stop_0 on farm-ljf1
#########

It looks like it never even tries to bring it up on the (old) slave.

Anyway, here's the configuration I was using when all of the above transpired:

##########
[root@farm-ljf0 ~]# crm configure show
node farm-ljf0 \
        attributes standby="off"
node farm-ljf1
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.31.97.100" cidr_netmask="22" nic="eth1" \
        op monitor interval="10s" \
        meta target-role="Started"
primitive FS0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="10s" role="Master" \
        op monitor interval="30s" role="Slave"
primitive FS0_drbd ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/mnt/sdb1" fstype="xfs"
group g_services FS0_drbd ClusterIP
ms FS0_Clone FS0 \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location cli-prefer-ClusterIP ClusterIP \
        rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq farm-ljf1
colocation fs0_on_drbd inf: g_services FS0_Clone:Master
order FS0_drbd-after-FS0 inf: FS0_Clone:promote g_services
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
##########

thanks!
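P.S. For what it's worth, once the mount question is sorted out I expect to clear the failure record and let the cluster retry with something like the following (again just a sketch; I'm assuming "crm resource cleanup" also resets the INFINITY failcount that shows up in the log above):

#########
crm resource cleanup FS0_drbd    # clear the failed start recorded on farm-ljf0
crm_mon -1                       # one-shot status check to see what the cluster does next
#########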