Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I have been preparing to move some of our services from heartbeat to pacemaker, and I've been trying various versions of the pieces. Currently a very good combination -- we're using Debian lenny on a 64-bit architecture -- is this:

  drbd8-utils and kernel module  8.3.8
  cluster-agents                 1.0.3-3
  cluster-glue                   1.0.5-2
  corosync                       1.2.1-1
  heartbeat                      3.0.3-2
  pacemaker                      1.0.9.1-1
  pacemaker-mgmt                 2.0.0-0

The test service is a MySQL database, which runs on a drbd partition mounted as an ext3 file system. Once the system is running, its failover performance is excellent. The difficulty is that when a node reboots, drbd starts, but then appears to get switched off by pacemaker. I can then log in and start it manually, and pacemaker takes over successfully as far as failover goes.

Here are some relevant messages from the syslog. The machine boots and drbd is initialised:

Nov 24 10:44:37 crate kernel: [ 5.958176] drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
Nov 24 10:44:37 crate kernel: [ 5.958187] drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@byrd, 2010-11-24 07:59:52
Nov 24 10:44:37 crate kernel: [ 5.958194] drbd: registered as block device major 147
Nov 24 10:44:37 crate kernel: [ 5.958198] drbd: minor_table @ 0xffff8800fe345c00
Nov 24 10:44:37 crate kernel: [ 6.002434] block drbd0: Starting worker thread (from cqueue [1040])
Nov 24 10:44:37 crate kernel: klogd 1.5.0, ---------- state change ----------
Nov 24 10:44:37 crate kernel: [ 6.004881] block drbd0: disk( Diskless -> Attaching )
Nov 24 10:44:37 crate kernel: [ 6.006149] block drbd0: Found 6 transactions (10 active extents) in activity log.
Nov 24 10:44:37 crate kernel: [ 6.006158] block drbd0: Method to ensure write ordering: barrier
Nov 24 10:44:37 crate kernel: [ 6.006164] block drbd0: max_segment_size ( = BIO size ) = 4096
Nov 24 10:44:37 crate kernel: [ 6.006169] block drbd0: drbd_bm_resize called with capacity == 62912568
Nov 24 10:44:37 crate kernel: [ 6.006358] block drbd0: resync bitmap: bits=7864071 words=122877
Nov 24 10:44:37 crate kernel: [ 6.006364] block drbd0: size = 30 GB (31456284 KB)
Nov 24 10:44:37 crate kernel: [ 6.008444] block drbd0: recounting of set bits took additional 1 jiffies
Nov 24 10:44:37 crate kernel: [ 6.008453] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 24 10:44:37 crate kernel: [ 6.008462] block drbd0: Marked additional 24 MB as out-of-sync based on AL.
Nov 24 10:44:37 crate kernel: [ 6.008897] block drbd0: disk( Attaching -> UpToDate )
Nov 24 10:44:37 crate kernel: [ 6.018084] block drbd0: conn( StandAlone -> Unconnected )
Nov 24 10:44:37 crate kernel: [ 6.018108] block drbd0: Starting receiver thread (from drbd0_worker [1048])
Nov 24 10:44:37 crate kernel: [ 6.020691] block drbd0: receiver (re)started
Nov 24 10:44:37 crate kernel: [ 6.020701] block drbd0: conn( Unconnected -> WFConnection )
Nov 24 10:44:37 crate kernel: [ 6.121932] block drbd0: Handshake successful: Agreed network protocol version 94
Nov 24 10:44:37 crate kernel: [ 6.121948] block drbd0: conn( WFConnection -> WFReportParams )
Nov 24 10:44:37 crate kernel: [ 6.121987] block drbd0: Starting asender thread (from drbd0_receiver [1057])
Nov 24 10:44:37 crate kernel: [ 6.122062] block drbd0: data-integrity-alg: <not-used>
Nov 24 10:44:37 crate kernel: [ 6.122155] block drbd0: max_segment_size ( = BIO size ) = 4096
Nov 24 10:44:37 crate kernel: [ 6.122168] block drbd0: drbd_sync_handshake:
Nov 24 10:44:37 crate kernel: [ 6.122177] block drbd0: self 8E146413568764F4:0000000000000000:E8FF2CDAEC98CDBC:9D3893C50146C439 bits:6144 flags:0
Nov 24 10:44:37 crate kernel: [ 6.122188] block drbd0: peer 33430C3DE89049B3:8E146413568764F4:E8FF2CDAEC98CDBC:9D3893C50146C439 bits:24 flags:0
Nov 24 10:44:37 crate kernel: [ 6.122198] block drbd0: uuid_compare()=-1 by rule 50
Nov 24 10:44:37 crate kernel: [ 6.122207] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Nov 24 10:44:37 crate kernel: [ 6.152120] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Nov 24 10:44:37 crate kernel: [ 6.155392] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Nov 24 10:44:37 crate kernel: [ 6.162690] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Nov 24 10:44:37 crate kernel: [ 6.162701] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Nov 24 10:44:37 crate kernel: [ 6.162709] block drbd0: Began resync as SyncTarget (will sync 24580 KB [6145 bits set]).

After this corosync starts:

Nov 24 10:44:37 crate corosync[1105]: [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.

and soon service synchronisation is complete:

Nov 24 10:44:57 crate corosync[1105]: [MAIN ] Completed service synchronization, ready to provide service.
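As an aside, these state transitions can also be watched directly via /proc/drbd. A small sketch that pulls the cs:/ro:/ds: fields out for a given minor -- the sample line below merely imitates the /proc/drbd format and is not taken from the cluster itself:

```shell
#!/bin/sh
# Extract the connection state (cs:), roles (ro:) and disk states (ds:)
# for one drbd minor from /proc/drbd-style input on stdin.
drbd_state() {
    awk -v minor="$1" '$1 == (minor ":") {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^(cs|ro|ds):/)
                printf "%s%s", (i > 2 ? " " : ""), $i
        print ""
    }'
}

# Illustrative sample line in /proc/drbd status format:
sample='0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----'
printf '%s\n' "$sample" | drbd_state 0
# -> cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate
```

On a live node one would run `drbd_state 0 < /proc/drbd` instead of feeding it a sample.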
Soon pacemaker starts probing the services, and then this happens:

Nov 24 10:44:57 crate mysql[1268]: ERROR: Config /etc/mysql/my.cnf doesn't exist
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Mysql_monitor_0 (call=2, rc=7, cib-update=7, confirmed=true) not running
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Wmysql3-IP_monitor_0 (call=4, rc=7, cib-update=8, confirmed=true) not running
Nov 24 10:44:57 crate attrd: [1145]: info: attrd_trigger_update: Sending flush op to all hosts for: master-res-Drbd-R0:0 (10000)
Nov 24 10:44:57 crate attrd: [1145]: info: attrd_perform_update: Sent update 15: master-res-Drbd-R0:0=10000
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Drbd-R0:0_monitor_0 (call=5, rc=0, cib-update=9, confirmed=true) ok
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Wmysql3-Fs_monitor_0 (call=3, rc=7, cib-update=10, confirmed=true) not running
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Ping:0:6: probe
Nov 24 10:44:58 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Ping:0_monitor_0 (call=6, rc=7, cib-update=11, confirmed=true) not running
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_perform_update: Sent update 18: probe_complete=true
Nov 24 10:44:58 crate crmd: [1147]: info: do_lrm_rsc_op: Performing key=61:6:0:d7f80eee-bab7-46fc-9cec-cd9f26ae852b op=res-Drbd-R0:0_notify_0 )
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Drbd-R0:0:7: notify
Nov 24 10:44:58 crate crmd: [1147]: info: do_lrm_rsc_op: Performing key=45:6:0:d7f80eee-bab7-46fc-9cec-cd9f26ae852b op=res-Ping:0_start_0 )
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Ping:0:8: start
Nov 24 10:44:58 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Drbd-R0:0_notify_0 (call=7, rc=0, cib-update=12, confirmed=true) ok
Nov 24 10:44:58 crate crmd: [1147]: info: do_lrm_rsc_op: Performing key=16:6:0:d7f80eee-bab7-46fc-9cec-cd9f26ae852b op=res-Drbd-R0:0_stop_0 )
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Drbd-R0:0:9: stop
Nov 24 10:44:58 crate lrmd: [1144]: info: RA output: (res-Drbd-R0:0:stop:stdout)
Nov 24 10:44:58 crate kernel: [ 27.313216] block drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Nov 24 10:44:58 crate kernel: [ 27.313254] block drbd0: short read expecting header on sock: r=-512
Nov 24 10:44:58 crate kernel: [ 27.313286] block drbd0: asender terminated
Nov 24 10:44:58 crate kernel: [ 27.313292] block drbd0: Terminating asender thread
Nov 24 10:44:58 crate kernel: [ 27.313603] block drbd0: Connection closed
Nov 24 10:44:58 crate kernel: [ 27.313638] block drbd0: conn( Disconnecting -> StandAlone )
Nov 24 10:44:58 crate kernel: [ 27.313661] block drbd0: receiver terminated
Nov 24 10:44:58 crate kernel: [ 27.313666] block drbd0: Terminating receiver thread
Nov 24 10:44:58 crate kernel: [ 27.313753] block drbd0: disk( UpToDate -> Diskless )
Nov 24 10:44:58 crate kernel: [ 27.313820] block drbd0: drbd_bm_resize called with capacity == 0
Nov 24 10:44:58 crate kernel: [ 27.313892] block drbd0: worker terminated
Nov 24 10:44:58 crate kernel: [ 27.313897] block drbd0: Terminating worker thread
Nov 24 10:44:58 crate kernel: [ 27.322840] block drbd0: State change failed: Disk state is lower than outdated
Nov 24 10:44:58 crate kernel: [ 27.322850] block drbd0: state = { cs:StandAlone ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
Nov 24 10:44:58 crate kernel: [ 27.322857] block drbd0: wanted = { cs:StandAlone ro:Secondary/Unknown ds:Outdated/DUnknown r--- }
Nov 24 10:44:58 crate lrmd: [1144]: info: RA output: (res-Drbd-R0:0:stop:stdout)
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_trigger_update: Sending flush op to all hosts for: master-res-Drbd-R0:0 (-INFINITY)
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_perform_update: Sent update 20: master-res-Drbd-R0:0=-INFINITY
Nov 24 10:44:58 crate lrmd: [1144]: info: RA output: (res-Drbd-R0:0:stop:stdout)

Now, I'd like to know why pacemaker is shutting drbd down. If I cat /proc/drbd before pacemaker gets going, I see this:

# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@byrd, 2010-11-24 07:59:52
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

Then, two seconds later:

# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root@byrd, 2010-11-24 07:59:52
 0: cs:Unconfigured

Since it is entirely possible that this is caused by a misconfiguration of pacemaker, I include the configuration below.

-- Owen Le Blanc

node crate
node creel
primitive res-Drbd-R0 ocf:linbit:drbd \
        operations $id="res-Drbd-R0-operations" \
        op monitor interval="20" role="Slave" timeout="20" start-delay="1m" \
        op monitor interval="10" role="Master" timeout="20" start-delay="1m" \
        params drbd_resource="r0"
primitive res-Mysql ocf:heartbeat:mysql \
        meta target-role="Started" \
        operations $id="res-Mysql-operations" \
        op monitor interval="10" timeout="30" \
        params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysql.sock"
primitive res-Ping ocf:pacemaker:ping \
        operations $id="res-Ping-operations" \
        op monitor interval="10" timeout="60" \
        params dampen="10s" host_list="130.88.212.250 130.88.203.250 130.88.200.250" multiplier="100"
primitive res-Wmysql3-Fs ocf:heartbeat:Filesystem \
        meta target-role="started" \
        operations $id="res-Wmysql3-Fs-operations" \
        op monitor interval="20" timeout="40" \
        params device="/dev/drbd0" directory="/var/data" fstype="ext3"
primitive res-Wmysql3-IP ocf:heartbeat:IPaddr2 \
        meta target-role="started" \
        operations $id="res-Wmysql3-IP-operations" \
        op monitor interval="10s" timeout="20s" \
        params ip="10.99.212.247"
ms ms-Drbd-R0 res-Drbd-R0 \
        meta target-role="Master" clone-max="2" globally-unique="false" notify="true" is-managed="true"
clone clo-Ping res-Ping \
        meta target-role="started"
colocation col-Fs-Drbd +inf: res-Wmysql3-Fs ms-Drbd-R0:Master
colocation col-Fs-Mysql +inf: res-Mysql res-Wmysql3-Fs
colocation col-Ip-Drbd +inf: ms-Drbd-R0 res-Wmysql3-IP
order ord-Drbd-Fs +inf: ms-Drbd-R0:promote res-Wmysql3-Fs:start
order ord-Fs-Mysql +inf: res-Wmysql3-Fs res-Mysql
order ord-Ip-Drbd +inf: res-Wmysql3-IP ms-Drbd-R0
property $id="cib-bootstrap-options" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        default-resource-stickiness="20" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        default-action-timeout="60s" \
        dc-version="1.0.9-unknown"