I have been preparing to move some of our services from heartbeat to
pacemaker, and I've been trying various versions of the pieces.
Currently a very good combination -- we're running 64-bit Debian
lenny -- is this:
drbd8-utils and kernel module: version 8.3.8
cluster-agents version 1.0.3-3
cluster-glue version 1.0.5-2
corosync version 1.2.1-1
heartbeat version 3.0.3-2
pacemaker version 1.0.9.1-1
pacemaker-mgmt version 2.0.0-0
The test service is a MySQL database, which runs on a drbd partition
mounted as an ext3 file system. Once the system is running, failover
performance is excellent. The difficulty is that when a node reboots,
drbd starts, but then appears to be switched off by pacemaker. I can
then log in and start it manually, after which pacemaker takes over
successfully as far as failover goes.
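For what it's worth, the manual recovery amounts to something like this (a rough sketch; the exact invocations may vary):

```shell
# Bring drbd back up by hand (r0 is the resource name from the config below)
drbdadm up r0
cat /proc/drbd        # should again show cs:Connected ds:UpToDate/UpToDate

# Then let pacemaker forget the earlier result and re-probe the resource
crm resource cleanup ms-Drbd-R0
```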
Here are some relevant log messages from the syslog:
The machine boots and drbd is initialised:
Nov 24 10:44:37 crate kernel: [ 5.958176] drbd: initialized. Version: 8.3.8 (api:88/proto:86-94)
Nov 24 10:44:37 crate kernel: [ 5.958187] drbd: GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root at byrd, 2010-11-24 07:59:52
Nov 24 10:44:37 crate kernel: [ 5.958194] drbd: registered as block device major 147
Nov 24 10:44:37 crate kernel: [ 5.958198] drbd: minor_table @ 0xffff8800fe345c00
Nov 24 10:44:37 crate kernel: [ 6.002434] block drbd0: Starting worker thread (from cqueue [1040])
Nov 24 10:44:37 crate kernel: klogd 1.5.0, ---------- state change ----------
Nov 24 10:44:37 crate kernel: [ 6.004881] block drbd0: disk( Diskless -> Attaching )
Nov 24 10:44:37 crate kernel: [ 6.006149] block drbd0: Found 6 transactions (10 active extents) in activity log.
Nov 24 10:44:37 crate kernel: [ 6.006158] block drbd0: Method to ensure write ordering: barrier
Nov 24 10:44:37 crate kernel: [ 6.006164] block drbd0: max_segment_size ( = BIO size ) = 4096
Nov 24 10:44:37 crate kernel: [ 6.006169] block drbd0: drbd_bm_resize called with capacity == 62912568
Nov 24 10:44:37 crate kernel: [ 6.006358] block drbd0: resync bitmap: bits=7864071 words=122877
Nov 24 10:44:37 crate kernel: [ 6.006364] block drbd0: size = 30 GB (31456284 KB)
Nov 24 10:44:37 crate kernel: [ 6.008444] block drbd0: recounting of set bits took additional 1 jiffies
Nov 24 10:44:37 crate kernel: [ 6.008453] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Nov 24 10:44:37 crate kernel: [ 6.008462] block drbd0: Marked additional 24 MB as out-of-sync based on AL.
Nov 24 10:44:37 crate kernel: [ 6.008897] block drbd0: disk( Attaching -> UpToDate )
Nov 24 10:44:37 crate kernel: [ 6.018084] block drbd0: conn( StandAlone -> Unconnected )
Nov 24 10:44:37 crate kernel: [ 6.018108] block drbd0: Starting receiver thread (from drbd0_worker [1048])
Nov 24 10:44:37 crate kernel: [ 6.020691] block drbd0: receiver (re)started
Nov 24 10:44:37 crate kernel: [ 6.020701] block drbd0: conn( Unconnected -> WFConnection )
Nov 24 10:44:37 crate kernel: [ 6.121932] block drbd0: Handshake successful: Agreed network protocol version 94
Nov 24 10:44:37 crate kernel: [ 6.121948] block drbd0: conn( WFConnection -> WFReportParams )
Nov 24 10:44:37 crate kernel: [ 6.121987] block drbd0: Starting asender thread (from drbd0_receiver [1057])
Nov 24 10:44:37 crate kernel: [ 6.122062] block drbd0: data-integrity-alg: <not-used>
Nov 24 10:44:37 crate kernel: [ 6.122155] block drbd0: max_segment_size ( = BIO size ) = 4096
Nov 24 10:44:37 crate kernel: [ 6.122168] block drbd0: drbd_sync_handshake:
Nov 24 10:44:37 crate kernel: [ 6.122177] block drbd0: self 8E146413568764F4:0000000000000000:E8FF2CDAEC98CDBC:9D3893C50146C439 bits:6144 flags:0
Nov 24 10:44:37 crate kernel: [ 6.122188] block drbd0: peer 33430C3DE89049B3:8E146413568764F4:E8FF2CDAEC98CDBC:9D3893C50146C439 bits:24 flags:0
Nov 24 10:44:37 crate kernel: [ 6.122198] block drbd0: uuid_compare()=-1 by rule 50
Nov 24 10:44:37 crate kernel: [ 6.122207] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Nov 24 10:44:37 crate kernel: [ 6.152120] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Nov 24 10:44:37 crate kernel: [ 6.155392] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Nov 24 10:44:37 crate kernel: [ 6.162690] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Nov 24 10:44:37 crate kernel: [ 6.162701] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Nov 24 10:44:37 crate kernel: [ 6.162709] block drbd0: Began resync as SyncTarget (will sync 24580 KB [6145 bits set]).
After this, corosync starts:
Nov 24 10:44:37 crate corosync[1105]: [MAIN ] Corosync Cluster Engine ('1.2.1'): started and ready to provide service.
and soon establishes synchronisation:
Nov 24 10:44:57 crate corosync[1105]: [MAIN ] Completed service synchronization, ready to provide service.
Soon it starts probing the services, and then this appears:
Nov 24 10:44:57 crate mysql[1268]: ERROR: Config /etc/mysql/my.cnf doesn't exist
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Mysql_monitor_0 (call=2, rc=7, cib-update=7, confirmed=true) not running
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Wmysql3-IP_monitor_0 (call=4, rc=7, cib-update=8, confirmed=true) not running
Nov 24 10:44:57 crate attrd: [1145]: info: attrd_trigger_update: Sending flush op to all hosts for: master-res-Drbd-R0:0 (10000)
Nov 24 10:44:57 crate attrd: [1145]: info: attrd_perform_update: Sent update 15: master-res-Drbd-R0:0=10000
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Drbd-R0:0_monitor_0 (call=5, rc=0, cib-update=9, confirmed=true) ok
Nov 24 10:44:57 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Wmysql3-Fs_monitor_0 (call=3, rc=7, cib-update=10, confirmed=true) not running
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Ping:0:6: probe
Nov 24 10:44:58 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Ping:0_monitor_0 (call=6, rc=7, cib-update=11, confirmed=true) not running
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_perform_update: Sent update 18: probe_complete=true
Nov 24 10:44:58 crate crmd: [1147]: info: do_lrm_rsc_op: Performing key=61:6:0:d7f80eee-bab7-46fc-9cec-cd9f26ae852b op=res-Drbd-R0:0_notify_0 )
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Drbd-R0:0:7: notify
Nov 24 10:44:58 crate crmd: [1147]: info: do_lrm_rsc_op: Performing key=45:6:0:d7f80eee-bab7-46fc-9cec-cd9f26ae852b op=res-Ping:0_start_0 )
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Ping:0:8: start
Nov 24 10:44:58 crate crmd: [1147]: info: process_lrm_event: LRM operation res-Drbd-R0:0_notify_0 (call=7, rc=0, cib-update=12, confirmed=true) ok
Nov 24 10:44:58 crate crmd: [1147]: info: do_lrm_rsc_op: Performing key=16:6:0:d7f80eee-bab7-46fc-9cec-cd9f26ae852b op=res-Drbd-R0:0_stop_0 )
Nov 24 10:44:58 crate lrmd: [1144]: info: rsc:res-Drbd-R0:0:9: stop
Nov 24 10:44:58 crate lrmd: [1144]: info: RA output: (res-Drbd-R0:0:stop:stdout)
Nov 24 10:44:58 crate kernel: [ 27.313216] block drbd0: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Nov 24 10:44:58 crate kernel: [ 27.313254] block drbd0: short read expecting header on sock: r=-512
Nov 24 10:44:58 crate kernel: [ 27.313286] block drbd0: asender terminated
Nov 24 10:44:58 crate kernel: [ 27.313292] block drbd0: Terminating asender thread
Nov 24 10:44:58 crate kernel: [ 27.313603] block drbd0: Connection closed
Nov 24 10:44:58 crate kernel: [ 27.313638] block drbd0: conn( Disconnecting -> StandAlone )
Nov 24 10:44:58 crate kernel: [ 27.313661] block drbd0: receiver terminated
Nov 24 10:44:58 crate kernel: [ 27.313666] block drbd0: Terminating receiver thread
Nov 24 10:44:58 crate kernel: [ 27.313753] block drbd0: disk( UpToDate -> Diskless )
Nov 24 10:44:58 crate kernel: [ 27.313820] block drbd0: drbd_bm_resize called with capacity == 0
Nov 24 10:44:58 crate kernel: [ 27.313892] block drbd0: worker terminated
Nov 24 10:44:58 crate kernel: [ 27.313897] block drbd0: Terminating worker thread
Nov 24 10:44:58 crate kernel: [ 27.322840] block drbd0: State change failed: Disk state is lower than outdated
Nov 24 10:44:58 crate kernel: [ 27.322850] block drbd0: state = { cs:StandAlone ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
Nov 24 10:44:58 crate kernel: [ 27.322857] block drbd0: wanted = { cs:StandAlone ro:Secondary/Unknown ds:Outdated/DUnknown r--- }
Nov 24 10:44:58 crate lrmd: [1144]: info: RA output: (res-Drbd-R0:0:stop:stdout)
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_trigger_update: Sending flush op to all hosts for: master-res-Drbd-R0:0 (-INFINITY)
Nov 24 10:44:58 crate attrd: [1145]: info: attrd_perform_update: Sent update 20: master-res-Drbd-R0:0=-INFINITY
Nov 24 10:44:58 crate lrmd: [1144]: info: RA output: (res-Drbd-R0:0:stop:stdout)
Now, I'd like to know why it is shutting drbd down. If I cat /proc/drbd
before pacemaker gets going, I see this:
# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root at byrd, 2010-11-24 07:59:52
0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
Then, two seconds later:
# cat /proc/drbd
version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by root at byrd, 2010-11-24 07:59:52
0: cs:Unconfigured
Since it is entirely possible that this is caused by a pacemaker
misconfiguration, I include the configuration below.
-- Owen Le Blanc
node crate
node creel
primitive res-Drbd-R0 ocf:linbit:drbd \
operations $id="res-Drbd-R0-operations" \
op monitor interval="20" role="Slave" timeout="20" start-delay="1m" \
op monitor interval="10" role="Master" timeout="20" start-delay="1m" \
params drbd_resource="r0"
primitive res-Mysql ocf:heartbeat:mysql \
meta target-role="Started" \
operations $id="res-Mysql-operations" \
op monitor interval="10" timeout="30" \
params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysql.sock"
primitive res-Ping ocf:pacemaker:ping \
operations $id="res-Ping-operations" \
op monitor interval="10" timeout="60" \
params dampen="10s" host_list="130.88.212.250 130.88.203.250 130.88.200.250" multiplier="100"
primitive res-Wmysql3-Fs ocf:heartbeat:Filesystem \
meta target-role="started" \
operations $id="res-Wmysql3-Fs-operations" \
op monitor interval="20" timeout="40" \
params device="/dev/drbd0" directory="/var/data" fstype="ext3"
primitive res-Wmysql3-IP ocf:heartbeat:IPaddr2 \
meta target-role="started" \
operations $id="res-Wmysql3-IP-operations" \
op monitor interval="10s" timeout="20s" \
params ip="10.99.212.247"
ms ms-Drbd-R0 res-Drbd-R0 \
meta target-role="Master" clone-max="2" globally-unique="false" notify="true" is-managed="true"
clone clo-Ping res-Ping \
meta target-role="started"
colocation col-Fs-Drbd +inf: res-Wmysql3-Fs ms-Drbd-R0:Master
colocation col-Fs-Mysql +inf: res-Mysql res-Wmysql3-Fs
colocation col-Ip-Drbd +inf: ms-Drbd-R0 res-Wmysql3-IP
order ord-Drbd-Fs +inf: ms-Drbd-R0:promote res-Wmysql3-Fs:start
order ord-Fs-Mysql +inf: res-Wmysql3-Fs res-Mysql
order ord-Ip-Drbd +inf: res-Wmysql3-IP ms-Drbd-R0
property $id="cib-bootstrap-options" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
default-resource-stickiness="20" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
default-action-timeout="60s" \
dc-version="1.0.9-unknown"