I am setting up SLES11 SP1 HA on 2 nodes and have configured a
master/slave drbd resource. I can start drbd, promote/demote hosts, and
mount/use the file system from the command line, but pacemaker fails to
properly start up the drbd service. The 2 nodes are named storm (master)
and storm-b (slave). Details of my setup are:

**********
* storm: *
**********
eth0: 172.16.0.1/16 (static)
eth1: 172.20.168.239 (dhcp)
ipmi: 172.16.1.1/16 (static)

************
* storm-b: *
************
eth0: 172.16.0.2/16 (static)
eth1: 172.20.168.114 (dhcp)
ipmi: 172.16.1.2/16 (static)

***********************
* drbd configuration: *
***********************
storm:~ # cat /etc/drbd.conf
#
# please have a a look at the example configuration file in
# /usr/share/doc/packages/drbd-utils/drbd.conf
#
# Note that you can use the YaST2 drbd module to configure this
# service!
#
include "drbd.d/global_common.conf";
include "drbd.d/*.res";

storm:~ # cat /etc/drbd.d/r0.res
resource r0 {
    device      /dev/drbd_r0 minor 0;
    meta-disk   internal;
    on storm {
        disk    /dev/sdc1;
        address 172.16.0.1:7811;
    }
    on storm-b {
        disk    /dev/sde1;
        address 172.16.0.2:7811;
    }
    syncer {
        rate 120M;
    }
}

***********************************
* Output of "crm configure show": *
***********************************
storm:~ # crm configure show
node storm
node storm-b
primitive backupExec-ip ocf:heartbeat:IPaddr \
    params ip="172.16.0.10" cidr_netmask="16" nic="eth0" \
    op monitor interval="30s"
primitive drbd-storage ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="60" role="Master" timeout="60" \
    op start interval="0" timeout="240" \
    op stop interval="0" timeout="100" \
    op monitor interval="61" role="Slave" timeout="60"
primitive drbd-storage-fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/disk1" fstype="ext3"
primitive public-ip ocf:heartbeat:IPaddr \
    meta target-role="started" \
    operations $id="public-ip-operations" \
    op monitor interval="30s" \
    params ip="143.219.41.20" cidr_netmask="24" nic="eth1"
primitive storm-fencing stonith:external/ipmi \
    meta target-role="started" \
    operations $id="storm-fencing-operations" \
    op monitor interval="60" timeout="20" \
    op start interval="0" timeout="20" \
    params hostname="storm" ipaddr="172.16.1.1" userid="****" passwd="****" interface="lan"
ms drbd-storage-masterslave drbd-storage \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" \
        notify="true" globally-unique="false" target-role="started"
location drbd-storage-master-location drbd-storage-masterslave +inf: storm
location storm-fencing-location storm-fencing +inf: storm-b
colocation drbd-storage-fs-together inf: drbd-storage-fs drbd-storage-masterslave:Master
order drbd-storage-fs-startup-order inf: drbd-storage-masterslave:promote drbd-storage-fs:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1277922623" \
    node-health-strategy="only-green" \
    stonith-enabled="true" \
    stonith-action="poweroff"
op_defaults $id="op_defaults-options" \
    record-pending="false"

************************************
* Output of "crm_mon -o" on storm: *
************************************
storm:~ # crm_mon -o
Attempting connection to the cluster...
============
Last updated: Wed Jun 30 15:25:15 2010
Stack: openais
Current DC: storm - partition with quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Online: [ storm storm-b ]

storm-fencing   (stonith:external/ipmi):        Started storm-b
backupExec-ip   (ocf::heartbeat:IPaddr):        Started storm
public-ip       (ocf::heartbeat:IPaddr):        Started storm

Operations:
* Node storm:
   public-ip: migration-threshold=1000000
    + (8) start: rc=0 (ok)
    + (11) monitor: interval=30000ms rc=0 (ok)
   backupExec-ip: migration-threshold=1000000
    + (7) start: rc=0 (ok)
    + (10) monitor: interval=30000ms rc=0 (ok)
   drbd-storage:0: migration-threshold=1000000 fail-count=1000000
    + (9) start: rc=-2 (unknown exec error)
    + (14) stop: rc=0 (ok)
* Node storm-b:
   storm-fencing: migration-threshold=1000000
    + (7) start: rc=0 (ok)
    + (9) monitor: interval=6)

**************************************
* Output of "crm_mon -o" on storm-b: *
**************************************
storm-b:~ # crm_mon -o
Attempting connection to the cluster...
============
Last updated: Wed Jun 30 15:25:25 2010
Stack: openais
Current DC: storm - partition with quorum
Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
2 Nodes configured, 2 expected votes
5 Resources configured.
============

Online: [ storm storm-b ]

storm-fencing   (stonith:external/ipmi):        Started storm-b
backupExec-ip   (ocf::heartbeat:IPaddr):        Started storm
public-ip       (ocf::heartbeat:IPaddr):        Started storm

Operations:
* Node storm:
   public-ip: migration-threshold=1000000
    + (8) start: rc=0 (ok)
    + (11) monitor: interval=30000ms rc=0 (ok)
   backupExec-ip: migration-threshold=1000000
    + (7) start: rc=0 (ok)
    + (10) monitor: interval=30000ms rc=0 (ok)
   drbd-storage:0: migration-threshold=1000000 fail-count=1000000
    + (9) start: rc=-2 (unknown exec error)
    + (14) stop: rc=0 (ok)
* Node storm-b:
   storm-fencing: migration-threshold=1000000
    + (7) start: rc=0 (ok)
    + (9) monitor: interval=60000ms rc=0 (ok)
   drbd-storage:1: migration-threshold=1000000 fail-count=1000000
    + (8) start: rc=-2 (unknown exec error)
    + (12) stop: rc=0 (ok)

Failed actions:
    drbd-storage:0_start_0 (node=storm, call=9, rc=-2, status=Timed Out): unknown exec error
    drbd-storage:1_start_0 (node=storm-b, call=8, rc=-2, status=Timed Out): unknown exec error

********************************************************
* Output of "rcdrbd status" on both storm and storm-b: *
********************************************************
# rcdrbd status
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil at fat-tyre, 2010-01-13 17:17:27
m:res  cs          ro                 ds                 p  mounted  fstype
0:r0   StandAlone  Secondary/Unknown  UpToDate/DUnknown  r----

*********************************
* Part of the drbd log entries: *
*********************************
Jun 30 15:38:10 storm kernel: [ 3730.185457] drbd: initialized. Version: 8.3.7 (api:88/proto:86-91)
Jun 30 15:38:10 storm kernel: [ 3730.185459] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil at fat-tyre, 2010-01-13 17:17:27
Jun 30 15:38:10 storm kernel: [ 3730.185460] drbd: registered as block device major 147
Jun 30 15:38:10 storm kernel: [ 3730.185462] drbd: minor_table @ 0xffff88035fc0ca80
Jun 30 15:38:10 storm kernel: [ 3730.188253] block drbd0: Starting worker thread (from cqueue [9510])
Jun 30 15:38:10 storm kernel: [ 3730.188312] block drbd0: disk( Diskless -> Attaching )
Jun 30 15:38:10 storm kernel: [ 3730.188866] block drbd0: Found 4 transactions (4 active extents) in activity log.
Jun 30 15:38:10 storm kernel: [ 3730.188868] block drbd0: Method to ensure write ordering: barrier
Jun 30 15:38:10 storm kernel: [ 3730.188870] block drbd0: max_segment_size ( = BIO size ) = 32768
Jun 30 15:38:10 storm kernel: [ 3730.188872] block drbd0: drbd_bm_resize called with capacity == 9765216
Jun 30 15:38:10 storm kernel: [ 3730.188907] block drbd0: resync bitmap: bits=1220652 words=19073
Jun 30 15:38:10 storm kernel: [ 3730.188910] block drbd0: size = 4768 MB (4882608 KB)
Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stdout)
Jun 30 15:38:10 storm kernel: [ 3730.189263] block drbd0: recounting of set bits took additional 0 jiffies
Jun 30 15:38:10 storm kernel: [ 3730.189265] block drbd0: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
Jun 30 15:38:10 storm kernel: [ 3730.189269] block drbd0: disk( Attaching -> UpToDate )
Jun 30 15:38:10 storm kernel: [ 3730.191735] block drbd0: conn( StandAlone -> Unconnected )
Jun 30 15:38:10 storm kernel: [ 3730.191748] block drbd0: Starting receiver thread (from drbd0_worker [15487])
Jun 30 15:38:10 storm kernel: [ 3730.191780] block drbd0: receiver (re)started
Jun 30 15:38:10 storm kernel: [ 3730.191785] block drbd0: conn( Unconnected -> WFConnection )
Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) 0: Failure: (124) Device is attached to a disk (use detach first)
Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) Command 'drbdsetup 0 disk /dev/sdc1 /dev/sdc1 internal
Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) --set-defaults --create-device' terminated with exit code 10
Jun 30 15:38:10 storm drbd[15341]: ERROR: r0: Called drbdadm -c /etc/drbd.conf --peer storm-b up r0
Jun 30 15:38:10 storm drbd[15341]: ERROR: r0: Exit code 1
Jun 30 15:38:10 storm drbd[15341]: ERROR: r0: Command output:

I made sure rcdrbd was stopped before starting rcopenais, so the failure
about the device already being attached arises during openais startup.

*************************
* Result of ocf-tester: *
*************************
storm:~ # ocf-tester -n drbd-storage -o drbd_resource="r0" /usr/lib/ocf/resource.d/linbit/drbd
Beginning tests for /usr/lib/ocf/resource.d/linbit/drbd...
* rc=6: Validation failed. Did you supply enough options with -o ?
Aborting tests

The only required parameter according to "crm ra info ocf:linbit:drbd"
is drbd_resource, so there shouldn't be any additional options required
to make ocf-tester work.

I posted this to the pacemaker mailing list, but thought I'd cross-post
because of the ocf-tester failure. Any suggestions for debugging and
solutions would be most appreciated.

Thanks,
Bart
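P.S. One way to narrow down whether the problem is in the agent itself
or in the cluster glue might be to run the RA by hand, outside both
ocf-tester and pacemaker, with the OCF environment variables that lrmd
would normally provide. A rough sketch (the paths and the resource name
r0 are from the setup above; OCF_RESOURCE_INSTANCE is just an arbitrary
label for the logs):

```shell
# Call the linbit drbd resource agent directly with a hand-built OCF
# environment, as lrmd would. Sketch only; assumes root on a cluster node.
export OCF_ROOT=/usr/lib/ocf                  # root of the OCF agent tree
export OCF_RESKEY_drbd_resource=r0            # same parameter pacemaker passes
export OCF_RESOURCE_INSTANCE=drbd-storage:0   # instance name used in log output
/usr/lib/ocf/resource.d/linbit/drbd validate-all
echo "validate-all exit code: $?"             # 0 would mean OCF_SUCCESS
```

If validate-all fails here too, the ocf-tester failure is reproduced
without ocf-tester in the picture; if it succeeds, the drbd_resource
parameter may not be reaching the agent the way ocf-tester passes it.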