On Wed, Jun 30, 2010 at 03:49:22PM -0500, Bart Willems wrote:
> I am setting up SLES11 SP1 HA on 2 nodes and have configured a master/slave
> drbd resource. I can start drbd, promote/demote hosts, and mount/use the file
> system from the command line, but pacemaker fails to properly start up the
> drbd service. The 2 nodes are named storm (master) and storm-b (slave).
>
> Details of my setup are:
>
> **********
> * storm: *
> **********
>
> eth0: 172.16.0.1/16 (static)
> eth1: 172.20.168.239 (dhcp)
> ipmi: 172.16.1.1/16 (static)
>
> ************
> * storm-b: *
> ************
>
> eth0: 172.16.0.2/16 (static)
> eth1: 172.20.168.114 (dhcp)
> ipmi: 172.16.1.2/16 (static)
>
> ***********************
> * drbd configuration: *
> ***********************
>
> storm:~ # cat /etc/drbd.conf
> #
> # please have a look at the example configuration file in
> # /usr/share/doc/packages/drbd-utils/drbd.conf
> #
> # Note that you can use the YaST2 drbd module to configure this
> # service!
> #
> include "drbd.d/global_common.conf";
> include "drbd.d/*.res";
>
> storm:~ # cat /etc/drbd.d/r0.res
> resource r0 {
>     device /dev/drbd_r0 minor 0;
>     meta-disk internal;
>     on storm {
>         disk /dev/sdc1;
>         address 172.16.0.1:7811;
>     }
>     on storm-b {
>         disk /dev/sde1;
>         address 172.16.0.2:7811;
>     }
>     syncer {
>         rate 120M;
>     }
> }
>
> ***********************************
> * Output of "crm configure show": *
> ***********************************
>
> storm:~ # crm configure show
> node storm
> node storm-b
> primitive backupExec-ip ocf:heartbeat:IPaddr \
>     params ip="172.16.0.10" cidr_netmask="16" nic="eth0" \
>     op monitor interval="30s"
> primitive drbd-storage ocf:linbit:drbd \
>     params drbd_resource="r0" \
>     op monitor interval="60" role="Master" timeout="60" \
>     op start interval="0" timeout="240" \
>     op stop interval="0" timeout="100" \
>     op monitor interval="61" role="Slave" timeout="60"
> primitive drbd-storage-fs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/disk1"
>     fstype="ext3"
> primitive public-ip ocf:heartbeat:IPaddr \
>     meta target-role="started" \
>     operations $id="public-ip-operations" \
>     op monitor interval="30s" \
>     params ip="143.219.41.20" cidr_netmask="24" nic="eth1"
> primitive storm-fencing stonith:external/ipmi \
>     meta target-role="started" \
>     operations $id="storm-fencing-operations" \
>     op monitor interval="60" timeout="20" \
>     op start interval="0" timeout="20" \
>     params hostname="storm" ipaddr="172.16.1.1" userid="****"
>     passwd="****" interface="lan"
> ms drbd-storage-masterslave drbd-storage \
>     meta master-max="1" master-node-max="1" clone-max="2"
>     clone-node-max="1" notify="true" globally-unique="false"
>     target-role="started"
> location drbd-storage-master-location drbd-storage-masterslave +inf: storm
> location storm-fencing-location storm-fencing +inf: storm-b
> colocation drbd-storage-fs-together inf: drbd-storage-fs
>     drbd-storage-masterslave:Master
> order drbd-storage-fs-startup-order inf: drbd-storage-masterslave:promote
>     drbd-storage-fs:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1277922623" \
>     node-health-strategy="only-green" \
>     stonith-enabled="true" \
>     stonith-action="poweroff"
> op_defaults $id="op_defaults-options" \
>     record-pending="false"
>
> ************************************
> * Output of "crm_mon -o" on storm: *
> ************************************
>
> storm:~ # crm_mon -o
> Attempting connection to the cluster...
> ============
> Last updated: Wed Jun 30 15:25:15 2010
> Stack: openais
> Current DC: storm - partition with quorum
> Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
> 2 Nodes configured, 2 expected votes
> 5 Resources configured.
> ============
>
> Online: [ storm storm-b ]
>
> storm-fencing   (stonith:external/ipmi):  Started storm-b
> backupExec-ip   (ocf::heartbeat:IPaddr):  Started storm
> public-ip       (ocf::heartbeat:IPaddr):  Started storm
>
> Operations:
> * Node storm:
>    public-ip: migration-threshold=1000000
>     + (8) start: rc=0 (ok)
>     + (11) monitor: interval=30000ms rc=0 (ok)
>    backupExec-ip: migration-threshold=1000000
>     + (7) start: rc=0 (ok)
>     + (10) monitor: interval=30000ms rc=0 (ok)
>    drbd-storage:0: migration-threshold=1000000 fail-count=1000000
>     + (9) start: rc=-2 (unknown exec error)
>     + (14) stop: rc=0 (ok)
> * Node storm-b:
>    storm-fencing: migration-threshold=1000000
>     + (7) start: rc=0 (ok)
>     + (9) monitor: interval=6)
>
> **************************************
> * Output of "crm_mon -o" on storm-b: *
> **************************************
>
> storm-b:~ # crm_mon -o
> Attempting connection to the cluster...
> ============
> Last updated: Wed Jun 30 15:25:25 2010
> Stack: openais
> Current DC: storm - partition with quorum
> Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
> 2 Nodes configured, 2 expected votes
> 5 Resources configured.
> ============
>
> Online: [ storm storm-b ]
>
> storm-fencing   (stonith:external/ipmi):  Started storm-b
> backupExec-ip   (ocf::heartbeat:IPaddr):  Started storm
> public-ip       (ocf::heartbeat:IPaddr):  Started storm
>
> Operations:
> * Node storm:
>    public-ip: migration-threshold=1000000
>     + (8) start: rc=0 (ok)
>     + (11) monitor: interval=30000ms rc=0 (ok)
>    backupExec-ip: migration-threshold=1000000
>     + (7) start: rc=0 (ok)
>     + (10) monitor: interval=30000ms rc=0 (ok)
>    drbd-storage:0: migration-threshold=1000000 fail-count=1000000
>     + (9) start: rc=-2 (unknown exec error)
>     + (14) stop: rc=0 (ok)
> * Node storm-b:
>    storm-fencing: migration-threshold=1000000
>     + (7) start: rc=0 (ok)
>     + (9) monitor: interval=60000ms rc=0 (ok)
>    drbd-storage:1: migration-threshold=1000000 fail-count=1000000
>     + (8) start: rc=-2 (unknown exec error)
>     + (12) stop: rc=0 (ok)
>
> Failed actions:
>     drbd-storage:0_start_0 (node=storm, call=9, rc=-2, status=Timed Out):
>         unknown exec error

status Timed Out ??

>     drbd-storage:1_start_0 (node=storm-b, call=8, rc=-2, status=Timed Out):
>         unknown exec error
>
> ********************************************************
> * Output of "rcdrbd status" on both storm and storm-b: *
> ********************************************************
>
> # rcdrbd status
> drbd driver loaded OK; device status:
> version: 8.3.7 (api:88/proto:86-91)
> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil at fat-tyre,
> 2010-01-13 17:17:27
> m:res  cs          ro                 ds                 p  mounted  fstype
> 0:r0   StandAlone  Secondary/Unknown  UpToDate/DUnknown  r----
>
> *********************************
> * Part of the drbd log entries: *
> *********************************

PLEASE avoid line wraps in pasted log files.
If you cannot, attach unwrapped as text/plain (or gzip).

> Jun 30 15:38:10 storm kernel: [ 3730.185457] drbd: initialized.
> Version: 8.3.7 (api:88/proto:86-91)
> Jun 30 15:38:10 storm kernel: [ 3730.185459] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil at fat-tyre, 2010-01-13 17:17:27
> Jun 30 15:38:10 storm kernel: [ 3730.185460] drbd: registered as block device major 147
> Jun 30 15:38:10 storm kernel: [ 3730.185462] drbd: minor_table @ 0xffff88035fc0ca80
> Jun 30 15:38:10 storm kernel: [ 3730.188253] block drbd0: Starting worker thread (from cqueue [9510])
> Jun 30 15:38:10 storm kernel: [ 3730.188312] block drbd0: disk( Diskless -> Attaching )
> Jun 30 15:38:10 storm kernel: [ 3730.188866] block drbd0: Found 4 transactions (4 active extents) in activity log.
> Jun 30 15:38:10 storm kernel: [ 3730.188868] block drbd0: Method to ensure write ordering: barrier
> Jun 30 15:38:10 storm kernel: [ 3730.188870] block drbd0: max_segment_size ( = BIO size ) = 32768
> Jun 30 15:38:10 storm kernel: [ 3730.188872] block drbd0: drbd_bm_resize called with capacity == 9765216
> Jun 30 15:38:10 storm kernel: [ 3730.188907] block drbd0: resync bitmap: bits=1220652 words=19073
> Jun 30 15:38:10 storm kernel: [ 3730.188910] block drbd0: size = 4768 MB (4882608 KB)
> Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stdout)
> Jun 30 15:38:10 storm kernel: [ 3730.189263] block drbd0: recounting of set bits took additional 0 jiffies
> Jun 30 15:38:10 storm kernel: [ 3730.189265] block drbd0: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
> Jun 30 15:38:10 storm kernel: [ 3730.189269] block drbd0: disk( Attaching -> UpToDate )
> Jun 30 15:38:10 storm kernel: [ 3730.191735] block drbd0: conn( StandAlone -> Unconnected )
> Jun 30 15:38:10 storm kernel: [ 3730.191748] block drbd0: Starting receiver thread (from drbd0_worker [15487])
> Jun 30 15:38:10 storm kernel: [ 3730.191780] block drbd0: receiver (re)started
> Jun 30 15:38:10 storm kernel: [ 3730.191785] block drbd0: conn( Unconnected -> WFConnection )
> Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) 0: Failure: (124) Device is attached to a disk (use detach first)
> Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) Command 'drbdsetup 0 disk /dev/sdc1 /dev/sdc1 internal
> Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) --set-defaults --create-device' terminated with exit code 10
> Jun 30 15:38:10 storm drbd[15341]: ERROR: r0: Called drbdadm -c /etc/drbd.conf --peer storm-b up r0
>
> I made sure rcdrbd was stopped before starting rcopenais, so the failure
> related to the device being attached arises during openais startup.

Well. Either that log snippet is not from one of those tries, or you
apparently somehow still have concurrent scripts trying to up this
simultaneously. Don't do that.

> *************************
> * Result of ocf-tester: *
> *************************
>
> storm:~ # ocf-tester -n drbd-storage -o drbd_resource="r0" \
>     /usr/lib/ocf/resource.d/linbit/drbd
> Beginning tests for /usr/lib/ocf/resource.d/linbit/drbd...
> * rc=6: Validation failed. Did you supply enough options with -o ?
> Aborting tests
>
> The only required parameter according to "crm ra info ocf:linbit:drbd" is
> drbd_resource, so there shouldn't be any additional options required to make
> ocf-tester work.
>
> I posted this to the pacemaker mailing list, but thought I'd cross-post
> because of the ocf-tester failure.
> Any suggestions for debugging and
> solutions would be most appreciated.

Rest assured that the drbd.ocf agent is OCF clean.
It is the reference implementation of "master-slave" resources ;-)

It just so happens that it validates more than just its parameters;
it also validates various meta parameters and other configuration details.

Something like this:

    OCF_RESKEY_CRM_meta_notify_start_uname="" \
    OCF_RESKEY_CRM_meta_clone_max=2 \
    ocf-tester -n drbd-storage -o drbd_resource="r0" \
        /usr/lib/ocf/resource.d/linbit/drbd

should get you going there. "Works for me".

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
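For readers hitting the same "Device is attached to a disk (use detach first)" failure: it usually means something outside Pacemaker (typically the drbd init script) already brought the resource up. A minimal pre-flight check before starting openais might look like the sketch below; the helper name `drbd_minor_up` and its testing-only second argument are illustrative, not part of any DRBD tooling, and only `/proc/drbd` itself is a standard interface.

```shell
#!/bin/sh
# Sketch: check whether a DRBD minor is still configured by inspecting the
# /proc/drbd status file, whose per-device lines start with " <minor>:".
# The second argument lets you point at a copy of the file for testing.
drbd_minor_up() {
    statusfile="${2:-/proc/drbd}"
    [ -e "$statusfile" ] && grep -q "^ *$1:" "$statusfile"
}

if drbd_minor_up 0; then
    echo "drbd minor 0 is still up -- run 'rcdrbd stop' first"
else
    echo "drbd is down; safe to start openais"
fi
```

On SLES it also helps to make sure the drbd init script is not run at boot (e.g. `chkconfig drbd off`), so that only the cluster manager ever calls `drbdadm up`.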