[DRBD-user] Promote fails in state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- }

Thu Jan 19 12:52:03 CET 2012

Hi everyone,
First, I would like to say how pleased I am with DRBD!
Here is my situation:

- Two-node setup using cman and pacemaker; quorum is ignored and no stonith is configured
- Master-Slave DRBD resource
- DRBD fencing: resource-only

I noticed that under certain conditions (after powering the nodes on and off enough times) the secondary node is never promoted when the primary is shut down.
Here is a sample log (the full logs are attached):

Jan 18 08:34:52 NODE-1 crmd: [2054]: info: do_lrm_rsc_op: Performing key=7:89911:0:aac20e27-939f-439c-b461-e668262718b3 op=drbd_fsroot:0_promote_0 )
Jan 18 08:34:52 NODE-1 lrmd: [2051]: info: rsc:drbd_fsroot:0:299768: promote
Jan 18 08:34:52 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jan 18 08:34:52 NODE-1 corosync[1759]:   [TOTEM ] Automatically recovered ring 1
Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: invoked for fsroot
Jan 18 08:34:53 NODE-1 corosync[1759]:   [TOTEM ] Automatically recovered ring 1
Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: WARNING peer is unreachable, my disk is Consistent: did not place the constraint!
Jan 18 08:34:53 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 5 (0x500)
Jan 18 08:34:53 NODE-1 kernel: block drbd0: fence-peer helper returned 5 (peer unreachable, doing nothing since disk != UpToDate)
Jan 18 08:34:53 NODE-1 kernel: block drbd0: State change failed: Need access to UpToDate data
Jan 18 08:34:53 NODE-1 kernel: block drbd0:   state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- }
Jan 18 08:34:53 NODE-1 kernel: block drbd0:  wanted = { cs:WFConnection ro:Primary/Unknown ds:Consistent/DUnknown r--- }
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stderr) 0: State change failed: (-2) Need access to UpToDate data
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stderr) Command 'drbdsetup 0 primary' terminated with exit code 17
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Called drbdadm -c /etc/drbd.conf primary fsroot
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Exit code 17
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Command output:
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stdout)
Jan 18 08:34:53 NODE-1 drbd[24286]: CRIT: Refusing to be promoted to Primary without UpToDate data
Jan 18 08:34:53 NODE-1 lrmd: [2051]: WARN: Managed drbd_fsroot:0:promote process 24286 exited with return code 1.
Jan 18 08:34:53 NODE-1 crmd: [2054]: info: process_lrm_event: LRM operation drbd_fsroot:0_promote_0 (call=299768, rc=1, cib-update=209843, confirmed=true) unknown error
Jan 18 08:34:53 NODE-1 crmd: [2054]: WARN: status_from_rc: Action 7 (drbd_fsroot:0_promote_0) on NODE-1 failed (target: 0 vs. rc: 1): Error
Jan 18 08:34:53 NODE-1 crmd: [2054]: WARN: update_failcount: Updating failcount for drbd_fsroot:0 on NODE-1 after failed promote: rc=1 (update=value++, time=1326893693)
Jan 18 08:34:53 NODE-1 attrd: [2052]: info: attrd_local_callback: Expanded fail-count-drbd_fsroot:0=value++ to 29977
Jan 18 08:34:53 NODE-1 attrd: [2052]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-drbd_fsroot:0 (29977)
Jan 18 08:34:53 NODE-1 crmd: [2054]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=drbd_fsroot:0_last_failure_0, magic=0:1;7:89911:0:aac20e27-939f-439c-b461-e668262718b3, cib=0.6.263577) : Event failed

It seems to me that not promoting/fencing is the worse alternative when the other node really is shut down and no stonith is configured.
As a workaround, changing the following line in /usr/lib/drbd/crm-fence-peer.sh solves this:
... 
try_place_constraint()...
- unreachable/Consistent/outdated)
+ unreachable/Consistent/outdated|\
+ unreachable/Consistent/unknown)
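
To illustrate what the extra pattern changes, here is a minimal standalone sketch of the match. It is not the real crm-fence-peer.sh code: the function name would_place_constraint and its three arguments are hypothetical, and it assumes the script matches on a string of the form reachability/local-disk-state/assumed-peer-state, as the patterns in the diff suggest.

```shell
#!/bin/sh
# Hypothetical sketch only, not the actual crm-fence-peer.sh logic.
# $1 = peer reachability, $2 = local disk state,
# $3 = what is assumed about an unreachable peer (all names assumed).
would_place_constraint() {
    case "$1/$2/$3" in
    unreachable/Consistent/outdated|\
    unreachable/Consistent/unknown)
        # With the patch, an unknown peer state is treated like "outdated".
        echo yes ;;
    *)
        echo no ;;
    esac
}

would_place_constraint unreachable Consistent unknown
would_place_constraint unreachable Inconsistent unknown
```

In this sketch the patched case arm accepts the combination seen in the log above (peer unreachable, local disk Consistent, peer state unknown), while any other combination still falls through to the default arm.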
What say you?

I use:
Linux 2.6.32-220.2.1.el6.i686 #1 SMP Thu Dec 22 18:50:52 GMT 2011 i686 i686 i386 GNU/Linux
kmod-drbd83-8.3.8-1.el6.i686
drbd83-8.3.8-1.el6.i686
corosync-1.4.1-4.el6.i686
corosynclib-1.4.1-4.el6.i686
pacemaker-1.1.6-3.el6.i686
pacemaker-libs-1.1.6-3.el6.i686
pacemaker-cluster-libs-1.1.6-3.el6.i686
pacemaker-cli-1.1.6-3.el6.i686
cman-3.0.12.1-23.el6.i686

Best,
Oren
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.1.gz
Type: application/x-gzip-compressed
Size: 77464 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.2.gz
Type: application/x-gzip-compressed
Size: 47690 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment-0001.bin>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fsroot.res
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fsglobal_common.conf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment.asc>