Hi everyone,

First, I would like to express my pleasure using DRBD!

Here is my situation:

* Two-node setup, using cman and pacemaker; don't care about quorum, no stonith
* Master-Slave DRBD resource
* Fence resource only

I noticed that under certain conditions (powering nodes on/off enough times) the secondary node may never be promoted when the primary is shut down. Here is a sample log (attached):

Jan 18 08:34:52 NODE-1 crmd: [2054]: info: do_lrm_rsc_op: Performing key=7:89911:0:aac20e27-939f-439c-b461-e668262718b3 op=drbd_fsroot:0_promote_0 )
Jan 18 08:34:52 NODE-1 lrmd: [2051]: info: rsc:drbd_fsroot:0:299768: promote
Jan 18 08:34:52 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jan 18 08:34:52 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1
Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: invoked for fsroot
Jan 18 08:34:53 NODE-1 corosync[1759]: [TOTEM ] Automatically recovered ring 1
Jan 18 08:34:53 NODE-1 crm-fence-peer.sh[24325]: WARNING peer is unreachable, my disk is Consistent: did not place the constraint!
Jan 18 08:34:53 NODE-1 kernel: block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 5 (0x500)
Jan 18 08:34:53 NODE-1 kernel: block drbd0: fence-peer helper returned 5 (peer unreachable, doing nothing since disk != UpToDate)
Jan 18 08:34:53 NODE-1 kernel: block drbd0: State change failed: Need access to UpToDate data
Jan 18 08:34:53 NODE-1 kernel: block drbd0: state = { cs:WFConnection ro:Secondary/Unknown ds:Consistent/DUnknown r--- }
Jan 18 08:34:53 NODE-1 kernel: block drbd0: wanted = { cs:WFConnection ro:Primary/Unknown ds:Consistent/DUnknown r--- }
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stderr) 0: State change failed: (-2) Need access to UpToDate data
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stderr) Command 'drbdsetup 0 primary' terminated with exit code 17
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Called drbdadm -c /etc/drbd.conf primary fsroot
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Exit code 17
Jan 18 08:34:53 NODE-1 drbd[24286]: ERROR: fsroot: Command output:
Jan 18 08:34:53 NODE-1 lrmd: [2051]: info: RA output: (drbd_fsroot:0:promote:stdout)
Jan 18 08:34:53 NODE-1 drbd[24286]: CRIT: Refusing to be promoted to Primary without UpToDate data
Jan 18 08:34:53 NODE-1 lrmd: [2051]: WARN: Managed drbd_fsroot:0:promote process 24286 exited with return code 1.
Jan 18 08:34:53 NODE-1 crmd: [2054]: info: process_lrm_event: LRM operation drbd_fsroot:0_promote_0 (call=299768, rc=1, cib-update=209843, confirmed=true) unknown error
Jan 18 08:34:53 NODE-1 crmd: [2054]: WARN: status_from_rc: Action 7 (drbd_fsroot:0_promote_0) on NODE-1 failed (target: 0 vs. rc: 1): Error
Jan 18 08:34:53 NODE-1 crmd: [2054]: WARN: update_failcount: Updating failcount for drbd_fsroot:0 on NODE-1 after failed promote: rc=1 (update=value++, time=1326893693)
Jan 18 08:34:53 NODE-1 attrd: [2052]: info: attrd_local_callback: Expanded fail-count-drbd_fsroot:0=value++ to 29977
Jan 18 08:34:53 NODE-1 attrd: [2052]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-drbd_fsroot:0 (29977)
Jan 18 08:34:53 NODE-1 crmd: [2054]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=drbd_fsroot:0_last_failure_0, magic=0:1;7:89911:0:aac20e27-939f-439c-b461-e668262718b3, cib=0.6.263577) : Event failed

It seems to me that not promoting/fencing is the worse alternative when the other node really is shut down and no stonith is configured. As a workaround, changing the following line in /usr/lib/drbd/crm-fence-peer.sh solves this:

    ... try_place_constraint()...
    -        unreachable/Consistent/outdated)
    +        unreachable/Consistent/outdated|\
    +        unreachable/Consistent/unknown)

What say you?

I use:

Linux 2.6.32-220.2.1.el6.i686 #1 SMP Thu Dec 22 18:50:52 GMT 2011 i686 i686 i386 GNU/Linux

kmod-drbd83-8.3.8-1.el6.i686
drbd83-8.3.8-1.el6.i686
corosync-1.4.1-4.el6.i686
corosynclib-1.4.1-4.el6.i686
pacemaker-1.1.6-3.el6.i686
pacemaker-libs-1.1.6-3.el6.i686
pacemaker-cluster-libs-1.1.6-3.el6.i686
pacemaker-cli-1.1.6-3.el6.i686
cman-3.0.12.1-23.el6.i686

Best,
Oren

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.1.gz
Type: application/x-gzip-compressed
Size: 77464 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.2.gz
Type: application/x-gzip-compressed
Size: 47690 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment-0001.bin>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fsroot.res
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fsglobal_common.conf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20120119/7371d926/attachment.asc>
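
For reference, the effect of the pattern change proposed in the workaround can be tried in isolation. The following is only a minimal sketch, not the actual code of crm-fence-peer.sh: the function names `matches_original` and `matches_patched` are hypothetical, and the combined state string is assumed to have the same three-part form as the patterns quoted in the diff. It shows only which strings each `case` pattern accepts, not what the script does afterwards.

```shell
#!/bin/sh
# Hypothetical helpers (not part of crm-fence-peer.sh). Each one reports,
# via its exit status, whether a combined state string matches the case
# pattern quoted in the diff from the message.

# Pattern as it stands before the proposed change.
matches_original() {
    case "$1" in
        unreachable/Consistent/outdated) return 0 ;;
        *) return 1 ;;
    esac
}

# Pattern after the proposed change: one extra alternative is accepted.
matches_patched() {
    case "$1" in
        unreachable/Consistent/outdated|\
        unreachable/Consistent/unknown) return 0 ;;
        *) return 1 ;;
    esac
}

# Compare both variants on the two state strings from the diff.
for state in unreachable/Consistent/outdated unreachable/Consistent/unknown; do
    if matches_original "$state"; then orig="match"; else orig="no match"; fi
    if matches_patched "$state"; then new="match"; else new="no match"; fi
    echo "$state: original=$orig patched=$new"
done
```

Only the `unreachable/Consistent/unknown` string is treated differently by the two variants; every other string falls through to the default branch in both. Whether accepting that state is safe without stonith is the open question raised in the message.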