Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Oct 25, 2012 at 09:30:30PM +0100, Matt Willsher wrote:
> Hi there,
>
> I've been following Digimer's guide at
> https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial with a couple
> of differences - I'm using DRBD 8.4.2-1 from ELrepo and Digimer's own
> rhcs_fence script, currently 2.5.0. My configuration (full output of
> drbdadm dump is below) has three resources, two with two volumes

Which is the problem, I guess.

The get_local_resource_state() function in the rhcs_fence script assumes
exactly one minor. But for multi-volume resources, a list of minor
numbers is passed in. So grepping for /^ *"1 2 3 4 5":/ in /proc/drbd
fails, the script concludes it is not even UpToDate itself, and then
refuses to do anything else (a rough sketch of the per-minor check
follows further down).

> each, one with just the resource itself. There are three CLVM VGs, two
> with volumes for the two resources, one with the single resource. The
> latter has a GFS2 filesystem on it. Everything works as expected, with
> the exception of fencing.
>
> If I freeze a node with an echo c > /proc/sysrq-trigger, the fencing
> routines kick in. When this first happens I get:
>
> # cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag at Build64R6, 2012-09-06 08:16:10
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:64 nr:0 dw:64 dr:11980 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> After a while, when rhcs_fence returns after being called against r2 I get:
>
> # cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag at Build64R6, 2012-09-06 08:16:10
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
>     ns:64 nr:0 dw:64 dr:11988 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> The other node gets fenced and reboots. If I don't run drbdadm
> resume-io all before drbd comes up on the other node things get hairy;
> if I reload drbd on the node that stayed up it generally sorts out any
> problems. If I do run drbdadm resume-io all, when drbd is started on
> the previously fenced node it starts without problem.
>
> r2 continues to be responsive during the time the other node takes to
> reboot, but r0 and r1 members (0, 1, 3, 4 in the output above) don't
> respond and LVM queries hang until the other node comes back.
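To spell out the per-minor check I mean: the device lines in the status
output above are keyed by minor number, one line per volume, so the
"am I UpToDate" test has to be done for each minor separately rather than
by grepping for the whole list at once (rhcs_fence logs the list as
[1 4] for r1 further down). A rough, untested sketch, in Python rather
than Perl and with made-up helper names, not code taken from rhcs_fence:

    #!/usr/bin/env python
    # Sketch only: check every minor of a multi-volume resource in /proc/drbd.
    import re

    def local_disk_states(minors, proc="/proc/drbd"):
        """Return {minor: local disk state} for the given minor numbers."""
        states = {}
        # Device lines look like:
        #  1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
        pattern = re.compile(r'^\s*(\d+):.*\bds:([A-Za-z]+)/')
        with open(proc) as f:
            for line in f:
                m = pattern.match(line)
                if m and int(m.group(1)) in minors:
                    states[int(m.group(1))] = m.group(2)
        return states

    def all_up_to_date(minor_list):
        """minor_list is the space separated list the handler is given, e.g. '1 4'."""
        minors = set(int(x) for x in minor_list.split())
        states = local_disk_states(minors)
        # Refuse to fence if any volume is missing or not UpToDate locally.
        return len(states) == len(minors) and all(s == "UpToDate" for s in states.values())

    if __name__ == "__main__":
        print(all_up_to_date("1 4"))  # the list rhcs_fence logs for r1

Splitting on whitespace and checking each minor like that should be
enough for the Perl script as well.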
> What I see in the logs for r2 looks fine, and I get:
>
> Oct 25 20:38:26 node2 rhcs_fence: 266; DEBUG: Attempt to fence node: [node1] exited with: [0]
> Oct 25 20:38:26 node2 rhcs_fence: 276; Fencing of: [node1] succeeded!
>
> at the end, after the delay.
>
> In the meantime attempts are made to fence r0 and r1 and I get:
>
> Oct 25 20:38:09 node2 rhcs_fence: 438; Local resource: [r1], minor: [1 4] is NOT 'UpToDate', will not fence peer.
> Oct 25 20:38:09 node2 kernel: d-con r1: helper command: /sbin/drbdadm fence-peer r1 exit code 1 (0x100)
> Oct 25 20:38:09 node2 kernel: d-con r1: fence-peer helper broken, returned 1
>
> I'm at the point where I'm not sure how to handle this safely, so any
> help is greatly appreciated.
>
> Matt
>
>
> # drbdadm dump
> # /etc/drbd.conf
> common {
>     net {
>         protocol C;
>         allow-two-primaries yes;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>     }
>     disk {
>         fencing resource-and-stonith;
>     }
>     startup {
>         wfc-timeout 300;
>         degr-wfc-timeout 120;
>         become-primary-on both;
>     }
>     handlers {
>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>         fence-peer /usr/local/lib/drbd/rhcs_fence;
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>         out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>     }
> }
>
> # resource r0 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r0.res:1
> resource r0 {
>     on node1 {
>         volume 0 {
>             device /dev/drbd0 minor 0;
>             disk /dev/sda4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd3 minor 3;
>             disk /dev/sdc2;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7788;
>     }
>     on node2 {
>         volume 0 {
>             device /dev/drbd0 minor 0;
>             disk /dev/sda4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd3 minor 3;
>             disk /dev/sdc2;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7788;
>     }
> }
>
> # resource r1 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r1.res:1
> resource r1 {
>     on node1 {
>         volume 0 {
>             device /dev/drbd1 minor 1;
>             disk /dev/sdb4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd4 minor 4;
>             disk /dev/sdc3;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7789;
>     }
>     on node2 {
>         volume 0 {
>             device /dev/drbd1 minor 1;
>             disk /dev/sdb4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd4 minor 4;
>             disk /dev/sdc3;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7789;
>     }
> }
>
> # resource r2 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r2.res:1
> resource r2 {
>     on node1 {
>         device /dev/drbd2 minor 2;
>         disk /dev/sdc1;
>         meta-disk internal;
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7790;
>     }
>     on node2 {
>         device /dev/drbd2 minor 2;
>         disk /dev/sdc1;
>         meta-disk internal;
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7790;
>     }
> }
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
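As to the "fence-peer helper broken, returned 1" kernel messages: DRBD
only accepts a small set of exit codes from the fence-peer handler, and
anything else is treated as a broken helper, so the IO of that resource
stays frozen. That is consistent with what you see above: r2, whose
fence call succeeded, is back to r----- and shows the peer as Outdated,
while r0 and r1 are still stuck at s---d-. From memory the convention is
roughly the table below; treat it as a reminder to check
conn_try_outdate_peer() in drbd_nl.c or the 8.4 user guide, not as code
from rhcs_fence:

    # Rough reference for how DRBD 8.4 interprets fence-peer handler exit
    # codes (from memory -- verify against drbd_nl.c / the user guide).
    FENCE_PEER_EXIT_CODES = {
        3: "peer's disk is already Inconsistent (or worse)",
        4: "peer was successfully outdated (or already was Outdated)",
        5: "peer unreachable, assumed dead; only honoured if the local disk is UpToDate",
        6: "peer is Primary; the local node outdates itself instead",
        7: "peer was fenced (STONITH); meant for 'fencing resource-and-stonith'",
    }

    def describe(exit_code):
        # Anything outside the table -- such as the 1 returned for r0/r1 above --
        # makes the kernel log "fence-peer helper broken" and leaves IO suspended.
        return FENCE_PEER_EXIT_CODES.get(exit_code, "helper considered broken")

    print(describe(7))
    print(describe(1))

Once the minor-list handling is fixed, the r0/r1 handler invocations
should get as far as the actual fence call and return an exit code DRBD
accepts (presumably 7 after a successful STONITH), at which point DRBD
outdates the peer and resumes IO on those volumes by itself.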
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed