Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Oct 25, 2012 at 09:30:30PM +0100, Matt Willsher wrote:
> Hi there,
>
> I've been following Digimer's guide at
> https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial with a couple
> of differences - I'm using DRBD 8.4.2-1 from ELrepo and Digimer's own
> rhcs_fence script, currently 2.5.0. My configuration (full output
> of drbdadm dump is below) has three resources, two with two volumes
Which is the problem, I guess.
The get_local_resource_state() function in the rhcs_fence script
assumes exactly one minor.
But for multi-volume resources, a list of minor numbers is passed in.
So grepping for /^ *"1 2 3 4 5":/ in /proc/drbd fails,
the script concludes that it is not even UpToDate itself,
and then refuses to do anything else.
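Roughly, the state check would have to be done once per minor, not once
per resource. A minimal sketch of the idea only (plain shell, not the
actual rhcs_fence code; minors 1 and 4 taken from your r1):

    all_up_to_date=yes
    for minor in 1 4; do
        # each volume has its own line in /proc/drbd, keyed by its minor
        grep -qE "^ *${minor}: .*ds:UpToDate" /proc/drbd ||
            { echo "minor ${minor} is not UpToDate"; all_up_to_date=no; }
    done
    # fence the peer only if every minor of the resource is UpToDate
    [ "$all_up_to_date" = yes ] && echo "ok to fence peer"

With a per-minor check like that, your r1 (minors 1 and 4) would be seen
as UpToDate, and the fence attempt would go ahead instead of bailing out.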
> each, and one with just the resource itself. There are three CLVM VGs:
> two with volumes for the two-volume resources, and one on the single
> resource. The latter has a GFS2 filesystem on it. Everything works as
> expected, with the exception of fencing.
>
> If I freeze a node with an echo c > /proc/sysrq-trigger, the fencing
> routines kick in. When this first happens I get:
>
> # cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R6, 2012-09-06 08:16:10
> 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:64 nr:0 dw:64 dr:11980 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> After a while, when rhcs_fence returns after being called against r2, I get:
>
> # cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R6, 2012-09-06 08:16:10
> 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
> ns:64 nr:0 dw:64 dr:11988 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
> 4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
> ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> The other node gets fenced and reboots. If I don't run drbdadm
> resume-io all before drbd comes up on the other node, things get a
> bit hairy - reloading drbd on the node that stayed up generally
> sorts out any problems. If I do run drbdadm resume-io all, drbd
> starts on the previously fenced node without problems.
>
> r2 continues to be responsive during the time the other node takes to
> reboot, but the r0 and r1 volumes (minors 0, 1, 3 and 4 in the output
> above) don't respond, and LVM queries hang until the other node comes back.
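That fits the flags in your /proc/drbd output: with fencing
resource-and-stonith, DRBD freezes I/O on a resource until the
fence-peer handler reports success, and the leading 's' in s---d- marks
that suspended state. r0 and r1, where rhcs_fence bails out, stay
frozen; r2, where the fence succeeded, goes back to r----- and keeps
serving I/O. Once you are sure the peer really is down or fenced, you
can unfreeze a single resource by hand, e.g.

    drbdadm resume-io r0

which is just the per-resource form of the resume-io all you already use.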
>
> From what I see in the logs, r2 looks fine and I get:
> Oct 25 20:38:26 node2 rhcs_fence: 266; DEBUG: Attempt to fence node: [node1] exited with: [0]
> Oct 25 20:38:26 node2 rhcs_fence: 276; Fencing of: [node1] succeeded!
>
> at the end, after the delay.
>
> In the meantime, attempts are made to fence r0 and r1, and I get:
> Oct 25 20:38:09 node2 rhcs_fence: 438; Local resource: [r1], minor: [1 4] is NOT 'UpToDate', will not fence peer.
> Oct 25 20:38:09 node2 kernel: d-con r1: helper command: /sbin/drbdadm fence-peer r1 exit code 1 (0x100)
> Oct 25 20:38:09 node2 kernel: d-con r1: fence-peer helper broken, returned 1
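Side note on the "returned 1": DRBD only accepts the documented
fence-peer exit codes (3: peer already Inconsistent, 4: peer Outdated,
5: peer unreachable, 6: peer is Primary and refused, 7: peer was
stonithed); anything else is logged as a broken helper and the resource
stays frozen. If you want to test a patched script without crashing a
node, you can invoke the handler by hand - a sketch only, assuming
rhcs_fence takes the resource and minor list from the DRBD_RESOURCE and
DRBD_MINOR environment variables, which is what its log output suggests:

    DRBD_RESOURCE=r1 DRBD_MINOR="1 4" /usr/local/lib/drbd/rhcs_fence
    echo "handler exit code: $?"

Careful: if the script works, it really will try to fence the peer.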
>
> I'm at the point where I'm not sure how to handle this safely, so any
> help is greatly appreciated.
>
> Matt
>
>
>
> # drbdadm dump
> # /etc/drbd.conf
> common {
>     net {
>         protocol C;
>         allow-two-primaries yes;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>     }
>     disk {
>         fencing resource-and-stonith;
>     }
>     startup {
>         wfc-timeout 300;
>         degr-wfc-timeout 120;
>         become-primary-on both;
>     }
>     handlers {
>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>         fence-peer /usr/local/lib/drbd/rhcs_fence;
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>         out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>     }
> }
>
> # resource r0 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r0.res:1
> resource r0 {
>     on node1 {
>         volume 0 {
>             device /dev/drbd0 minor 0;
>             disk /dev/sda4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd3 minor 3;
>             disk /dev/sdc2;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7788;
>     }
>     on node2 {
>         volume 0 {
>             device /dev/drbd0 minor 0;
>             disk /dev/sda4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd3 minor 3;
>             disk /dev/sdc2;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7788;
>     }
> }
>
> # resource r1 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r1.res:1
> resource r1 {
>     on node1 {
>         volume 0 {
>             device /dev/drbd1 minor 1;
>             disk /dev/sdb4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd4 minor 4;
>             disk /dev/sdc3;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7789;
>     }
>     on node2 {
>         volume 0 {
>             device /dev/drbd1 minor 1;
>             disk /dev/sdb4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd4 minor 4;
>             disk /dev/sdc3;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7789;
>     }
> }
>
> # resource r2 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r2.res:1
> resource r2 {
>     on node1 {
>         device /dev/drbd2 minor 2;
>         disk /dev/sdc1;
>         meta-disk internal;
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7790;
>     }
>     on node2 {
>         device /dev/drbd2 minor 2;
>         disk /dev/sdc1;
>         meta-disk internal;
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7790;
>     }
> }
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed