Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Oct 25, 2012 at 09:30:30PM +0100, Matt Willsher wrote:
> Hi there,
>
> I've been following Digimer's guide at
> https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial with a couple
> of differences - I'm using DRBD 8.4.2-1 from ELrepo and Digimer's own
> rhcs_fence script, currently 2.5.0. My configuration (full output of
> drbdadm dump is below) has three resources, two with two volumes

Which is the problem, I guess.

The get_local_resource_state() function in the rhcs_fence script assumes
exactly one minor. But for multi-volume resources, a list of minor
numbers is passed in. So grepping for /^ *"1 2 3 4 5":/ in /proc/drbd
fails, the script concludes it is not even UpToDate itself, and then
refuses to do anything else (a rough sketch of the per-minor check
follows further down).

> each, one with just the resource itself. There are three CLVM VGs, two
> with volumes for the two resources, one with the single resource. The
> latter has a GFS2 filesystem on it. Everything works as expected, with
> the exception of fencing.
>
> If I freeze a node with an echo c > /proc/sysrq-trigger, the fencing
> routines kick in. When this first happens I get:
>
> # cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag at Build64R6, 2012-09-06 08:16:10
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:64 nr:0 dw:64 dr:11980 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> After a while, when rhcs_fence returns after being called against r2 I get:
>
> # cat /proc/drbd
> version: 8.4.2 (api:1/proto:86-101)
> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag at Build64R6, 2012-09-06 08:16:10
>  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
>     ns:64 nr:0 dw:64 dr:11988 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>  4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
>     ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> The other node gets fenced and reboots. If I don't run drbdadm
> resume-io all before drbd comes up on the other node things get hairy;
> if I reload drbd on the node that stayed up it generally sorts out any
> problems. If I do run drbdadm resume-io all, when drbd is started on
> the previously fenced node it starts without problem.
>
> r2 continues to be responsive during the time the other node takes to
> reboot, but r0 and r1 members (0, 1, 3, 4 in the output above) don't
> respond and LVM queries hang until the other node comes back.
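To spell out the per-minor check I mean: the device lines in the status
output above are keyed by minor number, one line per volume, so the
"am I UpToDate" test has to be done for each minor separately rather than
by grepping for the whole list at once (rhcs_fence logs the list as
[1 4] for r1 further down). A rough, untested sketch, in Python rather
than Perl and with made-up helper names, not code taken from rhcs_fence:

    #!/usr/bin/env python
    # Sketch only: check every minor of a multi-volume resource in /proc/drbd.
    import re

    def local_disk_states(minors, proc="/proc/drbd"):
        """Return {minor: local disk state} for the given minor numbers."""
        states = {}
        # Device lines look like:
        #  1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
        pattern = re.compile(r'^\s*(\d+):.*\bds:([A-Za-z]+)/')
        with open(proc) as f:
            for line in f:
                m = pattern.match(line)
                if m and int(m.group(1)) in minors:
                    states[int(m.group(1))] = m.group(2)
        return states

    def all_up_to_date(minor_list):
        """minor_list is the space separated list the handler is given, e.g. '1 4'."""
        minors = set(int(x) for x in minor_list.split())
        states = local_disk_states(minors)
        # Refuse to fence if any volume is missing or not UpToDate locally.
        return len(states) == len(minors) and all(s == "UpToDate" for s in states.values())

    if __name__ == "__main__":
        print(all_up_to_date("1 4"))  # the list rhcs_fence logs for r1

Splitting on whitespace and checking each minor like that should be
enough for the Perl script as well.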
> What I see in the logs for r2 looks fine, and I get:
>
> Oct 25 20:38:26 node2 rhcs_fence: 266; DEBUG: Attempt to fence node: [node1] exited with: [0]
> Oct 25 20:38:26 node2 rhcs_fence: 276; Fencing of: [node1] succeeded!
>
> at the end, after the delay.
>
> In the meantime attempts are made to fence r0 and r1 and I get:
>
> Oct 25 20:38:09 node2 rhcs_fence: 438; Local resource: [r1], minor: [1 4] is NOT 'UpToDate', will not fence peer.
> Oct 25 20:38:09 node2 kernel: d-con r1: helper command: /sbin/drbdadm fence-peer r1 exit code 1 (0x100)
> Oct 25 20:38:09 node2 kernel: d-con r1: fence-peer helper broken, returned 1
>
> I'm at the point where I'm not sure how to handle this safely, so any
> help is greatly appreciated.
>
> Matt
>
>
> # drbdadm dump
> # /etc/drbd.conf
> common {
>     net {
>         protocol C;
>         allow-two-primaries yes;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>     }
>     disk {
>         fencing resource-and-stonith;
>     }
>     startup {
>         wfc-timeout 300;
>         degr-wfc-timeout 120;
>         become-primary-on both;
>     }
>     handlers {
>         pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
>         local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>         fence-peer /usr/local/lib/drbd/rhcs_fence;
>         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>         out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
>     }
> }
>
> # resource r0 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r0.res:1
> resource r0 {
>     on node1 {
>         volume 0 {
>             device /dev/drbd0 minor 0;
>             disk /dev/sda4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd3 minor 3;
>             disk /dev/sdc2;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7788;
>     }
>     on node2 {
>         volume 0 {
>             device /dev/drbd0 minor 0;
>             disk /dev/sda4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd3 minor 3;
>             disk /dev/sdc2;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7788;
>     }
> }
>
> # resource r1 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r1.res:1
> resource r1 {
>     on node1 {
>         volume 0 {
>             device /dev/drbd1 minor 1;
>             disk /dev/sdb4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd4 minor 4;
>             disk /dev/sdc3;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7789;
>     }
>     on node2 {
>         volume 0 {
>             device /dev/drbd1 minor 1;
>             disk /dev/sdb4;
>             meta-disk internal;
>         }
>         volume 1 {
>             device /dev/drbd4 minor 4;
>             disk /dev/sdc3;
>             meta-disk internal;
>         }
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7789;
>     }
> }
>
> # resource r2 on node2: not ignored, not stacked
> # defined at /etc/drbd.d/r2.res:1
> resource r2 {
>     on node1 {
>         device /dev/drbd2 minor 2;
>         disk /dev/sdc1;
>         meta-disk internal;
>         address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7790;
>     }
>     on node2 {
>         device /dev/drbd2 minor 2;
>         disk /dev/sdc1;
>         meta-disk internal;
>         address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7790;
>     }
> }
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
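As to the "fence-peer helper broken, returned 1" kernel messages: DRBD
only accepts a small set of exit codes from the fence-peer handler, and
anything else is treated as a broken helper, so the IO of that resource
stays frozen. That is consistent with what you see above: r2, whose
fence call succeeded, is back to r----- and shows the peer as Outdated,
while r0 and r1 are still stuck at s---d-. From memory the convention is
roughly the table below; treat it as a reminder to check
conn_try_outdate_peer() in drbd_nl.c or the 8.4 user guide, not as code
from rhcs_fence:

    # Rough reference for how DRBD 8.4 interprets fence-peer handler exit
    # codes (from memory -- verify against drbd_nl.c / the user guide).
    FENCE_PEER_EXIT_CODES = {
        3: "peer's disk is already Inconsistent (or worse)",
        4: "peer was successfully outdated (or already was Outdated)",
        5: "peer unreachable, assumed dead; only honoured if the local disk is UpToDate",
        6: "peer is Primary; the local node outdates itself instead",
        7: "peer was fenced (STONITH); meant for 'fencing resource-and-stonith'",
    }

    def describe(exit_code):
        # Anything outside the table -- such as the 1 returned for r0/r1 above --
        # makes the kernel log "fence-peer helper broken" and leaves IO suspended.
        return FENCE_PEER_EXIT_CODES.get(exit_code, "helper considered broken")

    print(describe(7))
    print(describe(1))

Once the minor-list handling is fixed, the r0/r1 handler invocations
should get as far as the actual fence call and return an exit code DRBD
accepts (presumably 7 after a successful STONITH), at which point DRBD
outdates the peer and resumes IO on those volumes by itself.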
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed