Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi there,

I've been following Digimer's guide at https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial with a couple of differences: I'm using DRBD 8.4.2-1 from ELRepo and Digimer's own rhcs_fence script, currently 2.5.0. My configuration (full output of drbdadm dump is below) has three resources: two with two volumes each, and one with just a single volume. There are three CLVM VGs: two on the volumes of the two-volume resources, and one on the single resource. The latter carries a GFS2 filesystem.

Everything works as expected, with the exception of fencing. If I freeze a node with an echo c > /proc/sysrq-trigger, the fencing routines kick in. When this first happens I get:

# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag at Build64R6, 2012-09-06 08:16:10
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:64 nr:0 dw:64 dr:11980 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

After a while, when rhcs_fence returns after being called against r2, I get:

# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag at Build64R6, 2012-09-06 08:16:10
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:0 dr:1588 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:16 dr:6552 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-----
    ns:64 nr:0 dw:64 dr:11988 al:3 bm:3 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 3: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:0 dr:1436 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 4: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:12 dr:5636 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

The other node gets fenced and reboots. If I don't run drbdadm resume-io all before DRBD comes up on the other node, things get hairy, although reloading DRBD on the node that stayed up generally sorts out any problems. If I do run drbdadm resume-io all, DRBD starts without problems on the previously fenced node (exact sequence below). r2 continues to be responsive during the time the other node takes to reboot, but the r0 and r1 volumes (minors 0, 1, 3 and 4 in the output above) don't respond, and LVM queries hang until the other node comes back.

In the logs, r2 looks fine: at the end, after the delay, I get:

Oct 25 20:38:26 node2 rhcs_fence: 266; DEBUG: Attempt to fence node: [node1] exited with: [0]
Oct 25 20:38:26 node2 rhcs_fence: 276; Fencing of: [node1] succeeded!
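As an aside, since I mention drbdadm resume-io all above, this is roughly the post-fence recovery sequence I've settled on (the ordering is my own working assumption rather than anything from the tutorial, so corrections are welcome):

# On the surviving node, once rhcs_fence reports success,
# un-suspend I/O on the frozen resources:
drbdadm resume-io all

# The minors should now show r----- rather than s---d-:
cat /proc/drbd

# Only then bring DRBD back up on the rebooted peer:
service drbd start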
In the meantime, attempts are made to fence r0 and r1, and I get:

Oct 25 20:38:09 node2 rhcs_fence: 438; Local resource: [r1], minor: [1 4] is NOT 'UpToDate', will not fence peer.
Oct 25 20:38:09 node2 kernel: d-con r1: helper command: /sbin/drbdadm fence-peer r1 exit code 1 (0x100)
Oct 25 20:38:09 node2 kernel: d-con r1: fence-peer helper broken, returned 1

I'm at the point where I'm not sure how to handle this safely, so any help is greatly appreciated.

Matt

# drbdadm dump
# /etc/drbd.conf
common {
    net {
        protocol C;
        allow-two-primaries yes;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    disk {
        fencing resource-and-stonith;
    }
    startup {
        wfc-timeout 300;
        degr-wfc-timeout 120;
        become-primary-on both;
    }
    handlers {
        pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer /usr/local/lib/drbd/rhcs_fence;
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
    }
}

# resource r0 on node2: not ignored, not stacked
# defined at /etc/drbd.d/r0.res:1
resource r0 {
    on node1 {
        volume 0 {
            device /dev/drbd0 minor 0;
            disk /dev/sda4;
            meta-disk internal;
        }
        volume 1 {
            device /dev/drbd3 minor 3;
            disk /dev/sdc2;
            meta-disk internal;
        }
        address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7788;
    }
    on node2 {
        volume 0 {
            device /dev/drbd0 minor 0;
            disk /dev/sda4;
            meta-disk internal;
        }
        volume 1 {
            device /dev/drbd3 minor 3;
            disk /dev/sdc2;
            meta-disk internal;
        }
        address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7788;
    }
}

# resource r1 on node2: not ignored, not stacked
# defined at /etc/drbd.d/r1.res:1
resource r1 {
    on node1 {
        volume 0 {
            device /dev/drbd1 minor 1;
            disk /dev/sdb4;
            meta-disk internal;
        }
        volume 1 {
            device /dev/drbd4 minor 4;
            disk /dev/sdc3;
            meta-disk internal;
        }
        address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7789;
    }
    on node2 {
        volume 0 {
            device /dev/drbd1 minor 1;
            disk /dev/sdb4;
            meta-disk internal;
        }
        volume 1 {
            device /dev/drbd4 minor 4;
            disk /dev/sdc3;
            meta-disk internal;
        }
        address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7789;
    }
}

# resource r2 on node2: not ignored, not stacked
# defined at /etc/drbd.d/r2.res:1
resource r2 {
    on node1 {
        device /dev/drbd2 minor 2;
        disk /dev/sdc1;
        meta-disk internal;
        address ipv6 [fd5f:a481:cea4:2f50:460e:88af:494c:209f]:7790;
    }
    on node2 {
        device /dev/drbd2 minor 2;
        disk /dev/sdc1;
        meta-disk internal;
        address ipv6 [fd5f:a481:cea4:2f50:9e28:2c8d:fd75:74f6]:7790;
    }
}
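P.S. In case it's relevant: my assumption, going by the "minor: [1 4]" message above, is that rhcs_fence checks the disk state of every minor belonging to the resource. The per-volume state it would see can be read with drbdadm directly, one line per volume:

# Disk state per volume (minors 1 and 4 for r1):
drbdadm dstate r1

# Connection state and role for comparison:
drbdadm cstate r1
drbdadm role r1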