Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Is there a way to make this work properly without STONITH? I forgot to
mention that both nodes are virtual machines (QEMU/KVM), which makes
STONITH a minor challenge. Also, since these symptoms occur even under
"pcs cluster standby", where STONITH *shouldn't* be invoked, I'm not
sure if that's the entire answer.
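
For what it's worth, if the answer turns out to be "no, you really do
need STONITH", my tentative fallback for these KVM guests would be
fence_xvm backed by fence_virtd on the hypervisor. The sketch below is
untested and assumes the libvirt domain names match the cluster node
names, the default /etc/cluster/fence_xvm.key location, and fence
device names of my own invention:

    # On the KVM host: install/configure fence_virtd (multicast
    # listener, libvirt backend) and copy /etc/cluster/fence_xvm.key
    # to both guests.

    # On one cluster node: one fence device per guest.
    pcs stonith create fence_nfsnode01 fence_xvm \
        port="nfsnode01" pcmk_host_list="nfsnode01"
    pcs stonith create fence_nfsnode02 fence_xvm \
        port="nfsnode02" pcmk_host_list="nfsnode02"

    # Keep each fence device off the node it is meant to kill.
    pcs constraint location fence_nfsnode01 avoids nfsnode01
    pcs constraint location fence_nfsnode02 avoids nfsnode02
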
On 03/16/2016 13:34 -0400, Digimer wrote:
>> On 16/03/16 01:17 PM, Tim Walberg wrote:
>> > Having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
>> > (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
>> > resources consist of a cluster address, a DRBD device mirroring between
>> > the two cluster nodes, the file system, and the nfs-server resource. The
>> > resources all behave properly until an extended failover or outage.
>> >
>> > I have tested failover in several ways ("pcs cluster standby", "pcs
>> > cluster stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger", etc.)
>> > and the symptoms are that, until the killed node is brought back into
>> > the cluster, failover never seems to complete. The DRBD device appears
>> > on the remaining node to be in a "Secondary/Unknown" state, and the
>> > resources end up looking like:
>> >
>> > # pcs status
>> > Cluster name: nfscluster
>> > Last updated: Wed Mar 16 12:05:33 2016    Last change: Wed Mar 16
>> > 12:04:46 2016 by root via cibadmin on nfsnode01
>> > Stack: corosync
>> > Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition
>> > with quorum
>> > 2 nodes and 5 resources configured
>> >
>> > Online: [ nfsnode01 ]
>> > OFFLINE: [ nfsnode02 ]
>> >
>> > Full list of resources:
>> >
>> >  nfsVIP         (ocf::heartbeat:IPaddr2):       Started nfsnode01
>> >  nfs-server     (systemd:nfs-server):           Stopped
>> >  Master/Slave Set: drbd_master [drbd_dev]
>> >      Slaves: [ nfsnode01 ]
>> >      Stopped: [ nfsnode02 ]
>> >  drbd_fs        (ocf::heartbeat:Filesystem):    Stopped
>> >
>> > PCSD Status:
>> >   nfsnode01: Online
>> >   nfsnode02: Online
>> >
>> > Daemon Status:
>> >   corosync: active/enabled
>> >   pacemaker: active/enabled
>> >   pcsd: active/enabled
>> >
>> > As soon as I bring the second node back online, the failover completes.
>> > But this is obviously not a good state, as an extended outage for any
>> > reason on one node essentially kills the cluster services. There's
>> > obviously something I've missed in configuring the resources, but I
>> > haven't been able to pinpoint it yet.
>> >
>> > Perusing the logs, it appears that, upon the initial failure, DRBD does
>> > in fact promote the drbd_master resource, but immediately after that,
>> > pengine calls for it to be demoted for reasons I haven't been able to
>> > determine yet, but which seem to be tied to the fencing configuration. I
>> > can see that the crm-fence-peer.sh script is called, but it almost seems
>> > like it's fencing the wrong node... Indeed, I do see that it adds a
>> > -INFINITY location constraint for the surviving node, which would
>> > explain the decision to demote the DRBD master.
>> >
>> > My DRBD resource looks like this:
>> >
>> > # cat /etc/drbd.d/drbd0.res
>> > resource drbd0 {
>> >
>> >     protocol C;
>> >     startup { wfc-timeout 0; degr-wfc-timeout 120; }
>> >
>> >     disk {
>> >         on-io-error detach;
>> >         fencing resource-only;
>>
>> This should be 'resource-and-stonith;', but alone won't do anything
>> until pacemaker's stonith is working.
>>
>> >     }
>> >
>> >     handlers {
>> >         fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>> >         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>> >     }
>> >
>> >     on nfsnode01 {
>> >         device /dev/drbd0;
>> >         disk /dev/vg_nfs/lv_drbd0;
>> >         meta-disk internal;
>> >         address 10.0.0.2:7788;
>> >     }
>> >
>> >     on nfsnode02 {
>> >         device /dev/drbd0;
>> >         disk /dev/vg_nfs/lv_drbd0;
>> >         meta-disk internal;
>> >         address 10.0.0.3:7788;
>> >     }
>> > }
>> >
>> > If I comment out the three lines having to do with fencing, the failover
>> > works properly. But I'd prefer to have the fencing there in the odd
>> > chance that we end up with a split brain instead of just a node outage...
>> >
>> > And, here's "pcs config --full":
>> >
>> > # pcs config --full
>> > Cluster Name: nfscluster
>> > Corosync Nodes:
>> >  nfsnode01 nfsnode02
>> > Pacemaker Nodes:
>> >  nfsnode01 nfsnode02
>> >
>> > Resources:
>> >  Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
>> >   Attributes: ip=10.0.0.1 cidr_netmask=24
>> >   Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
>> >               stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
>> >               monitor interval=15s (nfsVIP-monitor-interval-15s)
>> >  Resource: nfs-server (class=systemd type=nfs-server)
>> >   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>> >  Master: drbd_master
>> >   Meta Attrs: master-max=1 master-node-max=1 clone-max=2
>> >               clone-node-max=1 notify=true
>> >   Resource: drbd_dev (class=ocf provider=linbit type=drbd)
>> >    Attributes: drbd_resource=drbd0
>> >    Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
>> >                promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
>> >                demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
>> >                stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
>> >                monitor interval=29s role=Master
>> >                (drbd_dev-monitor-interval-29s)
>> >                monitor interval=31s role=Slave
>> >                (drbd_dev-monitor-interval-31s)
>> >  Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)
>> >   Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs
>> >   Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)
>> >               stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)
>> >               monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)
>> >
>> > Stonith Devices:
>> > Fencing Levels:
>> >
>> > Location Constraints:
>> > Ordering Constraints:
>> >   start nfsVIP then start nfs-server (kind:Mandatory)
>> >   (id:order-nfsVIP-nfs-server-mandatory)
>> >   start drbd_fs then start nfs-server (kind:Mandatory)
>> >   (id:order-drbd_fs-nfs-server-mandatory)
>> >   promote drbd_master then start drbd_fs (kind:Mandatory)
>> >   (id:order-drbd_master-drbd_fs-mandatory)
>> > Colocation Constraints:
>> >   nfs-server with nfsVIP (score:INFINITY)
>> >   (id:colocation-nfs-server-nfsVIP-INFINITY)
>> >   nfs-server with drbd_fs (score:INFINITY)
>> >   (id:colocation-nfs-server-drbd_fs-INFINITY)
>> >   drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master)
>> >   (id:colocation-drbd_fs-drbd_master-INFINITY)
>> >
>> > Resources Defaults:
>> >  resource-stickiness: 100
>> >  failure-timeout: 60
>> > Operations Defaults:
>> >  No defaults set
>> >
>> > Cluster Properties:
>> >  cluster-infrastructure: corosync
>> >  cluster-name: nfscluster
>> >  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>> >  have-watchdog: false
>> >  maintenance-mode: false
>> >  stonith-enabled: false
>>
>> Configure *and test* stonith in pacemaker first, then DRBD will hook
>> into it and use it properly. DRBD simply asks pacemaker to do the fence,
>> but you currently don't have it set up.
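
And if STONITH is indeed the prerequisite here, the pacemaker-side
sequence I'm assuming (untested, using the fence device names from my
sketch above) would be roughly:

    # Verify pacemaker can actually fence a node before relying on it.
    pcs stonith fence nfsnode02

    # Re-enable fencing cluster-wide.
    pcs property set stonith-enabled=true

    # Only then switch drbd0.res from "fencing resource-only;" to
    # "fencing resource-and-stonith;" and leave the crm-fence-peer.sh /
    # crm-unfence-peer.sh handlers in place.
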
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user

End of included message

--
twalberg at gmail.com