Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 16/03/16 01:17 PM, Tim Walberg wrote:
> Having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
> (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
> resources consist of a cluster address, a DRBD device mirroring between
> the two cluster nodes, the file system, and the nfs-server resource. The
> resources all behave properly until an extended failover or outage.
>
> I have tested failover in several ways ("pcs cluster standby", "pcs
> cluster stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger", etc.)
> and the symptoms are that, until the killed node is brought back into
> the cluster, failover never seems to complete. The DRBD device appears
> on the remaining node to be in a "Secondary/Unknown" state, and the
> resources end up looking like:
>
> # pcs status
> Cluster name: nfscluster
> Last updated: Wed Mar 16 12:05:33 2016    Last change: Wed Mar 16
> 12:04:46 2016 by root via cibadmin on nfsnode01
> Stack: corosync
> Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition
> with quorum
> 2 nodes and 5 resources configured
>
> Online: [ nfsnode01 ]
> OFFLINE: [ nfsnode02 ]
>
> Full list of resources:
>
>  nfsVIP         (ocf::heartbeat:IPaddr2):       Started nfsnode01
>  nfs-server     (systemd:nfs-server):           Stopped
>  Master/Slave Set: drbd_master [drbd_dev]
>      Slaves: [ nfsnode01 ]
>      Stopped: [ nfsnode02 ]
>  drbd_fs        (ocf::heartbeat:Filesystem):    Stopped
>
> PCSD Status:
>   nfsnode01: Online
>   nfsnode02: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> As soon as I bring the second node back online, the failover completes.
> But this is obviously not a good state, as an extended outage for any
> reason on one node essentially kills the cluster services. There's
> obviously something I've missed in configuring the resources, but I
> haven't been able to pinpoint it yet.
>
> Perusing the logs, it appears that, upon the initial failure, DRBD does
> in fact promote the drbd_master resource, but immediately after that,
> pengine calls for it to be demoted for reasons I haven't been able to
> determine yet, but seems to be tied to the fencing configuration. I can
> see that the crm-fence-peer.sh script is called, but it almost seems
> like it's fencing the wrong node... Indeed, I do see that it adds a
> -INFINITY location constraint for the surviving node, which would
> explain the decision to demote the DRBD master.
>
> My DRBD resource looks like this:
>
> # cat /etc/drbd.d/drbd0.res
> resource drbd0 {
>
>         protocol C;
>         startup { wfc-timeout 0; degr-wfc-timeout 120; }
>
>         disk {
>                 on-io-error detach;
>                 fencing resource-only;

This should be 'resource-and-stonith;' (a sketch of the adjusted stanzas
is further down), but alone it won't do anything until pacemaker's
stonith is working.

>         }
>
>         handlers {
>                 fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                 after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>         }
>
>         on nfsnode01 {
>                 device /dev/drbd0;
>                 disk /dev/vg_nfs/lv_drbd0;
>                 meta-disk internal;
>                 address 10.0.0.2:7788;
>         }
>
>         on nfsnode02 {
>                 device /dev/drbd0;
>                 disk /dev/vg_nfs/lv_drbd0;
>                 meta-disk internal;
>                 address 10.0.0.3:7788;
>         }
> }
>
> If I comment out the three lines having to do with fencing, the failover
> works properly. But I'd prefer to have the fencing there in the odd
> chance that we end up with a split brain instead of just a node outage...
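For what it's worth, a minimal sketch of how the fencing-related stanzas
in drbd0.res might look with that change, keeping the handlers already
shown in the quoted config (untested here, and only useful once stonith
is actually working in pacemaker):

        disk {
                on-io-error detach;
                fencing resource-and-stonith;
        }

        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }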
> > And, here's "pcs config --full": > > # pcs config --full > Cluster Name: nfscluster > Corosync Nodes: > nfsnode01 nfsnode02 > Pacemaker Nodes: > nfsnode01 nfsnode02 > > Resources: > Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2) > Attributes: ip=10.0.0.1 cidr_netmask=24 > Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s) > stop interval=0s timeout=20s (nfsVIP-stop-interval-0s) > monitor interval=15s (nfsVIP-monitor-interval-15s) > Resource: nfs-server (class=systemd type=nfs-server) > Operations: monitor interval=60s (nfs-server-monitor-interval-60s) > Master: drbd_master > Meta Attrs: master-max=1 master-node-max=1 clone-max=2 > clone-node-max=1 notify=true > Resource: drbd_dev (class=ocf provider=linbit type=drbd) > Attributes: drbd_resource=drbd0 > Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s) > promote interval=0s timeout=90 (drbd_dev-promote-interval-0s) > demote interval=0s timeout=90 (drbd_dev-demote-interval-0s) > stop interval=0s timeout=100 (drbd_dev-stop-interval-0s) > monitor interval=29s role=Master > (drbd_dev-monitor-interval-29s) > monitor interval=31s role=Slave > (drbd_dev-monitor-interval-31s) > Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem) > Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs > Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s) > stop interval=0s timeout=60 (drbd_fs-stop-interval-0s) > monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20) > > Stonith Devices: > Fencing Levels: > > Location Constraints: > Ordering Constraints: > start nfsVIP then start nfs-server (kind:Mandatory) > (id:order-nfsVIP-nfs-server-mandatory) > start drbd_fs then start nfs-server (kind:Mandatory) > (id:order-drbd_fs-nfs-server-mandatory) > promote drbd_master then start drbd_fs (kind:Mandatory) > (id:order-drbd_master-drbd_fs-mandatory) > Colocation Constraints: > nfs-server with nfsVIP (score:INFINITY) > (id:colocation-nfs-server-nfsVIP-INFINITY) > nfs-server with drbd_fs (score:INFINITY) > (id:colocation-nfs-server-drbd_fs-INFINITY) > drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master) > (id:colocation-drbd_fs-drbd_master-INFINITY) > > Resources Defaults: > resource-stickiness: 100 > failure-timeout: 60 > Operations Defaults: > No defaults set > > Cluster Properties: > cluster-infrastructure: corosync > cluster-name: nfscluster > dc-version: 1.1.13-10.el7_2.2-44eb2dd > have-watchdog: false > maintenance-mode: false > stonith-enabled: false Configure *and test* stonith in pacemaker first, then DRBD will hook into it and use it properly. DRBD simply asks pacemaker to do the fence, but you currently don't have it setup. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?