I'm having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
(drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
resources consist of a cluster address, a DRBD device mirroring between the
two cluster nodes, the file system, and the nfs-server resource.

The resources all behave properly until an extended failover or outage. I
have tested failover in several ways ("pcs cluster standby", "pcs cluster
stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger", etc.), and the
symptom is that, until the killed node is brought back into the cluster,
failover never seems to complete. The DRBD device on the remaining node
appears to be in a "Secondary/Unknown" state, and the resources end up
looking like:

# pcs status
Cluster name: nfscluster
Last updated: Wed Mar 16 12:05:33 2016
Last change: Wed Mar 16 12:04:46 2016 by root via cibadmin on nfsnode01
Stack: corosync
Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 5 resources configured

Online: [ nfsnode01 ]
OFFLINE: [ nfsnode02 ]

Full list of resources:

 nfsVIP (ocf::heartbeat:IPaddr2): Started nfsnode01
 nfs-server (systemd:nfs-server): Stopped
 Master/Slave Set: drbd_master [drbd_dev]
     Slaves: [ nfsnode01 ]
     Stopped: [ nfsnode02 ]
 drbd_fs (ocf::heartbeat:Filesystem): Stopped

PCSD Status:
  nfsnode01: Online
  nfsnode02: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

As soon as I bring the second node back online, the failover completes. But
this is obviously not a good state, as an extended outage for any reason on
one node essentially kills the cluster services. There's clearly something
I've missed in configuring the resources, but I haven't been able to
pinpoint it yet.
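For anyone reproducing this: the stuck state is visible directly in /proc/drbd on the surviving node. A minimal sketch of pulling the connection state and roles out of that line; the sample line below is illustrative (an assumption matching the symptoms described, not copied from my logs):

```shell
# A status line like the one /proc/drbd shows on the surviving node while
# the peer is down (sample text is illustrative, not from my logs):
line='0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----'

# Extract the connection state (cs:) and the local/peer roles (ro:):
cs=$(echo "$line" | grep -o 'cs:[^ ]*' | cut -d: -f2)
ro=$(echo "$line" | grep -o 'ro:[^ ]*' | cut -d: -f2-)
echo "connection=$cs roles=$ro"   # connection=WFConnection roles=Secondary/Unknown
```

On a live node, "drbdadm cstate drbd0" and "drbdadm role drbd0" report the same two fields without any parsing.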
Perusing the logs, it appears that, upon the initial failure, DRBD does in
fact promote the drbd_master resource, but immediately afterwards pengine
calls for it to be demoted, for reasons I haven't been able to determine
yet but which seem to be tied to the fencing configuration. I can see that
the crm-fence-peer.sh script is called, but it almost seems like it's
fencing the wrong node... Indeed, I do see that it adds a -INFINITY
location constraint for the surviving node, which would explain the
decision to demote the DRBD master.

My DRBD resource looks like this:

# cat /etc/drbd.d/drbd0.res
resource drbd0 {
    protocol C;

    startup {
        wfc-timeout 0;
        degr-wfc-timeout 120;
    }

    disk {
        on-io-error detach;
        fencing resource-only;
    }

    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

    on nfsnode01 {
        device /dev/drbd0;
        disk /dev/vg_nfs/lv_drbd0;
        meta-disk internal;
        address 10.0.0.2:7788;
    }

    on nfsnode02 {
        device /dev/drbd0;
        disk /dev/vg_nfs/lv_drbd0;
        meta-disk internal;
        address 10.0.0.3:7788;
    }
}

If I comment out the three lines having to do with fencing, the failover
works properly. But I'd prefer to keep the fencing there, on the off chance
that we end up with a split brain instead of just a node outage...
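Since crm-fence-peer.sh leaves its mark as a location constraint in the CIB, the constraint can be confirmed and, once the local data is known good, manually cleared. A sketch, assuming the script's usual "drbd-fence-by-handler-" id prefix; the exact id in the last command is an example and should be taken from the output of the first:

```shell
# List all constraints with their ids and look for the one the DRBD
# fence-peer handler added:
pcs constraint --full
cibadmin --query | grep drbd-fence-by-handler   # same thing, raw from the CIB

# Normally crm-unfence-peer.sh removes this after resync completes.
# Removing it by hand is only safe once this node's data is known good;
# the id below is an example, use the one reported above:
pcs constraint remove drbd-fence-by-handler-drbd0-drbd_master
```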
And, here's "pcs config --full":

# pcs config --full
Cluster Name: nfscluster

Corosync Nodes:
 nfsnode01 nfsnode02
Pacemaker Nodes:
 nfsnode01 nfsnode02

Resources:
 Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.0.0.1 cidr_netmask=24
  Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
              stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
              monitor interval=15s (nfsVIP-monitor-interval-15s)
 Resource: nfs-server (class=systemd type=nfs-server)
  Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
 Master: drbd_master
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd_dev (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=drbd0
   Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
               promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
               demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
               stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
               monitor interval=29s role=Master (drbd_dev-monitor-interval-29s)
               monitor interval=31s role=Slave (drbd_dev-monitor-interval-31s)
 Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs
  Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)
              stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)
              monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start nfsVIP then start nfs-server (kind:Mandatory) (id:order-nfsVIP-nfs-server-mandatory)
  start drbd_fs then start nfs-server (kind:Mandatory) (id:order-drbd_fs-nfs-server-mandatory)
  promote drbd_master then start drbd_fs (kind:Mandatory) (id:order-drbd_master-drbd_fs-mandatory)
Colocation Constraints:
  nfs-server with nfsVIP (score:INFINITY) (id:colocation-nfs-server-nfsVIP-INFINITY)
  nfs-server with drbd_fs (score:INFINITY) (id:colocation-nfs-server-drbd_fs-INFINITY)
  drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master) (id:colocation-drbd_fs-drbd_master-INFINITY)

Resources Defaults:
 resource-stickiness: 100
 failure-timeout: 60
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: nfscluster
 dc-version: 1.1.13-10.el7_2.2-44eb2dd
 have-watchdog: false
 maintenance-mode: false
 stonith-enabled: false
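For completeness: "stonith-enabled: false" in the properties above means Pacemaker can never positively confirm that the lost peer is really down, which interacts badly with DRBD's fence-peer handler. A sketch of what adding real node-level fencing could look like; the agent choice (fence_ipmilan), addresses, and credentials are all placeholders, not part of this cluster:

```shell
# Hypothetical stonith devices, one per node, fencing via each node's
# IPMI/BMC interface (all addresses and credentials are placeholders):
pcs stonith create fence_nfsnode01 fence_ipmilan \
    pcmk_host_list=nfsnode01 ipaddr=10.0.0.102 login=admin passwd=secret \
    op monitor interval=60s
pcs stonith create fence_nfsnode02 fence_ipmilan \
    pcmk_host_list=nfsnode02 ipaddr=10.0.0.103 login=admin passwd=secret \
    op monitor interval=60s

# Only turn stonith on once the devices above actually work:
pcs property set stonith-enabled=true
```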