Hi,

I have a basic 2-node active/passive cluster with Pacemaker (1.1.14, pcs 0.9.148) / CMAN (3.0.12.1) / Corosync (1.4.7) on RHEL 6.8. This cluster runs NFS on top of DRBD (8.4.4).

Basically the system is working on both nodes, and I can switch the resources from one node to the other. But switching does not work if I try to move just one resource and have the others follow due to the colocation constraints.

From the logged messages I see that in this “failure case” there is NO attempt to demote/promote the DRBD clone resource.

Here is my setup:

==================================================================
Cluster Name: clst1
Corosync Nodes:
 ventsi-clst1-sync ventsi-clst2-sync
Pacemaker Nodes:
 ventsi-clst1-sync ventsi-clst2-sync

Resources:
 Resource: IPaddrNFS (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=xxx.xxx.xxx.xxx cidr_netmask=24
  Operations: start interval=0s timeout=20s (IPaddrNFS-start-interval-0s)
              stop interval=0s timeout=20s (IPaddrNFS-stop-interval-0s)
              monitor interval=5s (IPaddrNFS-monitor-interval-5s)
 Resource: NFSServer (class=ocf provider=heartbeat type=nfsserver)
  Attributes: nfs_shared_infodir=/var/lib/nfsserversettings/ nfs_ip=xxx.xxx.xxx.xxx nfsd_args="-H xxx.xxx.xxx.xxx"
  Operations: start interval=0s timeout=40 (NFSServer-start-interval-0s)
              stop interval=0s timeout=20s (NFSServer-stop-interval-0s)
              monitor interval=10s timeout=20s (NFSServer-monitor-interval-10s)
 Master: DRBDClone
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: DRBD (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=nfsdata
   Operations: start interval=0s timeout=240 (DRBD-start-interval-0s)
               promote interval=0s timeout=90 (DRBD-promote-interval-0s)
               demote interval=0s timeout=90 (DRBD-demote-interval-0s)
               stop interval=0s timeout=100 (DRBD-stop-interval-0s)
               monitor interval=1s timeout=5 (DRBD-monitor-interval-1s)
 Resource: DRBD_global_clst (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd1 directory=/drbdmnts/global_clst fstype=ext4
  Operations: start interval=0s timeout=60 (DRBD_global_clst-start-interval-0s)
              stop interval=0s timeout=60 (DRBD_global_clst-stop-interval-0s)
              monitor interval=20 timeout=40 (DRBD_global_clst-monitor-interval-20)

Stonith Devices:
 Resource: ipmi-fence-clst1 (class=stonith type=fence_ipmilan)
  Attributes: lanplus=1 login=foo passwd=bar action=reboot ipaddr=yyy.yyy.yyy.yyy pcmk_host_check=static-list pcmk_host_list=ventsi-clst1-sync auth=password timeout=30 cipher=1
  Operations: monitor interval=60s (ipmi-fence-clst1-monitor-interval-60s)
 Resource: ipmi-fence-clst2 (class=stonith type=fence_ipmilan)
  Attributes: lanplus=1 login=foo passwd=bar action=reboot ipaddr=zzz.zzz.zzz.zzz pcmk_host_check=static-list pcmk_host_list=ventsi-clst2-sync auth=password timeout=30 cipher=1
  Operations: monitor interval=60s (ipmi-fence-clst2-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: ipmi-fence-clst1
    Disabled on: ventsi-clst1-sync (score:-INFINITY) (id:location-ipmi-fence-clst1-ventsi-clst1-sync--INFINITY)
  Resource: ipmi-fence-clst2
    Disabled on: ventsi-clst2-sync (score:-INFINITY) (id:location-ipmi-fence-clst2-ventsi-clst2-sync--INFINITY)
Ordering Constraints:
  start IPaddrNFS then start NFSServer (kind:Mandatory) (id:order-IPaddrNFS-NFSServer-mandatory)
  promote DRBDClone then start DRBD_global_clst (kind:Mandatory) (id:order-DRBDClone-DRBD_global_clst-mandatory)
  start DRBD_global_clst then start IPaddrNFS (kind:Mandatory) (id:order-DRBD_global_clst-IPaddrNFS-mandatory)
Colocation Constraints:
  NFSServer with IPaddrNFS (score:INFINITY) (id:colocation-NFSServer-IPaddrNFS-INFINITY)
  DRBD_global_clst with DRBDClone (score:INFINITY) (id:colocation-DRBD_global_clst-DRBDClone-INFINITY)
  IPaddrNFS with DRBD_global_clst (score:INFINITY) (id:colocation-IPaddrNFS-DRBD_global_clst-INFINITY)

Resources Defaults:
 resource-stickiness: INFINITY
Operations Defaults:
 timeout: 10s

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.14-8.el6-70404b0
 have-watchdog: false
 last-lrm-refresh: 1478277432
 no-quorum-policy: ignore
 stonith-enabled: true
 symmetric-cluster: true
==================================================================

Initial state is e.g. this (all resources at node1):

Online: [ ventsi-clst1-sync ventsi-clst2-sync ]

Full list of resources:

 ipmi-fence-clst1 (stonith:fence_ipmilan): Started ventsi-clst2-sync
 ipmi-fence-clst2 (stonith:fence_ipmilan): Started ventsi-clst1-sync
 IPaddrNFS (ocf::heartbeat:IPaddr2): Started ventsi-clst1-sync
 NFSServer (ocf::heartbeat:nfsserver): Started ventsi-clst1-sync
 Master/Slave Set: DRBDClone [DRBD]
     Masters: [ ventsi-clst1-sync ]
     Slaves: [ ventsi-clst2-sync ]
 DRBD_global_clst (ocf::heartbeat:Filesystem): Started ventsi-clst1-sync

==================================================================

If I shut down the cluster at node 1 (‘pcs cluster stop’) or if I move the DRBD clone resource (‘pcs resource move DRBDClone’), all resources switch successfully to node2. I.e. the demote/promote of the DRBD clone resource is working in these cases.

But if I try to move any other resource (e.g. ‘pcs resource move NFSServer’), the resources NFSServer, IPaddrNFS and DRBD_global_clst are stopped at node 1, and the start of the DRBD_global_clst resource at node2 follows immediately, which fails due to the missing demote/promote.

As far as I can see there is some follow-up attempt to repair things partially, as the resources are started again at node1, except for the resource that I moved with my move command.

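For reference, the start sequence implied by the three ordering constraints in the setup can be worked out mechanically. The following is only a minimal Python sketch of that reasoning, not how Pacemaker's policy engine actually computes a transition; the tuple representation and the function name `expected_sequence` are made up for illustration.

```python
# Ordering constraints copied from the "Ordering Constraints" section of the
# setup, written as (first_action, then_action). This tuple form is only an
# illustration and is not a Pacemaker or pcs data format.
ORDER_CONSTRAINTS = [
    ("start IPaddrNFS", "start NFSServer"),
    ("promote DRBDClone", "start DRBD_global_clst"),
    ("start DRBD_global_clst", "start IPaddrNFS"),
]


def expected_sequence(constraints):
    """Return the actions in an order that satisfies every (first, then) pair."""
    # Collect the distinct actions in the order they are first mentioned.
    actions = []
    for first, then in constraints:
        for action in (first, then):
            if action not in actions:
                actions.append(action)

    # Count how many prerequisites each action is still waiting for.
    pending = {action: 0 for action in actions}
    for _, then in constraints:
        pending[then] += 1

    ordered = []
    ready = [action for action in actions if pending[action] == 0]
    while ready:
        current = ready.pop(0)
        ordered.append(current)
        # Release every action that was waiting for the one just scheduled.
        for first, then in constraints:
            if first == current:
                pending[then] -= 1
                if pending[then] == 0:
                    ready.append(then)

    if len(ordered) != len(actions):
        raise ValueError("ordering constraints contain a cycle")
    return ordered


if __name__ == "__main__":
    for step, action in enumerate(expected_sequence(ORDER_CONSTRAINTS), start=1):
        print(step, action)
```

With the constraints as listed, the promote of DRBDClone comes out first and the start of DRBD_global_clst second, which is the sequence that does not show up in the failing move.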
Final state is like this:

Online: [ ventsi-clst1-sync ventsi-clst2-sync ]

Full list of resources:

 ipmi-fence-clst1 (stonith:fence_ipmilan): Started ventsi-clst2-sync
 ipmi-fence-clst2 (stonith:fence_ipmilan): Started ventsi-clst1-sync
 IPaddrNFS (ocf::heartbeat:IPaddr2): Started ventsi-clst1-sync
 NFSServer (ocf::heartbeat:nfsserver): Stopped
 Master/Slave Set: DRBDClone [DRBD]
     Masters: [ ventsi-clst1-sync ]
     Slaves: [ ventsi-clst2-sync ]
 DRBD_global_clst (ocf::heartbeat:Filesystem): Started ventsi-clst1-sync

Failed Actions:
* DRBD_global_clst_start_0 on ventsi-clst2-sync 'unknown error' (1): call=778, status=complete, exitreason='none',
    last-rc-change='Fri Nov 4 19:32:56 2016', queued=0ms, exec=43ms

==================================================================

Here are the logged messages for this “failure case”:

2016-11-04T19:32:55.163982+01:00 ventsi-clst1 crmd[6116]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2016-11-04T19:32:55.168100+01:00 ventsi-clst1 pengine[6115]: notice: On loss of CCM Quorum: Ignore
2016-11-04T19:32:55.181252+01:00 ventsi-clst1 pengine[6115]: notice: Move IPaddrNFS#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
2016-11-04T19:32:55.181260+01:00 ventsi-clst1 pengine[6115]: notice: Move NFSServer#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
2016-11-04T19:32:55.181278+01:00 ventsi-clst1 pengine[6115]: notice: Move DRBD_global_clst#011(Started ventsi-clst1-sync -> ventsi-clst2-sync) <=== here no demote/promote is listed
2016-11-04T19:32:55.182385+01:00 ventsi-clst1 pengine[6115]: notice: Calculated Transition 202: /var/lib/pacemaker/pengine/pe-input-766.bz2
2016-11-04T19:32:55.182998+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 15: stop NFSServer_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:55.196265+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: INFO: Stopping NFS server ...
2016-11-04T19:32:55.249137+01:00 ventsi-clst1 kernel: nfsd: last server has exited, flushing export cache
2016-11-04T19:32:55.252241+01:00 ventsi-clst1 rpc.mountd[15282]: Caught signal 15, un-registering and exiting.
2016-11-04T19:32:55.632708+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: INFO: Stopping sm-notify
2016-11-04T19:32:55.650552+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: INFO: Stopping rpc.statd
2016-11-04T19:32:55.666777+01:00 ventsi-clst1 rpc.statd[15243]: Caught signal 15, un-registering and exiting
2016-11-04T19:32:56.692819+01:00 ventsi-clst1 nfsserver(NFSServer)[15978]: INFO: NFS server stopped
2016-11-04T19:32:56.695523+01:00 ventsi-clst1 crmd[6116]: notice: Operation NFSServer_stop_0: ok (node=ventsi-clst1-sync, call=1220, rc=0, cib-update=1695, confirmed=true)
2016-11-04T19:32:56.696243+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 12: stop IPaddrNFS_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.727882+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16108]: INFO: IP status = ok, IP_CIP=
2016-11-04T19:32:56.733383+01:00 ventsi-clst1 crmd[6116]: notice: Operation IPaddrNFS_stop_0: ok (node=ventsi-clst1-sync, call=1222, rc=0, cib-update=1696, confirmed=true)
2016-11-04T19:32:56.733917+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 48: stop DRBD_global_clst_stop_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.757181+01:00 ventsi-clst1 Filesystem(DRBD_global_clst)[16163]: INFO: Running stop for /dev/drbd1 on /drbdmnts/global_clst
2016-11-04T19:32:56.764684+01:00 ventsi-clst1 Filesystem(DRBD_global_clst)[16163]: INFO: Trying to unmount /drbdmnts/global_clst
2016-11-04T19:32:56.771260+01:00 ventsi-clst1 Filesystem(DRBD_global_clst)[16163]: INFO: unmounted /drbdmnts/global_clst successfully
2016-11-04T19:32:56.776640+01:00 ventsi-clst1 crmd[6116]: notice: Operation DRBD_global_clst_stop_0: ok (node=ventsi-clst1-sync, call=1224, rc=0, cib-update=1697, confirmed=true)
2016-11-04T19:32:56.777140+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 49: start DRBD_global_clst_start_0 on ventsi-clst2-sync <=== here is the attempt to start the filesystem at the other node, although DRBD has not yet been promoted
2016-11-04T19:32:56.840137+01:00 ventsi-clst1 crmd[6116]: warning: Action 49 (DRBD_global_clst_start_0) on ventsi-clst2-sync failed (target: 0 vs. rc: 1): Error
2016-11-04T19:32:56.840158+01:00 ventsi-clst1 crmd[6116]: notice: Transition aborted by DRBD_global_clst_start_0 'modify' on ventsi-clst2-sync: Event failed (magic=0:1;49:202:0:b7941532-c74b-40cc-a8ad-27b5502b8fba, cib=0.649.4, source=match_graph_event:381, 0)
2016-11-04T19:32:56.840232+01:00 ventsi-clst1 crmd[6116]: warning: Action 49 (DRBD_global_clst_start_0) on ventsi-clst2-sync failed (target: 0 vs. rc: 1): Error
2016-11-04T19:32:56.840328+01:00 ventsi-clst1 crmd[6116]: notice: Transition 202 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=5, Source=/var/lib/pacemaker/pengine/pe-input-766.bz2): Complete
2016-11-04T19:32:56.843693+01:00 ventsi-clst1 pengine[6115]: notice: On loss of CCM Quorum: Ignore
2016-11-04T19:32:56.844072+01:00 ventsi-clst1 pengine[6115]: warning: Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown error (1)
2016-11-04T19:32:56.844102+01:00 ventsi-clst1 pengine[6115]: warning: Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown error (1)
2016-11-04T19:32:56.845071+01:00 ventsi-clst1 pengine[6115]: notice: Start IPaddrNFS#011(ventsi-clst2-sync)
2016-11-04T19:32:56.845078+01:00 ventsi-clst1 pengine[6115]: notice: Start NFSServer#011(ventsi-clst2-sync)
2016-11-04T19:32:56.845081+01:00 ventsi-clst1 pengine[6115]: notice: Demote DRBD:0#011(Master -> Slave ventsi-clst1-sync) <=== here there would be the necessary demote/promote … but it’s too late; the start of the filesystem already failed …
2016-11-04T19:32:56.845083+01:00 ventsi-clst1 pengine[6115]: notice: Promote DRBD:1#011(Slave -> Master ventsi-clst2-sync)
2016-11-04T19:32:56.845084+01:00 ventsi-clst1 pengine[6115]: notice: Recover DRBD_global_clst#011(Started ventsi-clst2-sync)
2016-11-04T19:32:56.847986+01:00 ventsi-clst1 pengine[6115]: notice: Calculated Transition 203: /var/lib/pacemaker/pengine/pe-input-767.bz2 <=== … so the above transition gets caught by the following attempt to repair things partially
2016-11-04T19:32:56.867679+01:00 ventsi-clst1 pengine[6115]: notice: On loss of CCM Quorum: Ignore
2016-11-04T19:32:56.868074+01:00 ventsi-clst1 pengine[6115]: warning: Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown error (1)
2016-11-04T19:32:56.868101+01:00 ventsi-clst1 pengine[6115]: warning: Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: unknown error (1)
2016-11-04T19:32:56.868287+01:00 ventsi-clst1 pengine[6115]: warning: Forcing DRBD_global_clst away from ventsi-clst2-sync after 1000000 failures (max=1000000)
2016-11-04T19:32:56.869011+01:00 ventsi-clst1 pengine[6115]: notice: Start IPaddrNFS#011(ventsi-clst1-sync)
2016-11-04T19:32:56.869023+01:00 ventsi-clst1 pengine[6115]: notice: Recover DRBD_global_clst#011(Started ventsi-clst2-sync -> ventsi-clst1-sync)
2016-11-04T19:32:56.869770+01:00 ventsi-clst1 pengine[6115]: notice: Calculated Transition 204: /var/lib/pacemaker/pengine/pe-input-768.bz2
2016-11-04T19:32:56.870065+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 3: stop DRBD_global_clst_stop_0 on ventsi-clst2-sync
2016-11-04T19:32:56.908075+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 42: start DRBD_global_clst_start_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.931072+01:00 ventsi-clst1 Filesystem(DRBD_global_clst)[16242]: INFO: Running start for /dev/drbd1 on /drbdmnts/global_clst
2016-11-04T19:32:56.943250+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): warning: maximal mount count reached, running e2fsck is recommended
2016-11-04T19:32:56.953253+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): mounted filesystem with ordered data mode. Opts:
2016-11-04T19:32:56.964284+01:00 ventsi-clst1 crmd[6116]: notice: Operation DRBD_global_clst_start_0: ok (node=ventsi-clst1-sync, call=1225, rc=0, cib-update=1701, confirmed=true)
2016-11-04T19:32:56.965104+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 10: start IPaddrNFS_start_0 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.965325+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 43: monitor DRBD_global_clst_monitor_20000 on ventsi-clst1-sync (local)
2016-11-04T19:32:56.996235+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO: Adding inet address xxx.xxx.xxx.xxx/24 with broadcast address xxx.xxx.xxx.255 to device bond0
2016-11-04T19:32:57.002059+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO: Bringing device bond0 up
2016-11-04T19:32:57.008128+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-xxx.xxx.xxx.xxx bond0 xxx.xxx.xxx.xxx auto not_used not_used
2016-11-04T19:32:57.020159+01:00 ventsi-clst1 crmd[6116]: notice: Operation IPaddrNFS_start_0: ok (node=ventsi-clst1-sync, call=1226, rc=0, cib-update=1703, confirmed=true)
2016-11-04T19:32:57.020901+01:00 ventsi-clst1 crmd[6116]: notice: Initiating action 11: monitor IPaddrNFS_monitor_5000 on ventsi-clst1-sync (local)
2016-11-04T19:32:57.052231+01:00 ventsi-clst1 crmd[6116]: notice: Transition 204 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-768.bz2): Complete
2016-11-04T19:32:57.052251+01:00 ventsi-clst1 crmd[6116]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

==================================================================

Any ideas what could be the reason for this behavior? And how could this be fixed?

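To make the difference between the transitions in the log easier to see, the planned actions can be grouped by transition number. The script below is only a rough sketch under a few assumptions: it relies on the pengine lines having exactly the shape shown in this log (the planned actions first, then a "Calculated Transition N" line), and the names `SAMPLE_LOG`, `actions_per_transition` and `plans_promote` are made up for illustration.

```python
import re

# A few pengine lines copied from the log in this post, with the timestamp
# and host name trimmed. "#011" is how the tab character appears in the log.
SAMPLE_LOG = """\
pengine[6115]: notice: Move IPaddrNFS#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
pengine[6115]: notice: Move NFSServer#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
pengine[6115]: notice: Move DRBD_global_clst#011(Started ventsi-clst1-sync -> ventsi-clst2-sync)
pengine[6115]: notice: Calculated Transition 202: /var/lib/pacemaker/pengine/pe-input-766.bz2
pengine[6115]: notice: Start IPaddrNFS#011(ventsi-clst2-sync)
pengine[6115]: notice: Start NFSServer#011(ventsi-clst2-sync)
pengine[6115]: notice: Demote DRBD:0#011(Master -> Slave ventsi-clst1-sync)
pengine[6115]: notice: Promote DRBD:1#011(Slave -> Master ventsi-clst2-sync)
pengine[6115]: notice: Recover DRBD_global_clst#011(Started ventsi-clst2-sync)
pengine[6115]: notice: Calculated Transition 203: /var/lib/pacemaker/pengine/pe-input-767.bz2
"""

# Verb and resource name of a planned action; the name ends at "#011",
# at whitespace (a real tab or space), or at the opening parenthesis.
ACTION_RE = re.compile(
    r"pengine\[\d+\]:\s+notice:\s+"
    r"(Move|Start|Stop|Demote|Promote|Recover)\s+([^\s#(]+)"
)
TRANSITION_RE = re.compile(r"Calculated Transition (\d+):")


def actions_per_transition(log_text):
    """Map each transition number to the (verb, resource) pairs logged before it."""
    transitions = {}
    current = []
    for line in log_text.splitlines():
        action = ACTION_RE.search(line)
        if action:
            current.append((action.group(1), action.group(2)))
            continue
        calculated = TRANSITION_RE.search(line)
        if calculated:
            # The actions collected so far belong to this transition.
            transitions[int(calculated.group(1))] = current
            current = []
    return transitions


def plans_promote(actions):
    """Return True if any planned action in the list is a Promote."""
    return any(verb == "Promote" for verb, _ in actions)


if __name__ == "__main__":
    for number, actions in sorted(actions_per_transition(SAMPLE_LOG).items()):
        print(number, "promote planned:", plans_promote(actions), actions)
```

Run against the sample lines, transition 202 contains only the three Move actions, while the Demote/Promote pair first appears in transition 203, after the filesystem start has already failed.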
(I already found several articles on the internet recommending two separately configured monitor operations for the DRBD resource, one for the master role and another for the slave role. I already tried this, to no avail.)

Regards
Andi