Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

In a Pacemaker two-node cluster running dual-primary DRBD (drbd84) with GFS2/DLM/CLVM, the following issue occurs.

Steps:
---------
1) Successfully created a Pacemaker two-node cluster with the DRBD master/slave resource integrated.
2) Cluster nodes: server4 and server7.
3) The server4 node is rebooted.
4) When server4 comes up, server7 is stonith'd and is lost! The node server4 survives.

Problem:
-----------
The problem is step 4 above: when server4 comes up, why is server7 stonith'd?

From the surviving node server4, the DRBD logs look fine: DRBD moved to the Connected/UpToDate state. Then server7 is suddenly rebooted (stonith'd/fenced) between 00:47:35 and 00:47:42 in the logs below.

/var/log/messages at server4:
------------------------------------------------
May 10 00:47:35 server4 kernel: block drbd0: updated sync uuid 0594324E7C28AFF8:0000000000000000:D5926E3E7F02ED2F:0000000000000004
May 10 00:47:35 server4 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
May 10 00:47:35 server4 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
May 10 00:47:35 server4 kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
May 10 00:47:35 server4 kernel: block drbd0: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
May 10 00:47:35 server4 kernel: block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
May 10 00:47:35 server4 kernel: block drbd0: updated UUIDs DB4640C6B3831C4E:0000000000000000:0594324E7C28AFF8:0593324E7C28AFF9
May 10 00:47:35 server4 kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
May 10 00:47:35 server4 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
May 10 00:47:35 server4 crm-unfence-peer.sh[12985]: invoked for vDrbd
May 10 00:47:35 server4 crm-unfence-peer.sh[12985]: No constraint in place, nothing to do.
May 10 00:47:35 server4 kernel: block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
May 10 00:47:35 server4 crmd[12597]: notice: Result of start operation for dlm on server4ha: 0 (ok)
May 10 00:47:35 server4 stonith-ng[12593]: notice: vCluster-Stonith-server7ha can fence (reboot) server7ha: static-list
May 10 00:47:35 server4 stonith-ng[12593]: notice: vCluster-Stonith-server7ha can fence (reboot) server7ha: static-list
May 10 00:47:35 server4 crmd[12597]: notice: Result of notify operation for drbd_data on server4ha: 0 (ok)
May 10 00:47:41 server4 kernel: tg3 0000:02:00.1 em4: Link is down
May 10 00:47:42 server4 kernel: tg3 0000:02:00.0 em3: Link is down
May 10 00:47:42 server4 corosync[12570]: [TOTEM ] A processor failed, forming new configuration.
May 10 00:47:43 server4 stonith-ng[12593]: notice: Operation 'reboot' [13018] (call 2 from crmd.13562) for host 'server7ha' with device 'vCluster-Stonith-server7ha' returned: 0 (OK)
May 10 00:47:43 server4 corosync[12570]: [TOTEM ] A new membership (192.168.11.100:68) was formed. Members left: 2
May 10 00:47:43 server4 corosync[12570]: [TOTEM ] Failed to receive the leave message. failed: 2
May 10 00:47:43 server4 attrd[12595]: notice: Node server7ha state is now lost
May 10 00:47:43 server4 attrd[12595]: notice: Removing all server7ha attributes for peer loss
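The crm-unfence-peer.sh lines above come from the DRBD fence-peer / after-resync-target handlers, so resource-level fencing is wired into Pacemaker. For context, the fencing-related options of such a dual-primary drbd84 setup usually look like the sketch below; this is only the typical shape, the exact contents of my vDrbd.res / global_common.conf are in the attachments at the end of this mail.

DRBD fencing options (typical shape, not the attached files verbatim):
----------------------------------------------------------------------
resource vDrbd {
  net {
    protocol C;
    allow-two-primaries yes;       # needed for dual-primary with GFS2
  }
  disk {
    fencing resource-and-stonith;  # freeze I/O and ask the cluster to fence the peer
  }
  handlers {
    # These handlers produce the crm-(un)fence-peer.sh log lines above.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}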
Corosync logs at server4:
------------------------------------------
May 10 00:47:35 [12592] server4 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=server4ha/crmd/23, version=0.34.68)
May 10 00:47:35 [12597] server4 crmd: info: crmd_notify_complete: Alert 8 (/usr/lib64/vPacemaker/alert_snmp.sh) complete
May 10 00:47:35 [12597] server4 crmd: info: do_lrm_rsc_op: Performing key=15:2:0:377224d5-7e3c-4e55-91ef-3bd5e00ab71e op=dlm_monitor_60000
May 10 00:47:35 [12597] server4 crmd: info: do_lrm_rsc_op: Performing key=69:2:0:377224d5-7e3c-4e55-91ef-3bd5e00ab71e op=drbd_data_notify_0
May 10 00:47:35 [12593] server4 stonith-ng: notice: can_fence_host_with_device: vCluster-Stonith-server7ha can fence (reboot) server7ha: static-list
May 10 00:47:35 [12594] server4 lrmd: info: log_execute: executing - rsc:drbd_data action:notify call_id:36
May 10 00:47:35 [12593] server4 stonith-ng: notice: can_fence_host_with_device: vCluster-Stonith-server7ha can fence (reboot) server7ha: static-list
May 10 00:47:35 [12593] server4 stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'server7ha'
May 10 00:47:35 [12597] server4 crmd: info: process_lrm_event: Result of monitor operation for dlm on server4ha: 0 (ok) | call=35 key=dlm_monitor_60000 confirmed=false cib-update=24
May 10 00:47:35 [12592] server4 cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/24)
May 10 00:47:35 [12592] server4 cib: info: cib_perform_op: Diff: --- 0.34.68 2
May 10 00:47:35 [12592] server4 cib: info: cib_perform_op: Diff: +++ 0.34.69 (null)
May 10 00:47:35 [12592] server4 cib: info: cib_perform_op: + /cib: @num_updates=69
May 10 00:47:35 [12592] server4 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='dlm']: <lrm_rsc_op id="dlm_monitor_60000" operation_key="dlm_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="15:2:0:377224d5-7e3c-4e55-91ef-3bd5e00ab71e" transition-magic="0:0;15:2:0:377224d5-7e3c-4e55-91ef-3bd5e00ab71e" on_node="server4ha" call-id="35" rc-code="0" op-status="0" interval="6000
May 10 00:47:35 [12592] server4 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=server4ha/crmd/24, version=0.34.69)
May 10 00:47:35 [12597] server4 crmd: info: crmd_notify_complete: Alert 9 (/usr/lib64/vPacemaker/alert_snmp.sh) complete
May 10 00:47:35 [12594] server4 lrmd: info: log_finished: finished - rsc:drbd_data action:notify call_id:36 pid:13017 exit-code:0 exec-time:39ms queue-time:0ms
May 10 00:47:35 [12597] server4 crmd: notice: process_lrm_event: Result of notify operation for drbd_data on server4ha: 0 (ok) | call=36 key=drbd_data_notify_0 confirmed=true cib-update=0
May 10 00:47:35 [12597] server4 crmd: info: crmd_notify_complete: Alert 10 (/usr/lib64/vPacemaker/alert_snmp.sh) complete
May 10 00:47:36 [12597] server4 crmd: info: crm_update_peer_expected: handle_request: Node server7ha[2] - expected state is now down (was member)
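The fencing coincides with corosync forming a new membership, so the two-node quorum settings are relevant to how a rejoining node is treated. For completeness, the quorum section of a pcs-generated corosync.conf for a two-node cluster normally looks like the block below; this is the assumed typical shape, not a paste of my actual file.

corosync.conf quorum section (typical two-node shape, assumed):
---------------------------------------------------------------
quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node: 1 implicitly turns on wait_for_all, so a node that boots
    # alone waits to see its peer before the cluster regains quorum.
}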
Pacemaker status before the reboot:
-------------------------------------------------------
[root@server4 ~]# pcs status
Cluster name: vCluster
Stack: corosync
Current DC: server4ha (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Tue May 9 23:40:36 2017
Last change: Tue May 9 18:28:32 2017 by root via cibadmin on server4ha

2 nodes and 9 resources configured

Online: [ server4ha server7ha ]

Full list of resources:

 vCluster-VirtualIP-10.168.10.199 (ocf::heartbeat:IPaddr2): Started server4ha
 vCluster-Stonith-server4ha (stonith:fence_ipmilan): Started server7ha
 vCluster-Stonith-server7ha (stonith:fence_ipmilan): Started server4ha
 Clone Set: dlm-clone [dlm]
     Started: [ server4ha server7ha ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ server4ha server7ha ]
 Master/Slave Set: drbd_data_clone [drbd_data]
     Masters: [ server4ha server7ha ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Env:
---------
CentOS 7.3
kmod-drbd84-8.4.9-1.el7.elrepo.x86_64
drbd84-utils-8.9.8-1.el7.elrepo.x86_64
pacemaker-cluster-libs-1.1.15-11.el7_3.4.x86_64
pacemaker-1.1.15-11.el7_3.4.x86_64
corosync-2.4.0-4.el7.x86_64
pcs-0.9.152-10.el7.centos.3.x86_64
gfs2-utils-3.1.9-3.el7.x86_64
lvm2-cluster-2.02.166-1.el7_3.4.x86_64

The resource files (vDrbd.res and global_common.conf) are attached.

Thanks,
Raman

Attachments:
  vDrbd.res: <http://lists.linbit.com/pipermail/drbd-user/attachments/20170510/e3a332de/attachment.obj>
  global_common.conf: <http://lists.linbit.com/pipermail/drbd-user/attachments/20170510/e3a332de/attachment-0001.obj>
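P.S. In case it helps reproduce the setup: the two fence_ipmilan stonith resources in the pcs status above are of the kind created with commands roughly like the following. The pcmk_host_list value matches the "static-list" host mapping in the stonith-ng log lines, but the IPMI address and credentials here are placeholders, not my real values.

Stonith resource creation (sketch with placeholder IPMI details):
------------------------------------------------------------------
# One fence device per node, each kept away from the node it fences.
# ipaddr/login/passwd below are placeholders.
pcs stonith create vCluster-Stonith-server7ha fence_ipmilan \
    pcmk_host_list="server7ha" ipaddr="192.168.10.107" \
    login="ipmiuser" passwd="ipmipass" lanplus=1 \
    op monitor interval=60s
pcs constraint location vCluster-Stonith-server7ha avoids server7ha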