Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

I'm back to learning Pacemaker and DRBD 8.4. I've got a couple of VMs with fence_xvm / fence_virtd working (I can crash a node with 'echo c > /proc/sysrq-trigger' and it reboots). However, when I start DRBD (not added to Pacemaker, just running with 'handlers { fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; }' and 'disk { fencing resource-and-stonith; }') and crash the node, DRBD reports that the fence handler is broken, and the resource is split-brained when the node reboots. Below are all the details, but the key lines, I think, are:

====
Sep 29 15:56:50 pcmk1 crm-fence-peer.sh[539]: invoked for r0
Sep 29 15:56:50 pcmk1 crm-fence-peer.sh[539]: WARNING drbd-fencing could not determine the master id of drbd resource r0
Sep 29 15:56:50 pcmk1 kernel: [ 2351.280209] d-con r0: helper command: /sbin/drbdadm fence-peer r0 exit code 1 (0x100)
Sep 29 15:56:50 pcmk1 kernel: [ 2351.280213] d-con r0: fence-peer helper broken, returned 1
====

Here are the extended details. Note that, for reasons that may be related, the fence action from Pacemaker works the first time but fails the second time. However, if I start both nodes clean, start DRBD and crash the node, the fence works but the same DRBD fence error is shown.
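For reference, here is roughly what my r0 config looks like with those fencing bits in place. This is a sketch, not a verbatim copy: only the 'fencing' and 'fence-peer' lines are exactly as quoted above; the backing disk, TCP port and the pcmk2 address are placeholders/assumptions.

====
# /etc/drbd.d/r0.res (sketch; backing disk, port and peer IP are placeholders)
resource r0 {
    disk {
        # Suspend I/O and invoke the fence handler when the peer is lost.
        fencing resource-and-stonith;
    }
    handlers {
        # Asks Pacemaker to constrain/fence the peer on disconnect.
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    }
    on pcmk1.alteeve.ca {
        device    /dev/drbd0;
        disk      /dev/vdb;                 # placeholder
        address   192.168.122.201:7788;     # ring address seen in the logs
        meta-disk internal;
    }
    on pcmk2.alteeve.ca {
        device    /dev/drbd0;
        disk      /dev/vdb;                 # placeholder
        address   192.168.122.202:7788;     # assumed
        meta-disk internal;
    }
}
====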
[root@pcmk1 ~]# pcs config show
====
Cluster Name: an-pcmk-01
Corosync Nodes:
 pcmk1.alteeve.ca pcmk2.alteeve.ca
Pacemaker Nodes:
 pcmk1.alteeve.ca pcmk2.alteeve.ca

Resources:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: corosync
 dc-version: 1.1.9-3.fc19-781a388
 no-quorum-policy: ignore
 stonith-enabled: true
====
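(For completeness: the two stonith devices were created along these lines. A sketch from memory; the fence_xvm 'port' values, i.e. the libvirt domain names, and the host lists are assumptions, not copied from my shell history.)

====
pcs stonith create fence_pcmk1_xvm fence_xvm port="pcmk1" pcmk_host_list="pcmk1.alteeve.ca"
pcs stonith create fence_pcmk2_xvm fence_xvm port="pcmk2" pcmk_host_list="pcmk2.alteeve.ca"
====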
[root@pcmk1 ~]# pcs status
====
Cluster name: an-pcmk-01
Last updated: Sun Sep 29 15:26:15 2013
Last change: Sat Sep 28 15:30:12 2013 via cibadmin on pcmk1.alteeve.ca
Stack: corosync
Current DC: pcmk1.alteeve.ca (1) - partition with quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ pcmk1.alteeve.ca pcmk2.alteeve.ca ]

Full list of resources:

 fence_pcmk1_xvm (stonith:fence_xvm): Started pcmk1.alteeve.ca
 fence_pcmk2_xvm (stonith:fence_xvm): Started pcmk2.alteeve.ca
====

DRBD is not running, crashing 'pcmk2':

====
[root@pcmk2 ~]# echo c > /proc/sysrq-trigger
====

Node 1's system logs:

====
Sep 29 15:27:16 pcmk1 corosync[404]: [TOTEM ] A processor failed, forming new configuration.
Sep 29 15:27:17 pcmk1 corosync[404]: [TOTEM ] A new membership (192.168.122.201:52) was formed. Members left: 2
Sep 29 15:27:17 pcmk1 crmd[422]: warning: match_down_event: No match for shutdown action on 2
Sep 29 15:27:17 pcmk1 crmd[422]: notice: peer_update_callback: Stonith/shutdown of pcmk2.alteeve.ca not matched
Sep 29 15:27:17 pcmk1 crmd[422]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 29 15:27:17 pcmk1 pengine[421]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 15:27:17 pcmk1 pengine[421]: warning: pe_fence_node: Node pcmk2.alteeve.ca will be fenced because our peer process is no longer available
Sep 29 15:27:17 pcmk1 pengine[421]: warning: determine_online_status: Node pcmk2.alteeve.ca is unclean
Sep 29 15:27:17 pcmk1 pengine[421]: warning: stage6: Scheduling Node pcmk2.alteeve.ca for STONITH
Sep 29 15:27:17 pcmk1 pengine[421]: notice: LogActions: Move fence_pcmk2_xvm#011(Started pcmk2.alteeve.ca -> pcmk1.alteeve.ca)
Sep 29 15:27:17 pcmk1 crmd[422]: notice: pcmk_quorum_notification: Membership 52: quorum lost (1)
Sep 29 15:27:17 pcmk1 corosync[404]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 29 15:27:17 pcmk1 corosync[404]: [QUORUM] Members[1]: 1
Sep 29 15:27:17 pcmk1 crmd[422]: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk2.alteeve.ca[2] - state is now lost (was member)
Sep 29 15:27:17 pcmk1 crmd[422]: warning: match_down_event: No match for shutdown action on 2
Sep 29 15:27:17 pcmk1 crmd[422]: notice: peer_update_callback: Stonith/shutdown of pcmk2.alteeve.ca not matched
Sep 29 15:27:17 pcmk1 corosync[404]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 29 15:27:17 pcmk1 pengine[421]: warning: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-warn-1.bz2
Sep 29 15:27:18 pcmk1 pengine[421]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 15:27:18 pcmk1 pengine[421]: warning: pe_fence_node: Node pcmk2.alteeve.ca will be fenced because the node is no longer part of the cluster
Sep 29 15:27:18 pcmk1 pengine[421]: warning: determine_online_status: Node pcmk2.alteeve.ca is unclean
Sep 29 15:27:18 pcmk1 pengine[421]: warning: custom_action: Action fence_pcmk2_xvm_stop_0 on pcmk2.alteeve.ca is unrunnable (offline)
Sep 29 15:27:18 pcmk1 pengine[421]: warning: stage6: Scheduling Node pcmk2.alteeve.ca for STONITH
Sep 29 15:27:18 pcmk1 pengine[421]: notice: LogActions: Move fence_pcmk2_xvm#011(Started pcmk2.alteeve.ca -> pcmk1.alteeve.ca)
Sep 29 15:27:18 pcmk1 pengine[421]: warning: process_pe_message: Calculated Transition 2: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Sep 29 15:27:18 pcmk1 crmd[422]: notice: te_fence_node: Executing reboot fencing operation (9) on pcmk2.alteeve.ca (timeout=60000)
Sep 29 15:27:18 pcmk1 stonith-ng[418]: notice: handle_request: Client crmd.422.e5752159 wants to fence (reboot) 'pcmk2.alteeve.ca' with device '(any)'
Sep 29 15:27:18 pcmk1 stonith-ng[418]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for pcmk2.alteeve.ca: d9e46c20-0f15-417c-bbb2-20b10eed8b4d (0)
Sep 29 15:27:20 pcmk1 stonith-ng[418]: notice: log_operation: Operation 'reboot' [461] (call 2 from crmd.422) for host 'pcmk2.alteeve.ca' with device 'fence_pcmk2_xvm' returned: 0 (OK)
Sep 29 15:27:20 pcmk1 stonith-ng[418]: notice: remote_op_done: Operation reboot of pcmk2.alteeve.ca by pcmk1.alteeve.ca for crmd.422@pcmk1.alteeve.ca.d9e46c20: OK
Sep 29 15:27:20 pcmk1 crmd[422]: notice: tengine_stonith_callback: Stonith operation 2/9:2:0:b91ef09d-b521-4e52-a3fb-bbcf1a0f6a37: OK (0)
Sep 29 15:27:20 pcmk1 crmd[422]: notice: tengine_stonith_notify: Peer pcmk2.alteeve.ca was terminated (reboot) by pcmk1.alteeve.ca for pcmk1.alteeve.ca: OK (ref=d9e46c20-0f15-417c-bbb2-20b10eed8b4d) by client crmd.422
Sep 29 15:27:20 pcmk1 crmd[422]: notice: te_rsc_command: Initiating action 7: start fence_pcmk2_xvm_start_0 on pcmk1.alteeve.ca (local)
Sep 29 15:27:20 pcmk1 stonith-ng[418]: notice: stonith_device_register: Added 'fence_pcmk2_xvm' to the device list (2 active devices)
Sep 29 15:27:21 pcmk1 crmd[422]: notice: process_lrm_event: LRM operation fence_pcmk2_xvm_start_0 (call=16, rc=0, cib-update=43, confirmed=true) ok
Sep 29 15:27:21 pcmk1 crmd[422]: notice: run_graph: Transition 2 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Complete
Sep 29 15:27:21 pcmk1 pengine[421]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 15:27:21 pcmk1 pengine[421]: notice: process_pe_message: Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-16.bz2
Sep 29 15:27:21 pcmk1 crmd[422]: notice: run_graph: Transition 3 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-16.bz2): Complete
Sep 29 15:27:21 pcmk1 crmd[422]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
====

[root@pcmk1 ~]# pcs status
====
Cluster name: an-pcmk-01
Last updated: Sun Sep 29 15:28:06 2013
Last change: Sat Sep 28 15:30:12 2013 via cibadmin on pcmk1.alteeve.ca
Stack: corosync
Current DC: pcmk1.alteeve.ca (1) - partition WITHOUT quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ pcmk1.alteeve.ca ]
OFFLINE: [ pcmk2.alteeve.ca ]

Full list of resources:

 fence_pcmk1_xvm (stonith:fence_xvm): Started pcmk1.alteeve.ca
 fence_pcmk2_xvm (stonith:fence_xvm): Started pcmk1.alteeve.ca
====

So it seems that fencing by Pacemaker works.
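(A sanity check, for anyone reproducing this: fencing can also be triggered by hand rather than by crashing the node. These are standard commands; the libvirt domain name 'pcmk2' is my assumption here.)

====
# Ask stonith-ng to reboot the peer:
stonith_admin --reboot pcmk2.alteeve.ca
# Or exercise the fence agent directly:
fence_xvm -o reboot -H pcmk2
====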
I re-add the pcmk2 node to the cluster and start DRBD:

[root@pcmk1 ~]# pcs status
====
Cluster name: an-pcmk-01
Last updated: Sun Sep 29 15:32:29 2013
Last change: Sat Sep 28 15:30:12 2013 via cibadmin on pcmk1.alteeve.ca
Stack: corosync
Current DC: pcmk1.alteeve.ca (1) - partition with quorum
Version: 1.1.9-3.fc19-781a388
2 Nodes configured, unknown expected votes
2 Resources configured.

Online: [ pcmk1.alteeve.ca pcmk2.alteeve.ca ]

Full list of resources:

 fence_pcmk1_xvm (stonith:fence_xvm): Started pcmk1.alteeve.ca
 fence_pcmk2_xvm (stonith:fence_xvm): Started pcmk2.alteeve.ca
====

Starting DRBD on pcmk1:

====
Sep 29 15:33:09 pcmk1 systemd[1]: Starting LSB: Control drbd resources....
Sep 29 15:33:09 pcmk1 kernel: [ 930.145896] events: mcg drbd: 3
Sep 29 15:33:09 pcmk1 kernel: [ 930.148651] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
Sep 29 15:33:09 pcmk1 kernel: [ 930.148657] drbd: srcversion: 5CF35A4122BF8D21CC12AE2
Sep 29 15:33:09 pcmk1 kernel: [ 930.148658] drbd: registered as block device major 147
Sep 29 15:33:09 pcmk1 drbd[492]: Starting DRBD resources: [
Sep 29 15:33:09 pcmk1 drbd[492]: create res: r0
Sep 29 15:33:09 pcmk1 drbd[492]: prepare disk: r0
Sep 29 15:33:09 pcmk1 kernel: [ 930.165573] d-con r0: Starting worker thread (from drbdsetup [506])
Sep 29 15:33:09 pcmk1 kernel: [ 930.165686] block drbd0: disk( Diskless -> Attaching )
Sep 29 15:33:09 pcmk1 kernel: [ 930.165757] d-con r0: Method to ensure write ordering: drain
Sep 29 15:33:09 pcmk1 kernel: [ 930.165759] block drbd0: max BIO size = 1048576
Sep 29 15:33:09 pcmk1 kernel: [ 930.165762] block drbd0: drbd_bm_resize called with capacity == 25556136
Sep 29 15:33:09 pcmk1 kernel: [ 930.165801] block drbd0: resync bitmap: bits=3194517 words=49915 pages=98
Sep 29 15:33:09 pcmk1 kernel: [ 930.165803] block drbd0: size = 12 GB (12778068 KB)
Sep 29 15:33:09 pcmk1 kernel: [ 930.166971] block drbd0: bitmap READ of 98 pages took 1 jiffies
Sep 29 15:33:09 pcmk1 kernel: [ 930.167177] block drbd0: recounting of set bits took additional 1 jiffies
Sep 29 15:33:09 pcmk1 kernel: [ 930.167179] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Sep 29 15:33:09 pcmk1 kernel: [ 930.167182] block drbd0: disk( Attaching -> Outdated )
Sep 29 15:33:09 pcmk1 kernel: [ 930.167184] block drbd0: attached to UUIDs A74FA07C42A73C34:0000000000000000:D307B78549F98F3B:D306B78549F98F3B
Sep 29 15:33:09 pcmk1 drbd[492]: adjust disk: r0
Sep 29 15:33:09 pcmk1 drbd[492]: adjust net: r0
Sep 29 15:33:09 pcmk1 kernel: [ 930.169649] d-con r0: conn( StandAlone -> Unconnected )
Sep 29 15:33:09 pcmk1 kernel: [ 930.169661] d-con r0: Starting receiver thread (from drbd_w_r0 [507])
Sep 29 15:33:09 pcmk1 kernel: [ 930.169682] d-con r0: receiver (re)started
Sep 29 15:33:09 pcmk1 kernel: [ 930.169689] d-con r0: conn( Unconnected -> WFConnection )
Sep 29 15:33:09 pcmk1 drbd[492]: ]
Sep 29 15:33:09 pcmk1 kernel: [ 930.670525] d-con r0: Handshake successful: Agreed network protocol version 101
Sep 29 15:33:09 pcmk1 kernel: [ 930.670593] d-con r0: conn( WFConnection -> WFReportParams )
Sep 29 15:33:09 pcmk1 kernel: [ 930.670598] d-con r0: Starting asender thread (from drbd_r_r0 [510])
Sep 29 15:33:09 pcmk1 kernel: [ 930.680155] block drbd0: drbd_sync_handshake:
Sep 29 15:33:09 pcmk1 kernel: [ 930.680164] block drbd0: self A74FA07C42A73C34:0000000000000000:D307B78549F98F3B:D306B78549F98F3B bits:0 flags:0
Sep 29 15:33:09 pcmk1 kernel: [ 930.680171] block drbd0: peer BFBB0CC0CCCEA510:A74FA07C42A73C35:D307B78549F98F3B:D306B78549F98F3B bits:0 flags:0
Sep 29 15:33:09 pcmk1 kernel: [ 930.680178] block drbd0: uuid_compare()=-1 by rule 50
Sep 29 15:33:09 pcmk1 kernel: [ 930.680189] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Sep 29 15:33:09 pcmk1 kernel: [ 930.680609] block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Sep 29 15:33:09 pcmk1 kernel: [ 930.680704] block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Sep 29 15:33:09 pcmk1 kernel: [ 930.680710] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Sep 29 15:33:09 pcmk1 kernel: [ 930.682605] block drbd0: updated sync uuid A750A07C42A73C34:0000000000000000:D307B78549F98F3B:D306B78549F98F3B
Sep 29 15:33:09 pcmk1 kernel: [ 930.682727] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Sep 29 15:33:09 pcmk1 kernel: [ 930.683957] block drbd0: role( Secondary -> Primary )
Sep 29 15:33:09 pcmk1 kernel: [ 930.684148] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Sep 29 15:33:09 pcmk1 kernel: [ 930.684761] block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Sep 29 15:33:09 pcmk1 kernel: [ 930.684780] block drbd0: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
Sep 29 15:33:09 pcmk1 kernel: [ 930.684873] block drbd0: peer( Secondary -> Primary )
Sep 29 15:33:09 pcmk1 drbd[492]: WARN: stdin/stdout is not a TTY; using /dev/console.
Sep 29 15:33:09 pcmk1 systemd[1]: Started LSB: Control drbd resources..
Sep 29 15:33:09 pcmk1 kernel: [ 930.685167] block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Sep 29 15:33:09 pcmk1 kernel: [ 930.685171] block drbd0: updated UUIDs BFBB0CC0CCCEA511:0000000000000000:A750A07C42A73C35:A74FA07C42A73C35
Sep 29 15:33:09 pcmk1 kernel: [ 930.685175] block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Sep 29 15:33:09 pcmk1 kernel: [ 930.686606] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
Sep 29 15:33:09 pcmk1 kernel: [ 930.688369] block drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
====

Starting DRBD on pcmk2:

====
Sep 29 15:33:09 pcmk2 systemd[1]: Starting LSB: Control drbd resources....
Sep 29 15:33:09 pcmk2 kernel: [ 342.307173] events: mcg drbd: 3
Sep 29 15:33:09 pcmk2 kernel: [ 342.309495] drbd: initialized. Version: 8.4.3 (api:1/proto:86-101)
Sep 29 15:33:09 pcmk2 kernel: [ 342.309497] drbd: srcversion: 5CF35A4122BF8D21CC12AE2
Sep 29 15:33:09 pcmk2 kernel: [ 342.309498] drbd: registered as block device major 147
Sep 29 15:33:09 pcmk2 drbd[433]: Starting DRBD resources: [
Sep 29 15:33:09 pcmk2 drbd[433]: create res: r0
Sep 29 15:33:09 pcmk2 drbd[433]: prepare disk: r0
Sep 29 15:33:09 pcmk2 kernel: [ 342.325521] d-con r0: Starting worker thread (from drbdsetup [447])
Sep 29 15:33:09 pcmk2 kernel: [ 342.325646] block drbd0: disk( Diskless -> Attaching )
Sep 29 15:33:09 pcmk2 kernel: [ 342.325722] d-con r0: Method to ensure write ordering: drain
Sep 29 15:33:09 pcmk2 kernel: [ 342.325724] block drbd0: max BIO size = 1048576
Sep 29 15:33:09 pcmk2 kernel: [ 342.325728] block drbd0: drbd_bm_resize called with capacity == 25556136
Sep 29 15:33:09 pcmk2 kernel: [ 342.325770] block drbd0: resync bitmap: bits=3194517 words=49915 pages=98
Sep 29 15:33:09 pcmk2 kernel: [ 342.325772] block drbd0: size = 12 GB (12778068 KB)
Sep 29 15:33:09 pcmk2 kernel: [ 342.326865] block drbd0: bitmap READ of 98 pages took 1 jiffies
Sep 29 15:33:09 pcmk2 kernel: [ 342.326916] block drbd0: recounting of set bits took additional 0 jiffies
Sep 29 15:33:09 pcmk2 kernel: [ 342.326917] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Sep 29 15:33:09 pcmk2 kernel: [ 342.326921] block drbd0: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated )
Sep 29 15:33:09 pcmk2 kernel: [ 342.326923] block drbd0: attached to UUIDs BFBB0CC0CCCEA511:A74FA07C42A73C35:D307B78549F98F3B:D306B78549F98F3B
Sep 29 15:33:09 pcmk2 drbd[433]: adjust disk: r0
Sep 29 15:33:09 pcmk2 kernel: [ 342.328489] d-con r0: conn( StandAlone -> Unconnected )
Sep 29 15:33:09 pcmk2 kernel: [ 342.328502] d-con r0: Starting receiver thread (from drbd_w_r0 [448])
Sep 29 15:33:09 pcmk2 kernel: [ 342.328567] d-con r0: receiver (re)started
Sep 29 15:33:09 pcmk2 kernel: [ 342.328574] d-con r0: conn( Unconnected -> WFConnection )
Sep 29 15:33:09 pcmk2 drbd[433]: adjust net: r0
Sep 29 15:33:09 pcmk2 drbd[433]: ]
Sep 29 15:33:10 pcmk2 kernel: [ 343.731229] d-con r0: Handshake successful: Agreed network protocol version 101
Sep 29 15:33:10 pcmk2 kernel: [ 343.731268] d-con r0: conn( WFConnection -> WFReportParams )
Sep 29 15:33:10 pcmk2 kernel: [ 343.731271] d-con r0: Starting asender thread (from drbd_r_r0 [451])
Sep 29 15:33:10 pcmk2 kernel: [ 343.741130] block drbd0: drbd_sync_handshake:
Sep 29 15:33:10 pcmk2 kernel: [ 343.741140] block drbd0: self BFBB0CC0CCCEA510:A74FA07C42A73C35:D307B78549F98F3B:D306B78549F98F3B bits:0 flags:0
Sep 29 15:33:10 pcmk2 kernel: [ 343.741146] block drbd0: peer A74FA07C42A73C34:0000000000000000:D307B78549F98F3B:D306B78549F98F3B bits:0 flags:0
Sep 29 15:33:10 pcmk2 kernel: [ 343.741152] block drbd0: uuid_compare()=1 by rule 70
Sep 29 15:33:10 pcmk2 kernel: [ 343.741163] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
Sep 29 15:33:10 pcmk2 kernel: [ 343.741315] block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Sep 29 15:33:10 pcmk2 kernel: [ 343.741845] block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Sep 29 15:33:10 pcmk2 kernel: [ 343.741852] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Sep 29 15:33:10 pcmk2 kernel: [ 343.743229] block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Sep 29 15:33:10 pcmk2 kernel: [ 343.743245] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Sep 29 15:33:10 pcmk2 kernel: [ 343.743253] block drbd0: Began resync as SyncSource (will sync 0 KB [0 bits set]).
Sep 29 15:33:10 pcmk2 kernel: [ 343.743288] block drbd0: updated sync UUID BFBB0CC0CCCEA510:A750A07C42A73C35:A74FA07C42A73C35:D307B78549F98F3B
Sep 29 15:33:10 pcmk2 kernel: [ 343.744829] block drbd0: peer( Secondary -> Primary )
Sep 29 15:33:10 pcmk2 kernel: [ 343.745923] block drbd0: role( Secondary -> Primary )
Sep 29 15:33:10 pcmk2 drbd[433]: WARN: stdin/stdout is not a TTY; using /dev/console.
Sep 29 15:33:10 pcmk2 systemd[1]: Started LSB: Control drbd resources..
Sep 29 15:33:10 pcmk2 kernel: [ 343.750704] block drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Sep 29 15:33:10 pcmk2 kernel: [ 343.750710] block drbd0: updated UUIDs BFBB0CC0CCCEA511:0000000000000000:A750A07C42A73C35:A74FA07C42A73C35
Sep 29 15:33:10 pcmk2 kernel: [ 343.750716] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
====

Showing /proc/drbd:

[root@pcmk1 ~]# cat /proc/drbd
====
version: 8.4.3 (api:1/proto:86-101)
srcversion: 5CF35A4122BF8D21CC12AE2
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:152 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
====
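(The same state as reported by the userland tools; commands shown for reference only, I didn't capture their output:)

====
drbdadm role r0      # expect Primary/Primary
drbdadm cstate r0    # expect Connected
drbdadm dstate r0    # expect UpToDate/UpToDate
====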
Now when I crash pcmk2 again, it gets fenced, but DRBD thinks the fence fails:

====
Sep 29 15:56:50 pcmk1 corosync[404]: [TOTEM ] A processor failed, forming new configuration.
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189045] d-con r0: PingAck did not arrive in time.
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189086] d-con r0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189251] d-con r0: asender terminated
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189253] d-con r0: Terminating drbd_a_r0
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189317] d-con r0: Connection closed
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189499] d-con r0: conn( NetworkFailure -> Unconnected )
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189501] d-con r0: receiver terminated
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189502] d-con r0: Restarting receiver thread
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189503] d-con r0: receiver (re)started
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189508] d-con r0: conn( Unconnected -> WFConnection )
Sep 29 15:56:50 pcmk1 kernel: [ 2351.189523] d-con r0: helper command: /sbin/drbdadm fence-peer r0
Sep 29 15:56:50 pcmk1 crm-fence-peer.sh[539]: invoked for r0
Sep 29 15:56:50 pcmk1 crm-fence-peer.sh[539]: WARNING drbd-fencing could not determine the master id of drbd resource r0
Sep 29 15:56:50 pcmk1 kernel: [ 2351.280209] d-con r0: helper command: /sbin/drbdadm fence-peer r0 exit code 1 (0x100)
Sep 29 15:56:50 pcmk1 kernel: [ 2351.280213] d-con r0: fence-peer helper broken, returned 1
Sep 29 15:56:51 pcmk1 corosync[404]: [TOTEM ] A new membership (192.168.122.201:60) was formed. Members left: 2
Sep 29 15:56:51 pcmk1 crmd[422]: warning: match_down_event: No match for shutdown action on 2
Sep 29 15:56:51 pcmk1 crmd[422]: notice: peer_update_callback: Stonith/shutdown of pcmk2.alteeve.ca not matched
Sep 29 15:56:51 pcmk1 crmd[422]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Sep 29 15:56:51 pcmk1 corosync[404]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Sep 29 15:56:51 pcmk1 corosync[404]: [QUORUM] Members[1]: 1
Sep 29 15:56:51 pcmk1 pengine[421]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 15:56:51 pcmk1 pengine[421]: warning: pe_fence_node: Node pcmk2.alteeve.ca will be fenced because our peer process is no longer available
Sep 29 15:56:51 pcmk1 pengine[421]: warning: determine_online_status: Node pcmk2.alteeve.ca is unclean
Sep 29 15:56:51 pcmk1 pengine[421]: warning: stage6: Scheduling Node pcmk2.alteeve.ca for STONITH
Sep 29 15:56:51 pcmk1 pengine[421]: notice: LogActions: Move fence_pcmk2_xvm#011(Started pcmk2.alteeve.ca -> pcmk1.alteeve.ca)
Sep 29 15:56:51 pcmk1 pengine[421]: warning: process_pe_message: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-warn-3.bz2
Sep 29 15:56:51 pcmk1 corosync[404]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 29 15:56:51 pcmk1 crmd[422]: notice: pcmk_quorum_notification: Membership 60: quorum lost (1)
Sep 29 15:56:51 pcmk1 crmd[422]: notice: crm_update_peer_state: pcmk_quorum_notification: Node pcmk2.alteeve.ca[2] - state is now lost (was member)
Sep 29 15:56:51 pcmk1 crmd[422]: warning: match_down_event: No match for shutdown action on 2
Sep 29 15:56:51 pcmk1 crmd[422]: notice: peer_update_callback: Stonith/shutdown of pcmk2.alteeve.ca not matched
Sep 29 15:56:52 pcmk1 pengine[421]: notice: unpack_config: On loss of CCM Quorum: Ignore
Sep 29 15:56:52 pcmk1 pengine[421]: warning: pe_fence_node: Node pcmk2.alteeve.ca will be fenced because the node is no longer part of the cluster
Sep 29 15:56:52 pcmk1 pengine[421]: warning: determine_online_status: Node pcmk2.alteeve.ca is unclean
Sep 29 15:56:52 pcmk1 pengine[421]: warning: custom_action: Action fence_pcmk2_xvm_stop_0 on pcmk2.alteeve.ca is unrunnable (offline)
Sep 29 15:56:52 pcmk1 pengine[421]: warning: stage6: Scheduling Node pcmk2.alteeve.ca for STONITH
Sep 29 15:56:52 pcmk1 pengine[421]: notice: LogActions: Move fence_pcmk2_xvm#011(Started pcmk2.alteeve.ca -> pcmk1.alteeve.ca)
Sep 29 15:56:52 pcmk1 pengine[421]: warning: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-warn-4.bz2
Sep 29 15:56:52 pcmk1 crmd[422]: notice: te_fence_node: Executing reboot fencing operation (9) on pcmk2.alteeve.ca (timeout=60000)
Sep 29 15:56:52 pcmk1 stonith-ng[418]: notice: handle_request: Client crmd.422.e5752159 wants to fence (reboot) 'pcmk2.alteeve.ca' with device '(any)'
Sep 29 15:56:52 pcmk1 stonith-ng[418]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for pcmk2.alteeve.ca: f09f32c4-e726-464b-9065-56346f7c047c (0)
Sep 29 15:56:52 pcmk1 stonith-ng[418]: error: remote_op_done: Operation reboot of pcmk2.alteeve.ca by pcmk1.alteeve.ca for crmd.422@pcmk1.alteeve.ca.f09f32c4: No such device
Sep 29 15:56:52 pcmk1 crmd[422]: notice: tengine_stonith_callback: Stonith operation 3/9:7:0:b91ef09d-b521-4e52-a3fb-bbcf1a0f6a37: No such device (-19)
Sep 29 15:56:52 pcmk1 crmd[422]: notice: tengine_stonith_callback: Stonith operation 3 for pcmk2.alteeve.ca failed (No such device): aborting transition.
Sep 29 15:56:52 pcmk1 crmd[422]: notice: tengine_stonith_notify: Peer pcmk2.alteeve.ca was not terminated (reboot) by pcmk1.alteeve.ca for pcmk1.alteeve.ca: No such device (ref=f09f32c4-e726-464b-9065-56346f7c047c) by client crmd.422
Sep 29 15:56:52 pcmk1 crmd[422]: notice: run_graph: Transition 7 (Complete=1, Pending=0, Fired=0, Skipped=4, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-4.bz2): Stopped
Sep 29 15:56:52 pcmk1 crmd[422]: notice: too_many_st_failures: No devices found in cluster to fence pcmk2.alteeve.ca, giving up
Sep 29 15:56:52 pcmk1 crmd[422]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
====
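My best guess at the WARNING above: crm-fence-peer.sh appears to look in the CIB for a master/slave resource managing r0, so that it can place a location constraint against its Master role, and since DRBD isn't under Pacemaker control here, there is nothing for it to find. For the record, when I do hand r0 to Pacemaker, I would expect the resource to look roughly like this (an untested sketch; the resource names are mine, and master-max=2 because I'm running dual-primary):

====
pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 op monitor interval=30s
pcs resource master ms_drbd_r0 drbd_r0 master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
====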
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?