<div dir="ltr">> <span style="font-size:12.8px">I don't think this has anything to do with DRBD, because:</span><br style="font-size:12.8px"><div><span style="font-size:12.8px">OK. </span></div><div><br></div><div>> <span style="font-size:12.8px">Apparently, something downed the NICs for corosync communication.</span></div><span style="font-size:12.8px">> Which then leads to fencing.</span><br style="font-size:12.8px"><span style="font-size:12.8px">No problem with NICs. </span><div><span style="font-size:12.8px"></span><br style="font-size:12.8px"><span style="font-size:12.8px">> Maybe you should double check your network configuration,</span><br style="font-size:12.8px"><span style="font-size:12.8px">> and any automagic reconfiguration of the network,</span><br style="font-size:12.8px"><span style="font-size:12.8px">> and only start corosync once your network is "stable"?</span><div><div><span style="font-size:12.8px">As another manifestation of similar problem of dual-Primary DRBD integrated with stonith enabled Pacemaker: When server7 goes down, </span><span style="font-size:12.8px">the DRBD resource on </span><span style="font-size:12.8px">surviving node server4 is attempted to be demoted as secondary. The demotion fails because DRBD is hosting a GFS2 volume and Pacemaker </span><span style="font-size:12.8px">complains of this failure as an error. When server7 comes back UP again </span><span style="font-size:12.8px">server4 is </span><span style="font-size:12.8px">stonith'd probably because of this error reported by pacemaker. </span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Thus problem I am facing is that when either node crashes:</span></div><div><span style="font-size:12.8px">) Surviving node behaves strangely like in this case DRBD is attempted to be demoted and failure to demote is flagged as Pacemaker error.</span></div><div><span style="font-size:12.8px">) When crashed nodes comes back UP then surviving node is stonith'd. 

Logs on surviving server4 node when server7 went down:
---------------------------------------------------------
May 11 13:54:55 server4 crmd[4032]: notice: State transition S_IDLE -> S_POLICY_ENGINE
May 11 13:54:55 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm fence-peer vDrbd exit code 5 (0x500)
May 11 13:54:55 server4 kernel: drbd vDrbd: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
May 11 13:54:55 server4 kernel: drbd vDrbd: pdsk( DUnknown -> Outdated )
May 11 13:54:55 server4 kernel: block drbd0: new current UUID 8D5135C7BB88B6BF:CB0611075C849723:FAA83EE5FF5969E7:0000000000000004
May 11 13:54:55 server4 kernel: drbd vDrbd: susp( 1 -> 0 )
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Looking at journal...
May 11 13:54:55 server4 crm-fence-peer.sh[17468]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-vDrbd-drbd_data_clone'
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Acquiring the transaction lock...
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Replaying journal...
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Replayed 4 of 13 blocks
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Found 4 revoke tags
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Journal replayed in 1s
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: recover jid 1 result success
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: jid=1: Done
May 11 13:54:55 server4 kernel: GFS2: fsid=vCluster:vGFS2.0: recover generation 3 done
May 11 13:54:55 server4 pengine[4031]: notice: On loss of CCM Quorum: Ignore
May 11 13:54:55 server4 pengine[4031]: notice: Demote drbd_data:0#011(Master -> Slave server4ha)
May 11 13:54:55 server4 pengine[4031]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-39.bz2
May 11 13:54:55 server4 crmd[4032]: notice: Initiating notify operation drbd_data_pre_notify_demote_0 locally on server4ha
May 11 13:54:55 server4 crmd[4032]: notice: Result of notify operation for drbd_data on server4ha: 0 (ok)
May 11 13:54:55 server4 crmd[4032]: notice: Initiating demote operation drbd_data_demote_0 locally on server4ha
May 11 13:54:55 server4 kernel: block drbd0: State change failed: Device is held open by someone
May 11 13:54:55 server4 kernel: block drbd0: state = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated r----- }
May 11 13:54:55 server4 kernel: block drbd0: wanted = { cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Outdated r----- }
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Called drbdadm -c /etc/drbd.conf secondary vDrbd
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Exit code 11
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Command output:
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Called drbdadm -c /etc/drbd.conf secondary vDrbd
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Exit code 11
May 11 13:54:55 server4 drbd(drbd_data)[18625]: ERROR: vDrbd: Command output:
May 11 13:54:55 server4 kernel: block drbd0: State change failed: Device is held open by someone

May 11 13:58:06 server4 lrmd[4029]: notice: drbd_data_stop_0:21199:stderr [ Command 'drbdsetup-84 secondary 0' terminated with exit code 11 ]
May 11 13:58:06 server4 lrmd[4029]: notice: drbd_data_stop_0:21199:stderr [ 0: State change failed: (-12) Device is held open by someone ]
May 11 13:58:06 server4 lrmd[4029]: notice: drbd_data_stop_0:21199:stderr [ Command 'drbdsetup-84 secondary 0' terminated with exit code 11 ]
May 11 13:58:06 server4 crmd[4032]: error: Result of stop operation for drbd_data on server4ha: Timed Out
May 11 13:58:06 server4 crmd[4032]: notice: server4ha-drbd_data_stop_0:42 [ 0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 secondary 0' terminated with exit code 11\n0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 secondary 0' terminated with exit code 11\n0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 secondary 0' terminated with exit code 11\n0: State change failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84 seco
May 11 13:58:06 server4 crmd[4032]: warning: Action 5 (drbd_data_stop_0) on server4ha failed (target: 0 vs. rc: 1): Error
May 11 13:58:06 server4 crmd[4032]: notice: Transition aborted by operation drbd_data_stop_0 'modify' on server4ha: Event failed
May 11 13:58:06 server4 crmd[4032]: warning: Action 5 (drbd_data_stop_0) on server4ha failed (target: 0 vs. rc: 1): Error
May 11 13:58:06 server4 crmd[4032]: notice: Transition 3 (Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-40.bz2): Complete
May 11 13:58:06 server4 pengine[4031]: notice: On loss of CCM Quorum: Ignore
May 11 13:58:06 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 13:58:06 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 13:58:06 server4 pengine[4031]: warning: Node server4ha will be fenced because of resource failure(s)
May 11 13:58:06 server4 pengine[4031]: warning: Forcing drbd_data_clone away from server4ha after 1000000 failures (max=1)
May 11 13:58:06 server4 pengine[4031]: warning: Forcing drbd_data_clone away from server4ha after 1000000 failures (max=1)
May 11 13:58:06 server4 pengine[4031]: warning: Scheduling Node server4ha for STONITH
May 11 13:58:06 server4 pengine[4031]: notice: Stop of failed resource drbd_data:0 is implicit after server4ha is fenced
May 11 13:58:06 server4 pengine[4031]: notice: Stop vCluster-VirtualIP-10.168.10.199#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop vCluster-Stonith-server7ha#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop dlm:0#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop clvmd:0#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: notice: Stop drbd_data:0#011(server4ha)
May 11 13:58:06 server4 pengine[4031]: warning: Calculated transition 4 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2
May 11 13:58:06 server4 crmd[4032]: notice: Requesting fencing (reboot) of node server4ha
May 11 13:58:06 server4 stonith-ng[4028]: notice: Client crmd.4032.582f53c4 wants to fence (reboot) 'server4ha' with device '(any)'
May 11 13:58:06 server4 stonith-ng[4028]: notice: Requesting peer fencing (reboot) of server4ha

May 11 14:13:06 server4 stonith-ng[4028]: notice: Client crmd.4032.582f53c4 wants to fence (reboot) 'server4ha' with device '(any)'
May 11 14:13:06 server4 stonith-ng[4028]: notice: Requesting peer fencing (reboot) of server4ha
May 11 14:13:06 server4 stonith-ng[4028]: notice: vCluster-Stonith-server7ha can not fence (reboot) server4ha: static-list
May 11 14:13:06 server4 stonith-ng[4028]: notice: Couldn't find anyone to fence (reboot) server4ha with any device
May 11 14:13:06 server4 stonith-ng[4028]: error: Operation reboot of server4ha by <no-one> for crmd.4032@server4ha.09ace2fb: No such device

After server7 comes up, server4 requests itself to be stonith'd:
--------------------------------------------------------------------------------------------------------
May 11 14:18:40 server4 corosync[4010]: [TOTEM ] A new membership (192.168.11.100:84) was formed. Members joined: 2
May 11 14:18:40 server4 corosync[4010]: [QUORUM] Members[2]: 1 2
May 11 14:18:40 server4 corosync[4010]: [MAIN ] Completed service synchronization, ready to provide service.
May 11 14:18:40 server4 crmd[4032]: notice: Node server7ha state is now member
May 11 14:18:40 server4 pacemakerd[4026]: notice: Node server7ha state is now member
May 11 14:18:43 server4 crmd[4032]: notice: Syncing the Cluster Information Base from server7ha to rest of cluster
May 11 14:18:45 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 14:18:45 server4 pengine[4031]: warning: Processing failed op stop for drbd_data:0 on server4ha: unknown error (1)
May 11 14:18:45 server4 pengine[4031]: warning: Node server4ha will be fenced because of resource failure(s)
May 11 14:18:45 server4 pengine[4031]: warning: Forcing drbd_data_clone away from server4ha after 1000000 failures (max=1)
May 11 14:18:45 server4 pengine[4031]: warning: Scheduling Node server4ha for STONITH
May 11 14:18:45 server4 pengine[4031]: notice: Stop of failed resource drbd_data:0 is implicit after server4ha is fenced
May 11 14:18:45 server4 pengine[4031]: notice: Move vCluster-VirtualIP-10.168.10.199#011(Started server4ha -> server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Start vCluster-Stonith-server4ha#011(server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Stop vCluster-Stonith-server7ha#011(server4ha)
May 11 14:18:45 server4 pengine[4031]: notice: Move dlm:0#011(Started server4ha -> server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Move clvmd:0#011(Started server4ha -> server7ha)
May 11 14:18:45 server4 pengine[4031]: notice: Stop drbd_data:0#011(server4ha)
May 11 14:18:45 server4 pengine[4031]: notice: Start drbd_data:1#011(server7ha)
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation vCluster-VirtualIP-10.168.10.199_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation vCluster-Stonith-server4ha_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation vCluster-Stonith-server7ha_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation dlm_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation clvmd_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating monitor operation drbd_data:1_monitor_0 on server7ha
May 11 14:18:45 server4 crmd[4032]: notice: Requesting fencing (reboot) of node server4ha
May 11 14:18:45 server4 stonith-ng[4028]: notice: Client crmd.4032.582f53c4 wants to fence (reboot) 'server4ha' with device '(any)'
May 11 14:18:45 server4 stonith-ng[4028]: notice: Requesting peer fencing (reboot) of server4ha
May 11 14:18:45 server4 crmd[4032]: notice: Initiating start operation vCluster-Stonith-server4ha_start_0 on server7ha
May 11 14:18:45 server4 systemd-logind: Power key pressed.
May 11 14:18:45 server4 systemd-logind: Powering Off...
May 11 14:18:45 server4 systemd-logind: System is powering down.


What I cannot understand is why, when DRBD is Primary on both nodes, it gets demoted on the surviving node at all when the other node goes down. Is that meant to avoid split brain? This demotion is causing problems, because I want the surviving node to remain Primary rather than be demoted to Secondary: the GFS2 volume on top of DRBD hosts my VM.
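
The demotion is scheduled right after crm-fence-peer.sh places the 'drbd-fence-by-handler-vDrbd-drbd_data_clone' constraint (the 13:54:55 log line above), so I have been looking at that constraint. For inspecting it, and clearing it by hand if needed, something along these lines should work (a sketch; the constraint id is taken from that log line, and as far as I understand the after-resync-target handler crm-unfence-peer.sh is what normally removes it after a successful resync):

# list all constraints with ids and find the fence constraint
pcs constraint --full | grep drbd-fence-by-handler
# dump the constraint straight from the CIB
cibadmin --query --xpath "//rsc_location[@id='drbd-fence-by-handler-vDrbd-drbd_data_clone']"
# remove it manually once both nodes are UpToDate again
pcs constraint remove drbd-fence-by-handler-vDrbd-drbd_data_clone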

DRBD/Pacemaker integration commands used:
--------------------------------------------------------------------------
pcs -f drbd_cfg resource create drbd_data ocf:linbit:drbd drbd_resource=${DRBD_RESOURCE_NAME} op monitor interval=60s
pcs -f drbd_cfg resource master drbd_data_clone drbd_data master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 \
 notify=true interleave=true target-role=Started

Behaviour I want:
-------------------------------------
After successfully creating the Pacemaker 2-node cluster with dual-Primary DRBD together with cLVM/DLM/GFS2:
1) If either node goes down, the other node stays up with no disruption to DRBD or to the other resources.
2) When the crashed node comes up again and rejoins the cluster, it should do so seamlessly, with no disruption to any of the resources.

Any ideas on the Pacemaker and/or DRBD configuration needed to achieve this would be helpful.


Thanks,
Raman

On Wed, May 10, 2017 at 7:01 PM, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

On Wed, May 10, 2017 at 02:07:45AM +0530, Raman Gupta wrote:
> Hi,
>
> In a Pacemaker 2 node cluster with dual-Primary DRBD(drbd84) with
> GFS2/DLM/CLVM setup following issue happens:
>
> Steps:
> ---------
> 1) Successfully created Pacemaker 2 node cluster with DRBD master/slave
> resources integrated.
> 2) Cluster nodes: server4 and server7
> 3) The server4 node is rebooted.
> 4) When server4 comes Up the server7 is stonith'd and is lost! The node
> server4 survives.
>
> Problem:
> -----------
> Problem is #4 above, when server4 comes up why server7 is stonith'd?
>
> From surviving node server4 the DRBD logs seems to be OK: DRBD has moved to
> Connected/UpToDate state. Suddenly server7 is rebooted (stonithd/fenced)
> between time 00:47:35 <--> 00:47:42 in below logs.

I don't think this has anything to do with DRBD, because:

>
> /var/log/messages@server4
> ------------------------------------------------
>
> May 10 00:47:41 server4 kernel: tg3 0000:02:00.1 em4: Link is down
> May 10 00:47:42 server4 kernel: tg3 0000:02:00.0 em3: Link is down
> May 10 00:47:42 server4 corosync[12570]: [TOTEM ] A processor failed,
> forming new configuration.
> May 10 00:47:43 server4 stonith-ng[12593]: notice: Operation 'reboot'
> [13018] (call 2 from crmd.13562) for host 'server7ha' with device
> 'vCluster-Stonith-server7ha' returned: 0 (OK)

There.

Apparently, something downed the NICs for corosync communication.
Which then leads to fencing.

Maybe you should double check your network configuration,
and any automagic reconfiguration of the network,
and only start corosync once your network is "stable"?


--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user