<div dir="ltr">Raman,<br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Mar 25, 2017 at 12:07 AM, Raman Gupta <span dir="ltr"><<a href="mailto:ramangupta16@gmail.com" target="_blank">ramangupta16@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi,</div><div><br></div><div>Thanks for looking into this issue. Here is my 'pcs status' and attached is cib.xml pacemaker file</div><div><br></div><div><div>[root@server4 cib]# pcs status</div><div>Cluster name: vCluster</div><div>Stack: corosync</div><div>Current DC: server7ha (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum</div><div>Last updated: Fri Mar 24 18:33:05 2017 Last change: Wed Mar 22 13:22:19 2017 by root via cibadmin on server7ha</div><div><br></div><div>2 nodes and 7 resources configured</div><div><br></div><div>Online: [ server4ha server7ha ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> vCluster-VirtualIP-10.168.10.<wbr>199 (ocf::heartbeat:IPaddr2): Started server7ha</div><div> vCluster-Stonith-server7ha (stonith:fence_ipmilan): Started server4ha</div><div> vCluster-Stonith-server4ha (stonith:fence_ipmilan): Started server7ha</div><div> Clone Set: dlm-clone [dlm]</div><div> Started: [ server4ha server7ha ]</div><div> Clone Set: clvmd-clone [clvmd]</div><div> Started: [ server4ha server7ha ]</div><div><br></div><div>Daemon Status:</div><div> corosync: active/disabled</div><div> pacemaker: active/disabled</div><div> pcsd: active/enabled</div><div>[root@server4 cib]# </div></div><div><br></div></div></blockquote><div><br></div><div>This shows the problem: you have not configured any DRBD resource in Pacemaker, so Pacemaker has no knowledge of, or control over, DRBD.</div><div><br></div><div>This is from one of my clusters:</div><div><br></div><div><div>Online: [ sl01 sl02 ]</div><div><br></div><div> p_fence_sl01<span class="gmail-Apple-tab-span" 
style="white-space:pre">        </span>(stonith:fence_ipmilan):<span class="gmail-Apple-tab-span" style="white-space:pre">        </span>Started sl02 </div><div> p_fence_sl02<span class="gmail-Apple-tab-span" style="white-space:pre">        </span>(stonith:fence_ipmilan):<span class="gmail-Apple-tab-span" style="white-space:pre">        </span>Started sl01 </div><div><b> Master/Slave Set: ms_drbd [p_drbd_r0]</b></div><div><b> Masters: [ sl01 sl02 ]</b></div><div> Clone Set: cl_dlm [p_controld]</div><div> Started: [ sl01 sl02 ]</div><div> Clone Set: cl_fs_gfs2 [p_fs_gfs2]</div><div> Started: [ sl01 sl02 ]</div></div><div> </div><div>The resources you are missing are highlighted in bold: you have not configured DRBD and its master/slave (MS) resource, nor the corresponding colocation and ordering constraints. So the "resource-and-stonith" fence-peer hook in your DRBD config will never work; Pacemaker does not know about any DRBD resources.</div><div><br></div><div>This is from one of my production clusters; it is on Ubuntu, so it uses crm rather than pcs, and I'm not using cLVM, just DLM:</div><div><br></div><div><div>primitive p_controld ocf:pacemaker:controld \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op monitor interval="60" timeout="60" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op start interval="0" timeout="90" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op stop interval="0" timeout="100" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>params daemon="dlm_controld" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>meta target-role="Started"</div><div><b>primitive p_drbd_r0 ocf:linbit:drbd \</b></div><div><b><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>params drbd_resource="r0" adjust_master_score="0 10 1000 10000" \</b></div><div><b><span class="gmail-Apple-tab-span" 
style="white-space:pre">        </span>op monitor interval="10" role="Master" \</b></div><div><b><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op monitor interval="20" role="Slave" \</b></div><div><b><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op start interval="0" timeout="240" \</b></div><div><b><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op stop interval="0" timeout="100"</b></div><div><div><b>ms ms_drbd p_drbd_r0 \</b></div><div><b><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true" target-role="Started"</b></div></div><div>primitive p_fs_gfs2 ocf:heartbeat:Filesystem \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>params device="/dev/drbd0" directory="/data" fstype="gfs2" options="_netdev,noatime,rw,acl" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op monitor interval="20" timeout="40" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op start interval="0" timeout="60" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>op stop interval="0" timeout="60" \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>meta is-managed="true"</div><div>clone cl_dlm p_controld \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>meta globally-unique="false" interleave="true" target-role="Started"</div><div>clone cl_fs_gfs2 p_fs_gfs2 \</div><div><span class="gmail-Apple-tab-span" style="white-space:pre">        </span>meta globally-unique="false" interleave="true" ordered="true" target-role="Started"</div><div>colocation cl_fs_gfs2_dlm inf: cl_fs_gfs2 cl_dlm</div><div><b>colocation co_drbd_dlm inf: cl_dlm ms_drbd:Master</b></div><div>order o_dlm_fs_gfs2 
inf: cl_dlm:start cl_fs_gfs2:start</div><div><b>order o_drbd_dlm_fs_gfs2 inf: ms_drbd:promote cl_dlm:start cl_fs_gfs2:start</b></div></div><div><br></div><div>I have excluded the fencing configuration for brevity and highlighted the resources you are missing in bold. Check the rest as well, though; you may find something you can use or cross-check against your own config.</div><div><br></div><div>Thanks also to Digimer for the very useful information (as always) he contributed explaining how these things actually work. </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div></div></div><div class="gmail-HOEnZb"><div class="gmail-h5"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 24, 2017 at 1:49 PM, Raman Gupta <span dir="ltr"><<a href="mailto:ramangupta16@gmail.com" target="_blank">ramangupta16@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi All,<div><br></div><div><div>I am having a problem where if in GFS2 dual-Primary-DRBD Pacemaker Cluster, a node crashes then the running node hangs! The CLVM commands hang, the libvirt VM on running node hangs. </div><div><br></div><div>Env:</div><div>---------</div><div>CentOS 7.3</div><div>DRBD 8.4 </div><div>gfs2-utils-3.1.9-3.el7.x86_64<br></div><div>Pacemaker 1.1.15-11.el7_3.4<br></div><div>corosync-2.4.0-4.el7.x86_64<br></div><div><br></div><div><br></div><div>Infrastructure:</div><div>------------------------</div><div><div>1) Running A 2 node Pacemaker Cluster with proper fencing between the two. 
Nodes are server4 and server7.</div><div><br></div><div>2) Running DRBD dual-Primary and hosting GFS2 filesystem.</div><div><br></div><div>3) Pacemaker has DLM and cLVM resources configured among others.</div><div><br></div><div>4) A KVM/QEMU virtual machine is running on server4 which is holding the cluster resources.</div><div><br></div></div><div><br></div><div>Normal:</div><div>------------</div><div>5) In normal condition when the two nodes are completely UP then things are fine. The DRBD dual-primary works fine. The disk of VM is hosted on DRBD mount directory /backup and VM runs fine with Live Migration happily happening between the 2 nodes.</div><div><br></div><div><br></div><div>Problem:</div><div>----------------</div><div>6) Stop server7 [shutdown -h now] ---> LVM commands like pvdisplay hangs, VM runs only for 120s ---> After 120s DRBD/GFS2 panics (/var/log/messages below) in server4 and DRBD mount directory (/backup) becomes unavailable and VM hangs in server4. The DRBD though is fine on server4 and in Primary/Secondary mode in WFConnection state.<br></div><div><br></div><div>Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd</div><div>Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing could not determine the master id of drbd resource vDrbd</div><div>Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm fence-peer vDrbd exit code 1 (0x100)</div><div>Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken, returned 1</div><div>Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked for more than 120 seconds.</div><div>Mar 24 11:32:01 server4 kernel: "echo 0 > /proc/sys/kernel/hung_task_tim<wbr>eout_secs" disables this message.</div><div>Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0 822 2 0x00000080</div><div>Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue glock_work_func [gfs2]</div><div>Mar 24 11:32:01 server4 kernel: ffff88027674bb10 
0000000000000046 ffff8802736e9f60 ffff88027674bfd8</div><div>Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8 ffff8802736e9f60 ffff8804757ef808</div><div>Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28 ffff8804757ef800 ffff880473796c18</div><div>Mar 24 11:32:01 server4 kernel: Call Trace:</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>] drbd_make_request+0x2a4/0x380 [drbd]</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ? aes_decrypt+0x260/0xe10</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>] generic_make_request+0x109/0x1<wbr>e0</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>] submit_bio+0x71/0x150</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>] gfs2_meta_read+0x121/0x2a0 [gfs2]</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>] gfs2_meta_indirect_buffer+0x62<wbr>/0x150 [gfs2]</div><div>Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ? load_balance+0x192/0x990</div><div><br></div><div>7) After server7 is UP, Pacemaker Cluster is started, DRBD started and Logical Volume activated and only after that in server4 the DRBD mount directory (/backup) becomes available and VM resumes in server4. So after server7 is down and till it is completely UP the VM in server4 hangs.</div></div><div><br></div><div><br></div><div>Can anyone help how to avoid running node hang when other node crashes?</div><div><br></div><div><br></div><div>Attaching DRBD config file.</div><span class="gmail-m_7105576696870626311HOEnZb"><font color="#888888"><div><br></div><div><br></div><div>--Raman</div><div><br></div></font></span></div>
</blockquote></div><br></div>
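<div><br></div><div>P.S. Since your cluster is on CentOS and managed with pcs, the missing pieces would translate roughly as below. This is an untested sketch, not a drop-in configuration: drbd_resource="vDrbd" is taken from your fence-peer log, the clone name dlm-clone from your 'pcs status' output, and the primitive/MS names (p_drbd_vDrbd, ms_drbd) are just examples; adjust timeouts and names to your environment.</div><div><br></div><pre>pcs resource create p_drbd_vDrbd ocf:linbit:drbd drbd_resource=vDrbd \
    op monitor interval=10s role=Master \
    op monitor interval=20s role=Slave \
    op start timeout=240 op stop timeout=100
pcs resource master ms_drbd p_drbd_vDrbd \
    master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 \
    notify=true interleave=true
pcs constraint colocation add dlm-clone with master ms_drbd INFINITY
pcs constraint order promote ms_drbd then start dlm-clone</pre><div><br></div><div>With an MS resource like this in place, the fence-peer handler can locate the master and Pacemaker can fence the lost node, instead of GFS2 blocking indefinitely as you saw.</div>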
</div></div><br>______________________________<wbr>_________________<br>
drbd-user mailing list<br>
<a href="mailto:drbd-user@lists.linbit.com">drbd-user@lists.linbit.com</a><br>
<a href="http://lists.linbit.com/mailman/listinfo/drbd-user" rel="noreferrer" target="_blank">http://lists.linbit.com/<wbr>mailman/listinfo/drbd-user</a><br>
<br></blockquote></div><br></div></div>