<div dir="ltr">Hi,<div><br></div><div>I was able to integrate DRBD with Pacemaker and my problem was solved. After this no hang was observed in running after other node was shutdown. DRBD was integrated as Master/Slave resource in Pacemaker with both nodes as Primary since DRBD is running in dual-Primary mode. </div><div><br></div><div>Here is what I did:</div><div>1) Started Pacemaker cluster with DLM+CLVM integrated on both nodes (server4 and server7).</div><div>2) Started DRBD, GFS2 on server4.</div><div>3) Integrated DRBD in Pacemaker on server4 using commands:</div><div>pcs cluster cib drbd_cfg</div><div>pcs -f drbd_cfg resource create drbd_data ocf:linbit:drbd drbd_resource=vDrbd op monitor interval=60s</div><div>pcs -f drbd_cfg resource master drbd_data_clone drbd_data master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true</div><div>4) Now shutdown other node (server7) and the running node (server4) was seen to be working fine. The VM on it did not stop and cLVM commands were also fine.</div><div>5) Bring up server7, start Pacemaker on it and verified none of the nodes hang or crash. VM was also working fine.</div><div><br></div><div>Thanks a lot for everybody's help and suggestions! It has really helped me.<br></div><div><br></div><div>However:</div><div>1) I have not yet integrated GFS2 into Pacemaker and shall do it in sometime. <br></div><div>2) In step5 above I need to activate the LVM using command: lvchange -a y /dev/DRBD_VolGroup}/DRBD_LogicalVolume after starting cluster. Not sure if this is required because GFS2 has not yet been integrated into Pacemaker or it is always required. Shall experiment and publish results.</div><div>3) If DRBD in step2 is not yet started but Pacemaker commands in step3 are done (first time I tried this) then other node is fenced off with error in /var/log/messages: server4 drbd(drbd_data)[12224]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.</div><div>So basically if Pacemaker is configured for DRBD (step3) before starting DRBD (step2) then other node is stonith'd. Thus I need to start DRBD resources first before integrating Pacemaker with DRBD. 
Here is my new Pacemaker status; the new entries are the Master/Slave set
drbd_data_clone with both nodes as Masters:

[root@server7 ~]# pcs status
Cluster name: vCluster
Stack: corosync
Current DC: server7ha (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Thu Mar 30 17:29:11 2017    Last change: Wed Mar 29 21:09:19 2017 by root via cibadmin on server4ha

2 nodes and 9 resources configured

Online: [ server4ha server7ha ]

Full list of resources:

 vCluster-VirtualIP-10.168.10.199   (ocf::heartbeat:IPaddr2):   Started server7ha
 vCluster-Stonith-server7ha   (stonith:fence_ipmilan):   Started server4ha
 vCluster-Stonith-server4ha   (stonith:fence_ipmilan):   Started server7ha
 Clone Set: dlm-clone [dlm]
     Started: [ server4ha server7ha ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ server4ha server7ha ]
 Master/Slave Set: drbd_data_clone [drbd_data]
     Masters: [ server4ha server7ha ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@server7 ~]#
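As a sanity check outside Pacemaker (a quick sketch; both commands are part
of DRBD 8.4), the dual-Primary state can also be confirmed on either node.
While both nodes are up and connected, each should report Primary/Primary:

   cat /proc/drbd
   drbdadm role vDrbd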
Again thanks for the help.

--Raman

On Sat, Mar 25, 2017 at 4:58 PM, Raman Gupta <ramangupta16@gmail.com> wrote:
> Hi,
>
> Thanks for the detailed explanation and the sample examples.
>
> I will work on the suggestions about the missing DRBD-Pacemaker and
> GFS2-Pacemaker configuration, re-check the fencing configuration, and let
> you know the results of my experiments.
>
> --Raman
>
> On Sat, Mar 25, 2017 at 5:48 AM, Igor Cicimov
> <igorc@encompasscorporation.com> wrote:
> > On 25 Mar 2017 11:00 am, "Igor Cicimov" <icicimov@gmail.com> wrote:
> > > Raman,
> > >
> > > On Sat, Mar 25, 2017 at 12:07 AM, Raman Gupta
> > > <ramangupta16@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > Thanks for looking into this issue. Here is my 'pcs status';
> > > > attached is the cib.xml Pacemaker file.
> > > >
> > > > [root@server4 cib]# pcs status
> > > > Cluster name: vCluster
> > > > Stack: corosync
> > > > Current DC: server7ha (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
> > > > Last updated: Fri Mar 24 18:33:05 2017    Last change: Wed Mar 22 13:22:19 2017 by root via cibadmin on server7ha
> > > >
> > > > 2 nodes and 7 resources configured
> > > >
> > > > Online: [ server4ha server7ha ]
> > > >
> > > > Full list of resources:
> > > >
> > > >  vCluster-VirtualIP-10.168.10.199   (ocf::heartbeat:IPaddr2):   Started server7ha
> > > >  vCluster-Stonith-server7ha   (stonith:fence_ipmilan):   Started server4ha
> > > >  vCluster-Stonith-server4ha   (stonith:fence_ipmilan):   Started server7ha
> > > >  Clone Set: dlm-clone [dlm]
> > > >      Started: [ server4ha server7ha ]
> > > >  Clone Set: clvmd-clone [clvmd]
> > > >      Started: [ server4ha server7ha ]
> > > >
> > > > Daemon Status:
> > > >   corosync: active/disabled
> > > >   pacemaker: active/disabled
> > > >   pcsd: active/enabled
> > > > [root@server4 cib]#
> > >
> > > This shows us the problem: you have not configured any DRBD resource
> > > in Pacemaker, hence it has no idea of, and no control over, DRBD.
> > >
> > > This is from one of my clusters:
> > >
> > > Online: [ sl01 sl02 ]
> > >
> > >  p_fence_sl01   (stonith:fence_ipmilan):   Started sl02
> > >  p_fence_sl02   (stonith:fence_ipmilan):   Started sl01
> > >  Master/Slave Set: ms_drbd [p_drbd_r0]
> > >      Masters: [ sl01 sl02 ]
> > >  Clone Set: cl_dlm [p_controld]
> > >      Started: [ sl01 sl02 ]
> > >  Clone Set: cl_fs_gfs2 [p_fs_gfs2]
> > >      Started: [ sl01 sl02 ]
> > >
> > > Notice the ms_drbd Master/Slave set, which your status lacks: you have
> > > missed configuring DRBD and its MS resource, and the matching
> > > colocation and order constraints too.
So the "resource-and-stonith" hook in your drbd config will never work, Pacemaker does not know about any drbd resources.</div><div><br></div><div>This is from one of my production clusters, it's on Ubuntu so no PCS just CRM and I'm not using cLVM just DLM:</div><div><br></div><div><div>primitive p_controld ocf:pacemaker:controld \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op monitor interval="60" timeout="60" \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op start interval="0" timeout="90" \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op stop interval="0" timeout="100" \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>params daemon="dlm_controld" \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>meta target-role="Started"</div><div><b>primitive p_drbd_r0 ocf:linbit:drbd \</b></div><div><b><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>params drbd_resource="r0" adjust_master_score="0 10 1000 10000" \</b></div><div><b><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op monitor interval="10" role="Master" \</b></div><div><b><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op monitor interval="20" role="Slave" \</b></div><div><b><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op start interval="0" timeout="240" \</b></div><div><b><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op stop interval="0" timeout="100"</b></div><div><div><b>ms ms_drbd p_drbd_r0 \</b></div><div><b><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" interleave="true" target-role="Started"</b></div></div><div>primitive p_fs_gfs2 ocf:heartbeat:Filesystem \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>params device="/dev/drbd0" directory="/data" fstype="gfs2" options="_netdev,noatime,rw,ac<wbr>l" \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op monitor interval="20" timeout="40" \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op start interval="0" timeout="60" \</div><div><span class="m_1503992010583330512m_7002742704264884126m_8360417868564894854gmail-Apple-tab-span" style="white-space:pre-wrap">        </span>op stop interval="0" timeout="60" 
> > >         meta is-managed="true"
> > > clone cl_dlm p_controld \
> > >         meta globally-unique="false" interleave="true" target-role="Started"
> > > clone cl_fs_gfs2 p_fs_gfs2 \
> > >         meta globally-unique="false" interleave="true" ordered="true" target-role="Started"
> > > colocation cl_fs_gfs2_dlm inf: cl_fs_gfs2 cl_dlm
> > > colocation co_drbd_dlm inf: cl_dlm ms_drbd:Master
> > > order o_dlm_fs_gfs2 inf: cl_dlm:start cl_fs_gfs2:start
> > > order o_drbd_dlm_fs_gfs2 inf: ms_drbd:promote cl_dlm:start cl_fs_gfs2:start
> > >
> > > I have excluded the fencing stuff for brevity. The resources you are
> > > missing are p_drbd_r0, ms_drbd, co_drbd_dlm and o_drbd_dlm_fs_gfs2;
> > > check the rest as well, though, you might find something you can use
> > > or cross-check against your config.
> > >
> > > Also thanks to Digimer for the very useful information (as always) he
> > > contributed explaining how things actually work.
> >
> > Just noticed your GFS2 is out of Pacemaker control; you need to sort
> > that out too.
> > > >
> > > > On Fri, Mar 24, 2017 at 1:49 PM, Raman Gupta
> > > > <ramangupta16@gmail.com> wrote:
> > > > > Hi All,
> > > > >
> > > > > I am having a problem where, in a GFS2 dual-Primary DRBD Pacemaker
> > > > > cluster, if one node crashes then the running node hangs! The cLVM
> > > > > commands hang, and the libvirt VM on the running node hangs.
> > > > >
> > > > > Env:
> > > > > ---------
> > > > > CentOS 7.3
> > > > > DRBD 8.4
> > > > > gfs2-utils-3.1.9-3.el7.x86_64
> > > > > Pacemaker 1.1.15-11.el7_3.4
> > > > > corosync-2.4.0-4.el7.x86_64
> > > > >
> > > > > Infrastructure:
> > > > > ------------------------
> > > > > 1) Running a 2-node Pacemaker cluster with proper fencing between
> > > > > the two.
> > > > > Nodes are server4 and server7.
> > > > >
> > > > > 2) Running DRBD dual-Primary, hosting a GFS2 filesystem.
> > > > >
> > > > > 3) Pacemaker has DLM and cLVM resources configured, among others.
> > > > >
> > > > > 4) A KVM/QEMU virtual machine is running on server4, which is
> > > > > holding the cluster resources.
> > > > >
> > > > > Normal:
> > > > > ------------
> > > > > 5) When the two nodes are both completely UP, things are fine. The
> > > > > DRBD dual-Primary works fine. The disk of the VM is hosted on the
> > > > > DRBD mount directory /backup, and the VM runs fine, with live
> > > > > migration happily happening between the 2 nodes.
> > > > >
> > > > > Problem:
> > > > > ----------------
> > > > > 6) Stop server7 [shutdown -h now] ---> LVM commands like pvdisplay
> > > > > hang, and the VM runs only for 120s ---> After 120s DRBD/GFS2
> > > > > panics (/var/log/messages below) on server4, the DRBD mount
> > > > > directory (/backup) becomes unavailable, and the VM hangs on
> > > > > server4. DRBD itself is fine on server4, in Primary/Secondary mode
> > > > > in WFConnection state.
> > > > >
> > > > > Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd
> > > > > Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing could not determine the master id of drbd resource vDrbd
> > > > > Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm fence-peer vDrbd exit code 1 (0x100)
> > > > > Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken, returned 1
> > > > > Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked for more than 120 seconds.
> > > > > Mar 24 11:32:01 server4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0 822 2 0x00000080
> > > > > Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue glock_work_func [gfs2]
> > > > > Mar 24 11:32:01 server4 kernel: ffff88027674bb10 0000000000000046 ffff8802736e9f60 ffff88027674bfd8
> > > > > Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8 ffff8802736e9f60 ffff8804757ef808
> > > > > Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28 ffff8804757ef800 ffff880473796c18
> > > > > Mar 24 11:32:01 server4 kernel: Call Trace:
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>] drbd_make_request+0x2a4/0x380 [drbd]
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ? aes_decrypt+0x260/0xe10
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>] generic_make_request+0x109/0x1e0
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>] submit_bio+0x71/0x150
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>] gfs2_meta_read+0x121/0x2a0 [gfs2]
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>] gfs2_meta_indirect_buffer+0x62/0x150 [gfs2]
> > > > > Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ? load_balance+0x192/0x990
> > > > >
> > > > > 7) After server7 is back UP, the Pacemaker cluster is started, DRBD
> > > > > is started, and the logical volume is activated; only after that
> > > > > does the DRBD mount directory (/backup) become available on server4
> > > > > and the VM resume.
> > > > > So from the time server7 goes down until it is completely UP
> > > > > again, the VM on server4 hangs.
> > > > >
> > > > > Can anyone help with how to avoid the running node hanging when
> > > > > the other node crashes?
> > > > >
> > > > > Attaching the DRBD config file.
> > > > >
> > > > > --Raman

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user