Hi,

Thanks for the detailed explanation and sample examples. I will work on
the suggestions about the missing DRBD-Pacemaker and GFS2-Pacemaker
configuration, re-check the fencing configuration, and let you know the
results of my experiments.

--Raman

On Sat, Mar 25, 2017 at 5:48 AM, Igor Cicimov
<igorc at encompasscorporation.com> wrote:
>
> On 25 Mar 2017 11:00 am, "Igor Cicimov" <icicimov at gmail.com> wrote:
>
> Raman,
>
> On Sat, Mar 25, 2017 at 12:07 AM, Raman Gupta <ramangupta16 at gmail.com>
> wrote:
>
>> Hi,
>>
>> Thanks for looking into this issue. Here is my 'pcs status'; attached
>> is the cib.xml pacemaker file.
>>
>> [root at server4 cib]# pcs status
>> Cluster name: vCluster
>> Stack: corosync
>> Current DC: server7ha (version 1.1.15-11.el7_3.4-e174ec8) - partition
>> with quorum
>> Last updated: Fri Mar 24 18:33:05 2017
>> Last change: Wed Mar 22 13:22:19 2017 by root via cibadmin on server7ha
>>
>> 2 nodes and 7 resources configured
>>
>> Online: [ server4ha server7ha ]
>>
>> Full list of resources:
>>
>>  vCluster-VirtualIP-10.168.10.199   (ocf::heartbeat:IPaddr2):  Started server7ha
>>  vCluster-Stonith-server7ha         (stonith:fence_ipmilan):   Started server4ha
>>  vCluster-Stonith-server4ha         (stonith:fence_ipmilan):   Started server7ha
>>  Clone Set: dlm-clone [dlm]
>>      Started: [ server4ha server7ha ]
>>  Clone Set: clvmd-clone [clvmd]
>>      Started: [ server4ha server7ha ]
>>
>> Daemon Status:
>>   corosync: active/disabled
>>   pacemaker: active/disabled
>>   pcsd: active/enabled
>> [root at server4 cib]#
>>
>
> This shows us the problem: you have not configured any DRBD resource in
> Pacemaker, so it has no knowledge of, and no control over, it.
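[Archive note: on CentOS 7 with pcs, the missing DRBD master/slave
resource could be created roughly as follows. This is a sketch, not taken
from the attached cib.xml: the resource names drbd_vDrbd and ms_drbd and
the timeout values are illustrative; only drbd_resource=vDrbd comes from
the DRBD configuration discussed in this thread.]

    # Illustrative: define the DRBD primitive
    pcs resource create drbd_vDrbd ocf:linbit:drbd drbd_resource=vDrbd \
        op monitor interval=10s role=Master \
           monitor interval=20s role=Slave \
           start timeout=240s stop timeout=100s

    # Illustrative: promote it on both nodes (dual-primary);
    # notify=true is required by the ocf:linbit:drbd agent
    pcs resource master ms_drbd drbd_vDrbd \
        master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 \
        notify=true interleave=true

[With such a master/slave resource in the CIB, crm-fence-peer.sh can find
the master id it complains about in the logs further down this thread.]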
>
> This is from one of my clusters:
>
> Online: [ sl01 sl02 ]
>
>  p_fence_sl01  (stonith:fence_ipmilan):  Started sl02
>  p_fence_sl02  (stonith:fence_ipmilan):  Started sl01
>  *Master/Slave Set: ms_drbd [p_drbd_r0]*
>  *    Masters: [ sl01 sl02 ]*
>  Clone Set: cl_dlm [p_controld]
>      Started: [ sl01 sl02 ]
>  Clone Set: cl_fs_gfs2 [p_fs_gfs2]
>      Started: [ sl01 sl02 ]
>
> You can see the resources you are missing in bold. More specifically, you
> have not configured DRBD and its MS (master/slave) resource, nor the
> corresponding colocation and ordering constraints. So the
> "resource-and-stonith" hook in your drbd config will never work;
> Pacemaker does not know about any DRBD resources.
>
> This is from one of my production clusters. It's on Ubuntu, so no pcs,
> just crm, and I'm not using cLVM, just DLM:
>
> primitive p_controld ocf:pacemaker:controld \
>   op monitor interval="60" timeout="60" \
>   op start interval="0" timeout="90" \
>   op stop interval="0" timeout="100" \
>   params daemon="dlm_controld" \
>   meta target-role="Started"
> *primitive p_drbd_r0 ocf:linbit:drbd \
>   params drbd_resource="r0" adjust_master_score="0 10 1000 10000" \
>   op monitor interval="10" role="Master" \
>   op monitor interval="20" role="Slave" \
>   op start interval="0" timeout="240" \
>   op stop interval="0" timeout="100"*
> *ms ms_drbd p_drbd_r0 \
>   meta master-max="2" master-node-max="1" clone-max="2"
>   clone-node-max="1" notify="true" interleave="true"
>   target-role="Started"*
> primitive p_fs_gfs2 ocf:heartbeat:Filesystem \
>   params device="/dev/drbd0" directory="/data" fstype="gfs2"
>   options="_netdev,noatime,rw,acl" \
>   op monitor interval="20" timeout="40" \
>   op start interval="0" timeout="60" \
>   op stop interval="0" timeout="60" \
>   meta is-managed="true"
> clone cl_dlm p_controld \
>   meta globally-unique="false" interleave="true" target-role="Started"
> clone cl_fs_gfs2 p_fs_gfs2 \
>   meta globally-unique="false" interleave="true" ordered="true"
>   target-role="Started"
> colocation cl_fs_gfs2_dlm inf: cl_fs_gfs2 cl_dlm
> *colocation co_drbd_dlm inf: cl_dlm ms_drbd:Master*
> order o_dlm_fs_gfs2 inf: cl_dlm:start cl_fs_gfs2:start
> *order o_drbd_dlm_fs_gfs2 inf: ms_drbd:promote cl_dlm:start
> cl_fs_gfs2:start*
>
> I have excluded the fencing stuff for brevity and highlighted the
> resources you are missing. Check the rest as well, though; you might find
> something you can use or cross-check with your config.
>
> Also thanks to Digimer for the very useful information (as always) he
> contributed explaining how things actually work.
>
> Just noticed your gfs2 is out of Pacemaker control; you need to sort
> that out too.
>
>> On Fri, Mar 24, 2017 at 1:49 PM, Raman Gupta <ramangupta16 at gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am having a problem where, in a GFS2 dual-primary DRBD Pacemaker
>>> cluster, if a node crashes then the surviving node hangs! The cLVM
>>> commands hang, and the libvirt VM on the surviving node hangs.
>>>
>>> Env:
>>> ---------
>>> CentOS 7.3
>>> DRBD 8.4
>>> gfs2-utils-3.1.9-3.el7.x86_64
>>> Pacemaker 1.1.15-11.el7_3.4
>>> corosync-2.4.0-4.el7.x86_64
>>>
>>> Infrastructure:
>>> ------------------------
>>> 1) Running a 2-node Pacemaker cluster with proper fencing between the
>>> two. Nodes are server4 and server7.
>>>
>>> 2) Running DRBD dual-primary and hosting a GFS2 filesystem.
>>>
>>> 3) Pacemaker has DLM and cLVM resources configured, among others.
>>>
>>> 4) A KVM/QEMU virtual machine is running on server4, which is holding
>>> the cluster resources.
>>>
>>> Normal:
>>> ------------
>>> 5) In normal conditions, when the two nodes are completely UP, things
>>> are fine. DRBD dual-primary works fine. The disk of the VM is hosted
>>> on the DRBD mount directory /backup, and the VM runs fine with live
>>> migration happily happening between the 2 nodes.
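[Archive note: for pcs-based setups like the one in this thread, the
highlighted crm colocation and ordering constraints above translate
approximately to the commands below. The name dlm-clone comes from the
pcs status output earlier in the thread; ms_drbd is a hypothetical name
for the DRBD master/slave resource, which does not yet exist in Raman's
configuration.]

    # DLM may only run where DRBD is master...
    pcs constraint colocation add dlm-clone with master ms_drbd INFINITY

    # ...and may only start after DRBD has been promoted
    pcs constraint order promote ms_drbd then start dlm-clone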
>>>
>>> Problem:
>>> ----------------
>>> 6) Stop server7 [shutdown -h now] ---> LVM commands like pvdisplay
>>> hang, the VM runs only for 120s ---> after 120s DRBD/GFS2 panics
>>> (/var/log/messages below) on server4, the DRBD mount directory
>>> (/backup) becomes unavailable, and the VM hangs on server4. DRBD
>>> itself is fine on server4, in Primary/Secondary mode in WFConnection
>>> state.
>>>
>>> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd
>>> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing
>>> could not determine the master id of drbd resource vDrbd
>>> Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command:
>>> /sbin/drbdadm fence-peer vDrbd exit code 1 (0x100)
>>> Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken,
>>> returned 1
>>> Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked
>>> for more than 120 seconds.
>>> Mar 24 11:32:01 server4 kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0
>>> 822 2 0x00000080
>>> Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue
>>> glock_work_func [gfs2]
>>> Mar 24 11:32:01 server4 kernel: ffff88027674bb10 0000000000000046
>>> ffff8802736e9f60 ffff88027674bfd8
>>> Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8
>>> ffff8802736e9f60 ffff8804757ef808
>>> Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28
>>> ffff8804757ef800 ffff880473796c18
>>> Mar 24 11:32:01 server4 kernel: Call Trace:
>>> Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
>>> Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>]
>>> drbd_make_request+0x2a4/0x380 [drbd]
>>> Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ?
>>> aes_decrypt+0x260/0xe10
>>> Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ?
>>> wake_up_atomic_t+0x30/0x30
>>> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>]
>>> generic_make_request+0x109/0x1e0
>>> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>]
>>> submit_bio+0x71/0x150
>>> Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>]
>>> gfs2_meta_read+0x121/0x2a0 [gfs2]
>>> Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>]
>>> gfs2_meta_indirect_buffer+0x62/0x150 [gfs2]
>>> Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ?
>>> load_balance+0x192/0x990
>>>
>>> 7) After server7 is UP, the Pacemaker cluster is started, DRBD is
>>> started, and the logical volume is activated; only after that does the
>>> DRBD mount directory (/backup) become available again and the VM
>>> resume on server4. So from the moment server7 goes down until it is
>>> completely UP again, the VM on server4 hangs.
>>>
>>> Can anyone help with how to avoid the surviving node hanging when the
>>> other node crashes?
>>>
>>> Attaching the DRBD config file.
>>>
>>> --Raman
>>
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
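
[Archive note: the "resource-and-stonith" hook and the fence-peer helper
discussed in this thread live in the DRBD resource configuration. On
DRBD 8.4 with Pacemaker it typically looks like the fragment below; this
is a generic sketch following the DRBD User's Guide, not Raman's attached
config. crm-fence-peer.sh works by placing a location constraint on the
DRBD master resource in the CIB, which is why it fails with "could not
determine the master id" when no such resource is configured.]

    resource vDrbd {
        disk {
            # Suspend I/O and fence the peer via Pacemaker on disconnect
            fencing resource-and-stonith;
        }
        handlers {
            # Place/remove a constraint on the DRBD master resource
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }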