Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

Thanks for looking into this issue. Here is my 'pcs status' output, and attached is the cib.xml Pacemaker file.

[root@server4 cib]# pcs status
Cluster name: vCluster
Stack: corosync
Current DC: server7ha (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Mar 24 18:33:05 2017
Last change: Wed Mar 22 13:22:19 2017 by root via cibadmin on server7ha

2 nodes and 7 resources configured

Online: [ server4ha server7ha ]

Full list of resources:

 vCluster-VirtualIP-10.168.10.199   (ocf::heartbeat:IPaddr2):   Started server7ha
 vCluster-Stonith-server7ha         (stonith:fence_ipmilan):    Started server4ha
 vCluster-Stonith-server4ha         (stonith:fence_ipmilan):    Started server7ha
 Clone Set: dlm-clone [dlm]
     Started: [ server4ha server7ha ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ server4ha server7ha ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@server4 cib]#

On Fri, Mar 24, 2017 at 1:49 PM, Raman Gupta <ramangupta16 at gmail.com> wrote:
> Hi All,
>
> I am having a problem where, in a GFS2 dual-primary DRBD Pacemaker
> cluster, if one node crashes then the surviving node hangs! The CLVM
> commands hang, and the libvirt VM on the surviving node hangs.
>
> Env:
> ---------
> CentOS 7.3
> DRBD 8.4
> gfs2-utils-3.1.9-3.el7.x86_64
> Pacemaker 1.1.15-11.el7_3.4
> corosync-2.4.0-4.el7.x86_64
>
>
> Infrastructure:
> ------------------------
> 1) Running a 2-node Pacemaker cluster with proper fencing between the two.
> Nodes are server4 and server7.
>
> 2) Running DRBD dual-primary, hosting a GFS2 filesystem.
>
> 3) Pacemaker has DLM and cLVM resources configured, among others.
>
> 4) A KVM/QEMU virtual machine is running on server4, which is holding the
> cluster resources.
>
>
> Normal:
> ------------
> 5) In normal conditions, when the two nodes are completely UP, things
> are fine. The DRBD dual-primary works fine. The disk of the VM is hosted
> on the DRBD mount directory /backup, and the VM runs fine, with live
> migration happily happening between the 2 nodes.
>
>
> Problem:
> ----------------
> 6) Stop server7 [shutdown -h now] ---> LVM commands like pvdisplay hang,
> and the VM runs only for 120s ---> after 120s DRBD/GFS2 panics
> (/var/log/messages below) on server4, the DRBD mount directory (/backup)
> becomes unavailable, and the VM hangs on server4. DRBD itself is fine on
> server4, in Primary/Secondary mode in the WFConnection state.
>
> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd
> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing
> could not determine the master id of drbd resource vDrbd
> Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm
> fence-peer vDrbd exit code 1 (0x100)
> Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken,
> returned 1
> Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked for
> more than 120 seconds.
> Mar 24 11:32:01 server4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0
> 822 2 0x00000080
> Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue glock_work_func
> [gfs2]
> Mar 24 11:32:01 server4 kernel: ffff88027674bb10 0000000000000046
> ffff8802736e9f60 ffff88027674bfd8
> Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8
> ffff8802736e9f60 ffff8804757ef808
> Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28
> ffff8804757ef800 ffff880473796c18
> Mar 24 11:32:01 server4 kernel: Call Trace:
> Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
> Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>]
> drbd_make_request+0x2a4/0x380 [drbd]
> Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ?
> aes_decrypt+0x260/0xe10
> Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ?
> wake_up_atomic_t+0x30/0x30
> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>]
> generic_make_request+0x109/0x1e0
> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>] submit_bio+0x71/0x150
> Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>]
> gfs2_meta_read+0x121/0x2a0 [gfs2]
> Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>]
> gfs2_meta_indirect_buffer+0x62/0x150 [gfs2]
> Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ?
> load_balance+0x192/0x990
>
> 7) After server7 is UP again, the Pacemaker cluster is started, DRBD is
> started, and the logical volume is activated; only after that does the
> DRBD mount directory (/backup) become available on server4 and the VM
> resume. So from the moment server7 goes down until it is completely UP,
> the VM on server4 hangs.
>
>
> Can anyone help with how to avoid the surviving node hanging when the
> other node crashes?
>
>
> Attaching DRBD config file.
>
>
> --Raman
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.xml
Type: text/xml
Size: 7477 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/d5176a52/attachment.bin>
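The "fence-peer helper broken, returned 1" and "could not determine the master id" messages point at the DRBD fencing hook rather than at GFS2 itself: crm-fence-peer.sh could not place a fencing constraint, so GFS2/DLM froze I/O waiting for a fence that never completed. A minimal sketch of the relevant DRBD 8.4 resource sections is below; the resource name vDrbd comes from the logs, everything else is illustrative and must be adapted to the actual config attached to the original post.

```
# /etc/drbd.d/vDrbd.res -- sketch only, not the poster's actual file
resource vDrbd {
  net {
    # required for GFS2 dual-primary operation
    allow-two-primaries yes;
  }
  disk {
    # freeze I/O on connection loss until the peer is confirmed fenced
    fencing resource-and-stonith;
  }
  handlers {
    # stock Pacemaker integration helpers shipped with DRBD 8.4
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

Note that crm-fence-peer.sh expects the DRBD resource to be managed by Pacemaker as a master/slave (ocf:linbit:drbd) resource so it can locate the master id; the warning in the log is consistent with DRBD being started outside Pacemaker's control.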
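Separately, a two-node DLM/GFS2 cluster normally needs its quorum behavior set so that resources freeze (rather than stop) on quorum loss, with corosync's two-node votequorum handling enabled. A sketch of the usual settings, assuming the CentOS 7 pcs/corosync 2.4 stack from the original post:

```
# Pacemaker property: freeze GFS2/DLM activity on quorum loss instead of
# stopping resources (sketch -- verify against the attached cib.xml)
pcs property set no-quorum-policy=freeze

# /etc/corosync/corosync.conf quorum section for a 2-node cluster:
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

With resource-and-stonith fencing working, the surviving node should unfreeze as soon as STONITH confirms the peer is dead, instead of hanging until the peer rejoins.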