Hi All,

I am having a problem where, in a GFS2 dual-primary DRBD Pacemaker cluster, a node crash causes the surviving node to hang: CLVM commands hang, and the libvirt VM on the surviving node hangs as well.

Env:
---------
CentOS 7.3
DRBD 8.4
gfs2-utils-3.1.9-3.el7.x86_64
Pacemaker 1.1.15-11.el7_3.4
corosync-2.4.0-4.el7.x86_64

Infrastructure:
------------------------
1) A two-node Pacemaker cluster with proper fencing between the two nodes. The nodes are server4 and server7.
2) DRBD runs dual-primary and hosts a GFS2 filesystem.
3) Pacemaker has DLM and cLVM resources configured, among others.
4) A KVM/QEMU virtual machine runs on server4, which holds the cluster resources.

Normal:
------------
5) When both nodes are completely up, everything is fine. DRBD dual-primary works, the VM's disk is hosted on the DRBD mount directory /backup, and the VM runs fine, with live migration happily happening between the two nodes.

Problem:
----------------
6) Stop server7 [shutdown -h now] ---> LVM commands such as pvdisplay hang, and the VM keeps running for only 120 s ---> after 120 s, DRBD/GFS2 panics on server4 (/var/log/messages below), the DRBD mount directory (/backup) becomes unavailable, and the VM hangs on server4. DRBD itself is fine on server4, in Primary/Secondary mode and WFConnection state.

Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd
Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing could not determine the master id of drbd resource vDrbd
Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm fence-peer vDrbd exit code 1 (0x100)
Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken, returned 1
Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked for more than 120 seconds.
Mar 24 11:32:01 server4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0 822 2 0x00000080
Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue glock_work_func [gfs2]
Mar 24 11:32:01 server4 kernel: ffff88027674bb10 0000000000000046 ffff8802736e9f60 ffff88027674bfd8
Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8 ffff8802736e9f60 ffff8804757ef808
Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28 ffff8804757ef800 ffff880473796c18
Mar 24 11:32:01 server4 kernel: Call Trace:
Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>] drbd_make_request+0x2a4/0x380 [drbd]
Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ? aes_decrypt+0x260/0xe10
Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30
Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>] generic_make_request+0x109/0x1e0
Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>] submit_bio+0x71/0x150
Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>] gfs2_meta_read+0x121/0x2a0 [gfs2]
Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>] gfs2_meta_indirect_buffer+0x62/0x150 [gfs2]
Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ? load_balance+0x192/0x990

7) Only after server7 is back up, the Pacemaker cluster is started, DRBD is started, and the logical volume is activated does the DRBD mount directory (/backup) become available again on server4 and the VM resume. So from the moment server7 goes down until it is completely up again, the VM on server4 hangs.

Can anyone help with how to avoid the surviving node hanging when the other node crashes? Attaching the DRBD config file.

--Raman
Attachments (scrubbed by the list archive):
- global_common.conf (306 bytes): <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/98b730fe/attachment.obj>
- vDrbd.res (629 bytes): <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/98b730fe/attachment-0001.obj>
- corosync.conf (468 bytes): <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/98b730fe/attachment-0002.obj>
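[Editor's note on the log above: the line "WARNING drbd-fencing could not determine the master id of drbd resource vDrbd" means crm-fence-peer.sh could not find a master/slave (multi-state) Pacemaker resource for vDrbd in the CIB, so the fence-peer handler exited non-zero and DRBD, running with a "resource-and-stonith"-style fencing policy, kept I/O frozen until the peer returned. A minimal sketch of the fencing-related pieces of a DRBD 8.4 configuration is below; the handler paths are the stock ones shipped with DRBD 8.4, and everything else is an assumption, not the poster's actual files:]

```
# Sketch of the fencing-related parts of global_common.conf / vDrbd.res
# (hypothetical; the poster's real files were attached to the list post).
disk {
    # With resource-and-stonith, DRBD freezes I/O when the peer is lost
    # and only resumes after the fence-peer handler confirms fencing.
    fencing resource-and-stonith;
}
handlers {
    # Stock handlers shipped with DRBD 8.4 for Pacemaker integration:
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
```

For crm-fence-peer.sh to find the "master id", the DRBD resource has to be managed by Pacemaker as an ocf:linbit:drbd master/slave resource; if DRBD is started outside the cluster (e.g. via systemd), the script finds no matching CIB entry and fails exactly as in the log.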
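[Editor's note: for completeness, a sketch of how a dual-primary DRBD resource is typically wired into a CentOS 7 Pacemaker cluster with pcs, so that crm-fence-peer.sh can locate it and so that DLM/cLVM ordering is enforced. All resource names (vDrbd-clone, dlm-clone, clvmd-clone) are examples, not taken from the poster's cluster:]

```
# Hypothetical pcs commands (CentOS 7 pcs syntax); names are examples.
pcs resource create vDrbd ocf:linbit:drbd drbd_resource=vDrbd \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
# Dual-primary: both nodes may be promoted to Master.
pcs resource master vDrbd-clone vDrbd \
    master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
# DLM and clvmd must be running before the GFS2 mount; DRBD must be
# promoted before clvmd can activate the clustered LV.
pcs constraint order promote vDrbd-clone then start clvmd-clone
pcs constraint order start dlm-clone then clvmd-clone
```

Note that the observed 120 s hang is expected behaviour while fencing is unresolved: DLM and GFS2 block all lock traffic until the dead node is confirmed fenced, so working stonith plus a working fence-peer handler is what lets the surviving node resume instead of waiting for the peer to come back.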