Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Fri, Mar 24, 2017 at 7:19 PM, Raman Gupta <ramangupta16 at gmail.com> wrote:
> Hi All,
>
> I am having a problem where, in a GFS2 dual-Primary-DRBD Pacemaker
> Cluster, if a node crashes then the running node hangs! The CLVM commands
> hang, and the libvirt VM on the running node hangs.
>
> Env:
> ---------
> CentOS 7.3
> DRBD 8.4
> gfs2-utils-3.1.9-3.el7.x86_64
> Pacemaker 1.1.15-11.el7_3.4
> corosync-2.4.0-4.el7.x86_64
>
>
> Infrastructure:
> ------------------------
> 1) Running a 2-node Pacemaker Cluster with proper fencing between the two.
>    Nodes are server4 and server7.
>
> 2) Running DRBD dual-Primary and hosting a GFS2 filesystem.
>
> 3) Pacemaker has DLM and cLVM resources configured, among others.
>
> 4) A KVM/QEMU virtual machine is running on server4, which is holding the
>    cluster resources.
>
>
> Normal:
> ------------
> 5) In normal condition, when the two nodes are completely UP, things are
>    fine. The DRBD dual-primary works fine. The disk of the VM is hosted on
>    the DRBD mount directory /backup and the VM runs fine, with Live
>    Migration happily happening between the 2 nodes.
>
>
> Problem:
> ----------------
> 6) Stop server7 [shutdown -h now] ---> LVM commands like pvdisplay hang,
>    the VM runs only for 120s ---> After 120s DRBD/GFS2 panics
>    (/var/log/messages below) on server4, the DRBD mount directory (/backup)
>    becomes unavailable and the VM hangs on server4. DRBD itself is fine on
>    server4, in Primary/Secondary mode and WFConnection state.
>
> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd
> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing could not determine the master id of drbd resource vDrbd
> *Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm fence-peer vDrbd exit code 1 (0x100)*
> *Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken, returned 1*

I guess this is the problem. Since the drbd fencing script fails, DLM will
hang to avoid resource corruption, because it has no information about the
status of the other node.
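The "could not determine the master id of drbd resource vDrbd" warning
usually means crm-fence-peer.sh could not find a Master/Slave resource for
vDrbd in the CIB, for example because DRBD is started outside of pacemaker.
Just as a rough sketch (the resource IDs below are made up, not taken from
your setup), managing DRBD from pacemaker with pcs would look something
like this:

    # DRBD primitive plus a master/slave clone that is promoted on both
    # nodes (dual-primary). crm-fence-peer.sh looks this up in the CIB.
    pcs resource create p_drbd_vDrbd ocf:linbit:drbd \
        drbd_resource=vDrbd \
        op monitor interval=29s role=Master \
        op monitor interval=31s role=Slave
    pcs resource master ms_drbd_vDrbd p_drbd_vDrbd \
        master-max=2 master-node-max=1 \
        clone-max=2 clone-node-max=1 notify=true

Whether that is really the cause here is hard to say without seeing your
pacemaker config.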
> Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked for more than 120 seconds.
> Mar 24 11:32:01 server4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0 822 2 0x00000080
> Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue glock_work_func [gfs2]
> Mar 24 11:32:01 server4 kernel: ffff88027674bb10 0000000000000046 ffff8802736e9f60 ffff88027674bfd8
> Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8 ffff8802736e9f60 ffff8804757ef808
> Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28 ffff8804757ef800 ffff880473796c18
> Mar 24 11:32:01 server4 kernel: Call Trace:
> Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
> Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>] drbd_make_request+0x2a4/0x380 [drbd]
> Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ? aes_decrypt+0x260/0xe10
> Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30
> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>] generic_make_request+0x109/0x1e0
> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>] submit_bio+0x71/0x150
> Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>] gfs2_meta_read+0x121/0x2a0 [gfs2]
> Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>] gfs2_meta_indirect_buffer+0x62/0x150 [gfs2]
> Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ? load_balance+0x192/0x990
>
> 7) After server7 is UP, the Pacemaker Cluster is started, DRBD is started
>    and the Logical Volume is activated; only after that does the DRBD mount
>    directory (/backup) become available again on server4 and the VM resume
>    on server4. So from the moment server7 goes down until it is completely
>    UP again, the VM on server4 hangs.
>
> Can anyone help how to avoid the running node hanging when the other node
> crashes?
>
> Attaching DRBD config file.

Do you actually have fencing configured in pacemaker? Since you have the
drbd fencing policy set to "resource-and-stonith" you *must* have fencing
set up in pacemaker too. Have you also set no-quorum-policy="ignore" in
pacemaker? Maybe show us your pacemaker config as well, so we don't have to
guess (a rough sketch of a minimal stonith setup is at the bottom of this
mail).

Not related to the problem, but I would also add the "after-resync-target"
handler:

    handlers {
        ...
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

> --Raman
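P.S.: In case it helps, a minimal stonith setup with pcs could look roughly
like the sketch below. The fence agent, addresses and credentials are only
placeholders for whatever out-of-band management your servers actually
have; none of them are taken from this thread.

    # One stonith device per node; fence_ipmilan is just an example agent.
    pcs stonith create fence_server4 fence_ipmilan \
        pcmk_host_list=server4 ipaddr=192.168.0.14 \
        login=admin passwd=secret op monitor interval=60s
    pcs stonith create fence_server7 fence_ipmilan \
        pcmk_host_list=server7 ipaddr=192.168.0.17 \
        login=admin passwd=secret op monitor interval=60s
    pcs property set stonith-enabled=true
    # For DLM/GFS2 clusters "freeze" is the commonly recommended
    # no-quorum-policy, rather than "ignore".
    pcs property set no-quorum-policy=freeze

Without working stonith, neither crm-fence-peer.sh nor DLM has a safe way
to decide that the dead peer is really gone, which matches the hang you are
seeing.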