Hi All,

I am having a problem where, in a GFS2 dual-primary DRBD Pacemaker cluster, a node crash causes the surviving node to hang: CLVM commands hang, and the libvirt VM on the surviving node hangs as well.

Env:
---------
CentOS 7.3
DRBD 8.4
gfs2-utils-3.1.9-3.el7.x86_64
Pacemaker 1.1.15-11.el7_3.4
corosync-2.4.0-4.el7.x86_64

Infrastructure:
------------------------
1) A two-node Pacemaker cluster with proper fencing between the two nodes. The nodes are server4 and server7.
2) DRBD runs dual-primary and hosts a GFS2 filesystem.
3) Pacemaker has DLM and cLVM resources configured, among others.
4) A KVM/QEMU virtual machine runs on server4, which holds the cluster resources.

Normal:
------------
5) When both nodes are completely up, everything is fine. DRBD dual-primary works, the VM's disk is hosted on the DRBD mount directory /backup, and the VM runs fine, with live migration happily happening between the two nodes.

Problem:
----------------
6) Stop server7 [shutdown -h now] ---> LVM commands such as pvdisplay hang, and the VM keeps running for only 120 s ---> after 120 s, DRBD/GFS2 panics on server4 (/var/log/messages below), the DRBD mount directory (/backup) becomes unavailable, and the VM hangs on server4. DRBD itself is fine on server4, in Primary/Secondary mode and WFConnection state.

Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd
Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing could not determine the master id of drbd resource vDrbd
Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm fence-peer vDrbd exit code 1 (0x100)
Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken, returned 1
Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked for more than 120 seconds.
Mar 24 11:32:01 server4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0 822 2 0x00000080
Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue glock_work_func [gfs2]
Mar 24 11:32:01 server4 kernel: ffff88027674bb10 0000000000000046 ffff8802736e9f60 ffff88027674bfd8
Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8 ffff8802736e9f60 ffff8804757ef808
Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28 ffff8804757ef800 ffff880473796c18
Mar 24 11:32:01 server4 kernel: Call Trace:
Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>] drbd_make_request+0x2a4/0x380 [drbd]
Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ? aes_decrypt+0x260/0xe10
Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30
Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>] generic_make_request+0x109/0x1e0
Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>] submit_bio+0x71/0x150
Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>] gfs2_meta_read+0x121/0x2a0 [gfs2]
Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>] gfs2_meta_indirect_buffer+0x62/0x150 [gfs2]
Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ? load_balance+0x192/0x990

7) Only after server7 is back up, the Pacemaker cluster is started, DRBD is started, and the logical volume is activated does the DRBD mount directory (/backup) become available again on server4 and the VM resume. So from the moment server7 goes down until it is completely up again, the VM on server4 hangs.

Can anyone help with how to avoid the surviving node hanging when the other node crashes? Attaching the DRBD config file.

--Raman
Attachments (scrubbed by the list archive):
- global_common.conf (306 bytes): <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/98b730fe/attachment.obj>
- vDrbd.res (629 bytes): <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/98b730fe/attachment-0001.obj>
- corosync.conf (468 bytes): <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/98b730fe/attachment-0002.obj>
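[Editor's note on the log above: the line "WARNING drbd-fencing could not determine the master id of drbd resource vDrbd" means crm-fence-peer.sh could not find a master/slave (multi-state) Pacemaker resource for vDrbd in the CIB, so the fence-peer handler exited non-zero and DRBD, running with a "resource-and-stonith"-style fencing policy, kept I/O frozen until the peer returned. A minimal sketch of the fencing-related pieces of a DRBD 8.4 configuration is below; the handler paths are the stock ones shipped with DRBD 8.4, and everything else is an assumption, not the poster's actual files:]

```
# Sketch of the fencing-related parts of global_common.conf / vDrbd.res
# (hypothetical; the poster's real files were attached to the list post).
disk {
    # With resource-and-stonith, DRBD freezes I/O when the peer is lost
    # and only resumes after the fence-peer handler confirms fencing.
    fencing resource-and-stonith;
}
handlers {
    # Stock handlers shipped with DRBD 8.4 for Pacemaker integration:
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
```

For crm-fence-peer.sh to find the "master id", the DRBD resource has to be managed by Pacemaker as an ocf:linbit:drbd master/slave resource; if DRBD is started outside the cluster (e.g. via systemd), the script finds no matching CIB entry and fails exactly as in the log.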
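[Editor's note: for completeness, a sketch of how a dual-primary DRBD resource is typically wired into a CentOS 7 Pacemaker cluster with pcs, so that crm-fence-peer.sh can locate it and so that DLM/cLVM ordering is enforced. All resource names (vDrbd-clone, dlm-clone, clvmd-clone) are examples, not taken from the poster's cluster:]

```
# Hypothetical pcs commands (CentOS 7 pcs syntax); names are examples.
pcs resource create vDrbd ocf:linbit:drbd drbd_resource=vDrbd \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
# Dual-primary: both nodes may be promoted to Master.
pcs resource master vDrbd-clone vDrbd \
    master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
# DLM and clvmd must be running before the GFS2 mount; DRBD must be
# promoted before clvmd can activate the clustered LV.
pcs constraint order promote vDrbd-clone then start clvmd-clone
pcs constraint order start dlm-clone then clvmd-clone
```

Note that the observed 120 s hang is expected behaviour while fencing is unresolved: DLM and GFS2 block all lock traffic until the dead node is confirmed fenced, so working stonith plus a working fence-peer handler is what lets the surviving node resume instead of waiting for the peer to come back.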