Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

Thanks for looking into this issue. Here is my 'pcs status' output, and attached is the cib.xml Pacemaker file.

[root@server4 cib]# pcs status
Cluster name: vCluster
Stack: corosync
Current DC: server7ha (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Mar 24 18:33:05 2017
Last change: Wed Mar 22 13:22:19 2017 by root via cibadmin on server7ha

2 nodes and 7 resources configured

Online: [ server4ha server7ha ]

Full list of resources:

 vCluster-VirtualIP-10.168.10.199   (ocf::heartbeat:IPaddr2):   Started server7ha
 vCluster-Stonith-server7ha         (stonith:fence_ipmilan):    Started server4ha
 vCluster-Stonith-server4ha         (stonith:fence_ipmilan):    Started server7ha
 Clone Set: dlm-clone [dlm]
     Started: [ server4ha server7ha ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ server4ha server7ha ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@server4 cib]#

On Fri, Mar 24, 2017 at 1:49 PM, Raman Gupta <ramangupta16 at gmail.com> wrote:
> Hi All,
>
> I am having a problem where, in a GFS2 dual-primary DRBD Pacemaker
> cluster, if one node crashes then the surviving node hangs! The CLVM
> commands hang, and the libvirt VM on the surviving node hangs.
>
> Env:
> ---------
> CentOS 7.3
> DRBD 8.4
> gfs2-utils-3.1.9-3.el7.x86_64
> Pacemaker 1.1.15-11.el7_3.4
> corosync-2.4.0-4.el7.x86_64
>
>
> Infrastructure:
> ------------------------
> 1) Running a 2-node Pacemaker cluster with proper fencing between the two.
> Nodes are server4 and server7.
>
> 2) Running DRBD dual-primary, hosting a GFS2 filesystem.
>
> 3) Pacemaker has DLM and cLVM resources configured, among others.
>
> 4) A KVM/QEMU virtual machine is running on server4, which is holding the
> cluster resources.
>
>
> Normal:
> ------------
> 5) In normal conditions, when the two nodes are completely UP, things
> are fine. The DRBD dual-primary works fine. The disk of the VM is hosted
> on the DRBD mount directory /backup, and the VM runs fine, with live
> migration happily happening between the 2 nodes.
>
>
> Problem:
> ----------------
> 6) Stop server7 [shutdown -h now] ---> LVM commands like pvdisplay hang,
> and the VM runs only for 120s ---> after 120s DRBD/GFS2 panics
> (/var/log/messages below) on server4, the DRBD mount directory (/backup)
> becomes unavailable, and the VM hangs on server4. DRBD itself is fine on
> server4, in Primary/Secondary mode in the WFConnection state.
>
> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: invoked for vDrbd
> Mar 24 11:29:28 server4 crm-fence-peer.sh[54702]: WARNING drbd-fencing
> could not determine the master id of drbd resource vDrbd
> Mar 24 11:29:28 server4 kernel: drbd vDrbd: helper command: /sbin/drbdadm
> fence-peer vDrbd exit code 1 (0x100)
> Mar 24 11:29:28 server4 kernel: drbd vDrbd: fence-peer helper broken,
> returned 1
> Mar 24 11:32:01 server4 kernel: INFO: task kworker/8:1H:822 blocked for
> more than 120 seconds.
> Mar 24 11:32:01 server4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> Mar 24 11:32:01 server4 kernel: kworker/8:1H D ffff880473796c18 0
> 822 2 0x00000080
> Mar 24 11:32:01 server4 kernel: Workqueue: glock_workqueue glock_work_func
> [gfs2]
> Mar 24 11:32:01 server4 kernel: ffff88027674bb10 0000000000000046
> ffff8802736e9f60 ffff88027674bfd8
> Mar 24 11:32:01 server4 kernel: ffff88027674bfd8 ffff88027674bfd8
> ffff8802736e9f60 ffff8804757ef808
> Mar 24 11:32:01 server4 kernel: 0000000000000000 ffff8804757efa28
> ffff8804757ef800 ffff880473796c18
> Mar 24 11:32:01 server4 kernel: Call Trace:
> Mar 24 11:32:01 server4 kernel: [<ffffffff8168bbb9>] schedule+0x29/0x70
> Mar 24 11:32:01 server4 kernel: [<ffffffffa0714ce4>]
> drbd_make_request+0x2a4/0x380 [drbd]
> Mar 24 11:32:01 server4 kernel: [<ffffffff812e0000>] ?
> aes_decrypt+0x260/0xe10
> Mar 24 11:32:01 server4 kernel: [<ffffffff810b17d0>] ?
> wake_up_atomic_t+0x30/0x30
> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee6f9>]
> generic_make_request+0x109/0x1e0
> Mar 24 11:32:01 server4 kernel: [<ffffffff812ee841>] submit_bio+0x71/0x150
> Mar 24 11:32:01 server4 kernel: [<ffffffffa063ee11>]
> gfs2_meta_read+0x121/0x2a0 [gfs2]
> Mar 24 11:32:01 server4 kernel: [<ffffffffa063f392>]
> gfs2_meta_indirect_buffer+0x62/0x150 [gfs2]
> Mar 24 11:32:01 server4 kernel: [<ffffffff810d2422>] ?
> load_balance+0x192/0x990
>
> 7) After server7 is UP again, the Pacemaker cluster is started, DRBD is
> started, and the logical volume is activated; only after that does the
> DRBD mount directory (/backup) become available on server4 and the VM
> resume. So from the moment server7 goes down until it is completely UP,
> the VM on server4 hangs.
>
>
> Can anyone help with how to avoid the surviving node hanging when the
> other node crashes?
>
>
> Attaching DRBD config file.
>
>
> --Raman
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cib.xml
Type: text/xml
Size: 7477 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20170324/d5176a52/attachment.bin>
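The "fence-peer helper broken, returned 1" and "could not determine the master id" messages point at the DRBD fencing hook rather than at GFS2 itself: crm-fence-peer.sh could not place a fencing constraint, so GFS2/DLM froze I/O waiting for a fence that never completed. A minimal sketch of the relevant DRBD 8.4 resource sections is below; the resource name vDrbd comes from the logs, everything else is illustrative and must be adapted to the actual config attached to the original post.

```
# /etc/drbd.d/vDrbd.res -- sketch only, not the poster's actual file
resource vDrbd {
  net {
    # required for GFS2 dual-primary operation
    allow-two-primaries yes;
  }
  disk {
    # freeze I/O on connection loss until the peer is confirmed fenced
    fencing resource-and-stonith;
  }
  handlers {
    # stock Pacemaker integration helpers shipped with DRBD 8.4
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

Note that crm-fence-peer.sh expects the DRBD resource to be managed by Pacemaker as a master/slave (ocf:linbit:drbd) resource so it can locate the master id; the warning in the log is consistent with DRBD being started outside Pacemaker's control.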
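Separately, a two-node DLM/GFS2 cluster normally needs its quorum behavior set so that resources freeze (rather than stop) on quorum loss, with corosync's two-node votequorum handling enabled. A sketch of the usual settings, assuming the CentOS 7 pcs/corosync 2.4 stack from the original post:

```
# Pacemaker property: freeze GFS2/DLM activity on quorum loss instead of
# stopping resources (sketch -- verify against the attached cib.xml)
pcs property set no-quorum-policy=freeze

# /etc/corosync/corosync.conf quorum section for a 2-node cluster:
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

With resource-and-stonith fencing working, the surviving node should unfreeze as soon as STONITH confirms the peer is dead, instead of hanging until the peer rejoins.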