[DRBD-user] [drbd 9] system hang with huge number of ASSERTION failure in dmesg

Mon Jun 12 15:05:33 CEST 2017

On Mon, Jun 12, 2017 at 5:45 PM, Lars Ellenberg <lars.ellenberg at linbit.com>
wrote:

> On Fri, Jun 09, 2017 at 11:39:05PM +0800, David Lee wrote:
> > Hi,
> >
> > I am experimenting with DRBD dual-primary with OCFS2, and with DRBD
> > clients as well, in the hope that every node can access the storage
> > in a unified way.  But I got a kernel call trace and a huge number of
> > ASSERTION failures (*before* OCFS2 is mounted):
> >
> > ----<paste begins>----
> > [11160.192091] INFO: task drbdsetup:19442 blocked for more than 120 seconds.
> > [11160.192096]       Tainted: G           OE  4.1.12-37.2.2.el7uek.x86_64 #2
> > [11160.192097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [11160.192099] drbdsetup       D ffff88013fd17840     0 19442      1 0x00000084
> > [11160.192108]  ffff8800addef8c8 0000000000000082 ffff88013a3d3800 ffff8800369eb800
> > [11160.192111]  ffff8800addef938 ffff8800addf0000 ffff8800adb192c0 7fffffffffffffff
> > [11160.192113]  ffff8800369eb800 0000000000000297 ffff8800addef8e8 ffffffff81712947
> > [11160.192116] Call Trace:
> > [11160.192128]  [<ffffffff81712947>] schedule+0x37/0x90
> > [11160.192131]  [<ffffffff8171596c>] schedule_timeout+0x20c/0x280
> > [11160.192134]  [<ffffffff817158b6>] ? schedule_timeout+0x156/0x280
> > [11160.192148]  [<ffffffffa05c2695>] ? drbd_destroy_path+0x15/0x20 [drbd]
> > [11160.192152]  [<ffffffff817134b4>] wait_for_completion+0x134/0x190
> > [11160.192157]  [<ffffffff810b1d90>] ? wake_up_state+0x20/0x20
> > [11160.192165]  [<ffffffffa05c4d51>] _drbd_thread_stop+0xc1/0x110 [drbd]
> > [11160.192173]  [<ffffffffa05dd84c>] del_connection+0x3c/0x140 [drbd]
> > [11160.192179]  [<ffffffffa05e0bd3>] drbd_adm_down+0xc3/0x2c0 [drbd]
> > [11160.192184]  [<ffffffff8162886d>] genl_family_rcv_msg+0x1cd/0x400
>
> > [11163.573075] __bm_op: 84153300 callbacks suppressed
> > [11163.573075] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in
>
>
> The assertion is that the bitmap pages are supposed to be allocated
> when we do bitmap operations.
>
> Apparently in this case, they are not.
>
> So either the bitmap pages have never been allocated, and our error
> handling for that case sucks, or they are freed too early, while
> "something" still wants to flip or count some bits.  But I would have
> expected someone to notice something like that before. Strange.
>
>     Lars
>
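
The two possibilities Lars describes (bitmap pages never allocated, or
freed too early while something still flips or counts bits) can be
illustrated with a toy model.  This is only a minimal Python sketch, not
DRBD code: the class ToyBitmap and all of its methods are hypothetical
names made up for illustration; only the field name bm_pages is taken
from the assertion message in the log above.

```python
class ToyBitmap:
    """Toy stand-in for a bitmap whose backing pages may be absent.

    Hypothetical illustration only; this is not how DRBD is implemented.
    """

    def __init__(self, nbits):
        self.nbits = nbits
        self.bm_pages = None          # backing storage not allocated yet
        self.failed_assertions = 0    # how often an op found no pages

    def allocate(self):
        # One byte per 8 bits, all bits initially clear.
        self.bm_pages = bytearray((self.nbits + 7) // 8)

    def free(self):
        self.bm_pages = None

    def _expect_pages(self):
        # Mirrors the idea of "ASSERTION bitmap->bm_pages FAILED":
        # every bitmap operation expects the pages to exist.
        if self.bm_pages is None:
            self.failed_assertions += 1
            return False
        return True

    def set_bit(self, n):
        if not self._expect_pages():
            return False
        self.bm_pages[n // 8] |= 1 << (n % 8)
        return True

    def count_bits(self):
        if not self._expect_pages():
            return 0
        return sum(bin(byte).count("1") for byte in self.bm_pages)


# Case 1: pages were never allocated before the first bitmap operation.
never = ToyBitmap(64)
never.set_bit(3)

# Case 2: pages were freed while something still wants to count bits.
early = ToyBitmap(64)
early.allocate()
early.set_bit(3)
early.free()
early.count_bits()
```

In both cases the operation trips the same check, so the log message
alone does not tell the two apart.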

Thanks for your comments, Lars.

I found other interesting (weird) behavior with OCFS2 and DRBD clients, and
have moved in a different direction.  The interesting things are:

1. In a three-node OCFS2 cluster with dual-primary DRBD and one client node,
    the whole cluster fences (every node reboots) when the DRBD client node
    goes down.

2. If I add one more DRBD client node (with the drbd/o2cb/ocfs2 configs
    updated accordingly), both client nodes consistently fail to join, with
    mount.ocfs2 failing.

I've changed the experiment to get rid of OCFS2.  But if any help is needed
(for example, to verify some configuration), please let me know.

-- 
Thanks,
Li Qun