[DRBD-user] [drbd 9] system hang with huge number of ASSERTION failure in dmesg

Lars Ellenberg lars.ellenberg at linbit.com
Mon Jun 12 11:45:24 CEST 2017



On Fri, Jun 09, 2017 at 11:39:05PM +0800, David Lee wrote:
> Hi,
> 
> I am experimenting with DRBD dual-primary with OCFS2, and with DRBD client
> as well, hoping that every node can access the storage in a unified way.
> But I got a kernel call trace and a huge number of ASSERTION failures
> (*before* OCFS2 is mounted):
> 
> ----<paste begins>----
> [11160.192091] INFO: task drbdsetup:19442 blocked for more than 120 seconds.
> [11160.192096]       Tainted: G           OE   4.1.12-37.2.2.el7uek.x86_64 #2
> [11160.192097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11160.192099] drbdsetup       D ffff88013fd17840     0 19442      1 0x00000084
> [11160.192108]  ffff8800addef8c8 0000000000000082 ffff88013a3d3800 ffff8800369eb800
> [11160.192111]  ffff8800addef938 ffff8800addf0000 ffff8800adb192c0 7fffffffffffffff
> [11160.192113]  ffff8800369eb800 0000000000000297 ffff8800addef8e8 ffffffff81712947
> [11160.192116] Call Trace:
> [11160.192128]  [<ffffffff81712947>] schedule+0x37/0x90
> [11160.192131]  [<ffffffff8171596c>] schedule_timeout+0x20c/0x280
> [11160.192134]  [<ffffffff817158b6>] ? schedule_timeout+0x156/0x280
> [11160.192148]  [<ffffffffa05c2695>] ? drbd_destroy_path+0x15/0x20 [drbd]
> [11160.192152]  [<ffffffff817134b4>] wait_for_completion+0x134/0x190
> [11160.192157]  [<ffffffff810b1d90>] ? wake_up_state+0x20/0x20
> [11160.192165]  [<ffffffffa05c4d51>] _drbd_thread_stop+0xc1/0x110 [drbd]
> [11160.192173]  [<ffffffffa05dd84c>] del_connection+0x3c/0x140 [drbd]
> [11160.192179]  [<ffffffffa05e0bd3>] drbd_adm_down+0xc3/0x2c0 [drbd]
> [11160.192184]  [<ffffffff8162886d>] genl_family_rcv_msg+0x1cd/0x400

> [11163.573075] __bm_op: 84153300 callbacks suppressed
> [11163.573075] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in


The assertion is that the bitmap pages are supposed to be allocated
when we do bitmap operations.

Apparently in this case, they are not.

So either the bitmap pages have never been allocated, and our error
handling for that case sucks, or they are freed too early, while
"something" still wants to flip or count some bits.  But I would have
expected someone to notice something like that before. Strange.
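
To make the two possibilities concrete, here is a minimal user-space
sketch in C. It is not DRBD code: the struct and the toy_bm_* names are
hypothetical, and a single heap buffer stands in for the real bitmap
pages. It only illustrates the invariant the assertion checks (the
backing storage must exist whenever bits are flipped or counted) and what
a checked error return could look like both before allocation and after
a free.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a bitmap; not DRBD's real struct. */
struct toy_bitmap {
    unsigned long *pages;   /* NULL while unallocated or after free */
    size_t nbits;           /* number of valid bits, 0 if no storage */
};

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Allocate zeroed backing storage. Returns 0 on success, -ENOMEM on failure. */
int toy_bm_alloc(struct toy_bitmap *bm, size_t nbits)
{
    assert(bm != NULL);     /* a NULL handle is a caller bug */
    size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
    bm->pages = calloc(words ? words : 1, sizeof(unsigned long));
    bm->nbits = bm->pages ? nbits : 0;
    return bm->pages ? 0 : -ENOMEM;
}

/* Release the storage and reset the handle so later use is detectable. */
void toy_bm_free(struct toy_bitmap *bm)
{
    assert(bm != NULL);
    free(bm->pages);
    bm->pages = NULL;
    bm->nbits = 0;
}

/* Set one bit. Fails with -EINVAL if the storage is missing, which covers
 * both "never allocated" and "freed too early". */
int toy_bm_set_bit(struct toy_bitmap *bm, size_t bit)
{
    assert(bm != NULL);
    if (bm->pages == NULL)
        return -EINVAL;
    if (bit >= bm->nbits)
        return -ERANGE;
    bm->pages[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
    return 0;
}

/* Count set bits. Returns the count, or -EINVAL if the storage is missing. */
long toy_bm_count(const struct toy_bitmap *bm)
{
    assert(bm != NULL);
    if (bm->pages == NULL)
        return -EINVAL;
    long set = 0;
    for (size_t bit = 0; bit < bm->nbits; bit++)
        if (bm->pages[bit / BITS_PER_WORD] & (1UL << (bit % BITS_PER_WORD)))
            set++;
    return set;
}
```

In this sketch a caller that races with toy_bm_free() gets -EINVAL back
once instead of tripping the same check on every operation. Whether that
is the right way to handle it inside the module is a separate question.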

    Lars



