Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
On Fri, Jun 09, 2017 at 11:39:05PM +0800, David Lee wrote:
> Hi,
>
> I am experimenting with DRBD dual-primary with OCFS 2, and DRBD client as
> well.
> With the hope that every node can access the storage in an unified way.
> But I got a
> kernel call trace and huge number of ASSERTION failure (*before* OCFS2 is
> mounted):
>
> ----<paste begins>----
> [11160.192091] INFO: task drbdsetup:19442 blocked for more than 120 seconds.
> [11160.192096] Tainted: G OE 4.1.12-37.2.2.el7uek.x86_64 #2
> [11160.192097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [11160.192099] drbdsetup D ffff88013fd17840 0 19442 1 0x00000084
> [11160.192108] ffff8800addef8c8 0000000000000082 ffff88013a3d3800 ffff8800369eb800
> [11160.192111] ffff8800addef938 ffff8800addf0000 ffff8800adb192c0 7fffffffffffffff
> [11160.192113] ffff8800369eb800 0000000000000297 ffff8800addef8e8 ffffffff81712947
> [11160.192116] Call Trace:
> [11160.192128] [<ffffffff81712947>] schedule+0x37/0x90
> [11160.192131] [<ffffffff8171596c>] schedule_timeout+0x20c/0x280
> [11160.192134] [<ffffffff817158b6>] ? schedule_timeout+0x156/0x280
> [11160.192148] [<ffffffffa05c2695>] ? drbd_destroy_path+0x15/0x20 [drbd]
> [11160.192152] [<ffffffff817134b4>] wait_for_completion+0x134/0x190
> [11160.192157] [<ffffffff810b1d90>] ? wake_up_state+0x20/0x20
> [11160.192165] [<ffffffffa05c4d51>] _drbd_thread_stop+0xc1/0x110 [drbd]
> [11160.192173] [<ffffffffa05dd84c>] del_connection+0x3c/0x140 [drbd]
> [11160.192179] [<ffffffffa05e0bd3>] drbd_adm_down+0xc3/0x2c0 [drbd]
> [11160.192184] [<ffffffff8162886d>] genl_family_rcv_msg+0x1cd/0x400
> [11163.573075] __bm_op: 84153300 callbacks suppressed
> [11163.573075] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in

The assertion is that the bitmap pages are supposed to be allocated
when we do bitmap operations. Apparently in this case, they are not.
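
To illustrate the kind of check involved, here is a minimal userspace
sketch in C. It is not the DRBD code: every name in it (sketch_bitmap,
SKETCH_EXPECT, sketch_set_bit, and so on) is hypothetical, and a plain
calloc'd word array stands in for the bitmap pages. It only shows the
general pattern of "log and bail out if the backing storage is missing",
and how both a never-allocated and an already-freed bitmap would trip
the same check.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bitmap whose storage is allocated separately. */
struct sketch_bitmap {
    unsigned long *pages; /* NULL before allocation and again after free */
    size_t nbits;
};

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Log and evaluate to 0 instead of crashing when the condition is false. */
#define SKETCH_EXPECT(cond) \
    ((cond) ? 1 : (fprintf(stderr, "ASSERTION %s FAILED in %s\n", #cond, __func__), 0))

/* Allocate zeroed storage for nbits bits; returns 0 on success, -1 on failure. */
int sketch_alloc(struct sketch_bitmap *bm, size_t nbits)
{
    assert(bm != NULL);
    size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
    bm->pages = calloc(words, sizeof(unsigned long));
    bm->nbits = bm->pages ? nbits : 0;
    return bm->pages ? 0 : -1;
}

/* Release the storage and reset the pointer so later operations can detect it. */
void sketch_free(struct sketch_bitmap *bm)
{
    assert(bm != NULL);
    free(bm->pages);
    bm->pages = NULL;
    bm->nbits = 0;
}

/* Set one bit; returns -1 if the storage is missing or the bit is out of range. */
int sketch_set_bit(struct sketch_bitmap *bm, size_t bit)
{
    assert(bm != NULL);
    if (!SKETCH_EXPECT(bm->pages != NULL))
        return -1;
    if (bit >= bm->nbits)
        return -1;
    bm->pages[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
    return 0;
}

/* Count set bits; reports 0 (after logging) if the storage is missing. */
size_t sketch_count_bits(const struct sketch_bitmap *bm)
{
    assert(bm != NULL);
    if (!SKETCH_EXPECT(bm->pages != NULL))
        return 0;
    size_t count = 0;
    for (size_t bit = 0; bit < bm->nbits; bit++)
        if (bm->pages[bit / BITS_PER_WORD] & (1UL << (bit % BITS_PER_WORD)))
            count++;
    return count;
}
```

In this sketch, calling sketch_set_bit() on a zero-initialized
struct sketch_bitmap, or on one that has already gone through
sketch_free(), logs the failed expectation and returns -1 rather than
dereferencing a NULL pointer.
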

So either the bitmap pages have never been allocated, and our error
handling for that case sucks, or they are freed too early, while
"something" still wants to flip or count some bits.

But I would have expected someone to notice something like that before.
Strange.

Lars