[DRBD-user] [drbd 9] system hang with huge number of ASSERTION failure in dmesg

David Lee live4thee at gmail.com
Fri Jun 9 17:39:05 CEST 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi,

I am experimenting with DRBD dual-primary with OCFS2, plus DRBD client
nodes, in the hope that every node can access the storage in a unified
way.
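Roughly, the bring-up I am attempting looks like this (a sketch, not a
verbatim transcript; the mount-point name is illustrative, and the full
resource configuration is at the end of this mail):

----<sketch begins>----
# On the two storage nodes (node-id 0 and 1):
drbdadm create-md r0    # initialize the internal metadata
drbdadm up r0           # attach the backing disk, start all connections
drbdadm primary r0      # both storage nodes go primary
                        # (initial full sync / --force on the first
                        # node omitted for brevity)

# On the two DRBD client nodes (node-id 2 and 3, "disk none"):
drbdadm up r0           # no local disk; all I/O goes to the peers

# OCFS2 would then be mounted on every node, but the assertions
# below appear *before* this step is ever reached:
mount -t ocfs2 /dev/drbd100 /mnt/shared
----<sketch ends>----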
But I got a kernel call trace and a huge number of ASSERTION failures
(*before* OCFS2 was mounted):

----<paste begins>----
[11160.192091] INFO: task drbdsetup:19442 blocked for more than 120 seconds.
[11160.192096]       Tainted: G           OE   4.1.12-37.2.2.el7uek.x86_64 #2
[11160.192097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11160.192099] drbdsetup       D ffff88013fd17840     0 19442      1 0x00000084
[11160.192108]  ffff8800addef8c8 0000000000000082 ffff88013a3d3800 ffff8800369eb800
[11160.192111]  ffff8800addef938 ffff8800addf0000 ffff8800adb192c0 7fffffffffffffff
[11160.192113]  ffff8800369eb800 0000000000000297 ffff8800addef8e8 ffffffff81712947
[11160.192116] Call Trace:
[11160.192128]  [<ffffffff81712947>] schedule+0x37/0x90
[11160.192131]  [<ffffffff8171596c>] schedule_timeout+0x20c/0x280
[11160.192134]  [<ffffffff817158b6>] ? schedule_timeout+0x156/0x280
[11160.192148]  [<ffffffffa05c2695>] ? drbd_destroy_path+0x15/0x20 [drbd]
[11160.192152]  [<ffffffff817134b4>] wait_for_completion+0x134/0x190
[11160.192157]  [<ffffffff810b1d90>] ? wake_up_state+0x20/0x20
[11160.192165]  [<ffffffffa05c4d51>] _drbd_thread_stop+0xc1/0x110 [drbd]
[11160.192173]  [<ffffffffa05dd84c>] del_connection+0x3c/0x140 [drbd]
[11160.192179]  [<ffffffffa05e0bd3>] drbd_adm_down+0xc3/0x2c0 [drbd]
[11160.192184]  [<ffffffff8162886d>] genl_family_rcv_msg+0x1cd/0x400
[11160.192186]  [<ffffffff81628aa0>] ? genl_family_rcv_msg+0x400/0x400
[11160.192188]  [<ffffffff81628b31>] genl_rcv_msg+0x91/0xd0
[11160.192190]  [<ffffffff81627901>] netlink_rcv_skb+0xc1/0xe0
[11160.192192]  [<ffffffff81627fec>] genl_rcv+0x2c/0x40
[11160.192193]  [<ffffffff81626f86>] netlink_unicast+0x106/0x210
[11160.192195]  [<ffffffff816274c4>] netlink_sendmsg+0x434/0x690
[11160.192199]  [<ffffffff815d66ed>] sock_sendmsg+0x3d/0x50
[11160.192201]  [<ffffffff815d6785>] sock_write_iter+0x85/0xf0
[11160.192205]  [<ffffffff81209f6e>] __vfs_write+0xce/0x120
[11160.192207]  [<ffffffff8120a619>] vfs_write+0xa9/0x1b0
[11160.192210]  [<ffffffff8102587c>] ? do_audit_syscall_entry+0x6c/0x70
[11160.192213]  [<ffffffff8120b505>] SyS_write+0x55/0xd0
[11160.192215]  [<ffffffff81716aee>] system_call_fastpath+0x12/0x71
[11163.573075] __bm_op: 84153300 callbacks suppressed
[11163.573075] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in __bm_op
[10968.421046] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in __bm_op
[10968.421046] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in __bm_op
[10968.421046] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in __bm_op
[10973.403026] __bm_op: 84588466 callbacks suppressed
[10973.403026] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in __bm_op
[10973.403026] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in __bm_op
[10973.403026] drbd r0/0 drbd100: ASSERTION bitmap->bm_pages FAILED in __bm_op
----<paste ends>----

A 'grep -c' over the kernel log shows tens of thousands of the ASSERTION
errors shown above.
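For reference, the count came from something along these lines:

----<sketch begins>----
# count the bitmap assertion lines in the kernel log
dmesg | grep -c 'ASSERTION bitmap->bm_pages FAILED'
----<sketch ends>----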

The call trace happened on a DRBD client node, and that node then
rebooted automatically.

Any insights?

Thanks in advance.

# cat /proc/drbd
version: 9.0.7-1 (api:2/proto:86-112)

My DRBD resource configuration:
resource r0 {
        handlers {
                split-brain "/usr/lib64/drbd/notify-split-brain.sh root";
        }
        startup {
                become-primary-on both;
        }
        connection-mesh {
                hosts 10-0-149-20 10-0-147-191 10-0-218-14 10-0-183-69;
        }
        on 10-0-149-20 {
                node-id   0;
                address ipv4 10.0.149.20:7789;
                volume 0 {
                        device minor 100;
                        disk   /dev/disk/by-id/wwn-0x000f5ab58042677f;
                        meta-disk internal;
                }
        }
        on 10-0-147-191 {
                node-id   1;
                address ipv4 10.0.147.191:7789;
                volume 0 {
                        device minor 100;
                        disk   /dev/disk/by-id/wwn-0x000f5ab58042677f;
                        meta-disk internal;
                }
        }
        # DRBD client
        on 10-0-218-14 {
                node-id 2;
                address ipv4 10.0.218.14:7789;
                volume 0 {
                        device minor 100;
                        disk none;
                        meta-disk internal;
                }
        }
        # DRBD client
        on 10-0-183-69 {
                node-id 3;
                address ipv4 10.0.183.69:7789;
                volume 0 {
                        device minor 100;
                        disk none;
                        meta-disk internal;
                }
        }
        net {
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                fencing resource-and-stonith;
                protocol C;
                allow-two-primaries yes;
                sndbuf-size 0;
        }
}
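If more details would help, I can collect the state on the client node
with the usual drbd-utils status commands, e.g.:

----<sketch begins>----
drbdadm status r0                            # per-peer connection/disk state
drbdsetup status --verbose --statistics r0   # detailed counters
drbdadm dump r0                              # the configuration as parsed
----<sketch ends>----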


-- 
Thanks,
Li Qun