[DRBD-user] DRBD bug when an inactive resource is taken down

Sat Apr 29 15:28:38 CEST 2017

Hi

For a number of years, on both DRBD 8.3 and 8.4, I have been seeing a
weird bug that I've never reported because it's too difficult to
reproduce reliably. Today I hit something similar and got a stack trace,
so I thought I'd send it in, in the hope that it points in the right
direction for getting this fixed.

The situation goes like this:

1. Set up a DRBD device and get it replicating.
2. Leave said device in Secondary/Secondary for days/weeks/months (not
   sure how long it takes to happen).
3. At some future point, run `drbdadm down $resource` on one of the two
   systems.
4. This now hangs doing nothing forever or until you switch to the other
   system and run `drbdadm down $resource` there as well, at which point
   both go down and all is OK.
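
For anyone who wants to script around this, the workaround described
above (down locally, and if that does not return, down on the peer too)
could be sketched roughly as follows. This is only an illustration: the
function name, the DRBDADM and PEER_SH variables, the use of ssh to reach
the peer, and the 30-second default grace period are all assumptions,
not anything DRBD provides.

```shell
#!/bin/sh
# Sketch only: the function name, the DRBDADM/PEER_SH variables and the
# 30-second default grace period are made up for illustration.
DRBDADM=${DRBDADM:-drbdadm}   # local admin command
PEER_SH=${PEER_SH:-ssh}       # how to run a command on the peer

down_with_peer_fallback() {
    res=$1
    peer=$2
    grace=${3:-30}

    # Start the local down in the background so this script cannot get
    # stuck behind it.
    $DRBDADM down "$res" &
    pid=$!

    # Poll instead of killing: a task blocked in state D ignores signals.
    waited=0
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Still running after the grace period: take the peer down as well.
    if kill -0 "$pid" 2>/dev/null; then
        echo "down of $res still running after ${grace}s, also running it on $peer" >&2
        $PEER_SH "$peer" drbdadm down "$res"
    fi

    # Blocks until the local command finishes and returns its exit status.
    wait "$pid"
}
```

It would be called as `down_with_peer_fallback $resource $peer_hostname`.
The final `wait` will still block for as long as the local command does,
so this only automates the manual workaround; it does not fix the hang.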

In an attempt to stop this happening on one pair of systems, I left a
resource in Primary/Secondary even though nothing is using it. That
resource is permanently inactive (i.e. it would otherwise sit in
Secondary/Secondary) while it waits for someone to buy me a Windows
license so I can install Windows on a VM using it. Today I went to
`shutdown -r` the primary system. It got as far as stopping DRBD, hung
on this resource, and I got the following stack trace in the logs:

Apr 29 12:57:19 xen23 kernel: block drbd8: role( Primary -> Secondary )
Apr 29 12:57:19 xen23 kernel: block drbd8: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Apr 29 13:01:05 xen23 kernel: INFO: task drbdsetup-84:81150 blocked for more than 120 seconds.
Apr 29 13:01:05 xen23 kernel:      Not tainted 2.6.32-642.13.1.el6.x86_64 #1
Apr 29 13:01:05 xen23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 13:01:05 xen23 kernel: drbdsetup-84  D 0000000000000003     0 81150  81128 0x00000080
Apr 29 13:01:05 xen23 kernel: ffff880c6195b8a8 0000000000000082 ffff880c6195b828 ffffffffa06e6c18
Apr 29 13:01:05 xen23 kernel: ffff880c6042f8d0 00000000000110aa ffff880c67f6e000 0000000000000000
Apr 29 13:01:05 xen23 kernel: 0000000000000001 00000000000110aa ffff880c6723fad8 ffff880c6195bfd8
Apr 29 13:01:05 xen23 kernel: Call Trace:
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e6c18>] ? is_valid_state+0x98/0x500 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e89a7>] _conn_request_state+0x8e7/0xb00 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06d1942>] ? drbd_send_command+0x42/0x50 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e8c0d>] conn_request_state+0x4d/0x80 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06d7f12>] conn_try_disconnect+0x22/0x120 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e3366>] drbd_adm_down+0x106/0x240 [drbd]
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a8b91>] genl_rcv_msg+0x271/0x340
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a8920>] ? genl_rcv_msg+0x0/0x340
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a7789>] netlink_rcv_skb+0xa9/0xd0
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a86dc>] genl_rcv+0x2c/0x40
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a73af>] netlink_unicast+0x2df/0x330
Apr 29 13:01:05 xen23 kernel: [<ffffffff814a7e13>] netlink_sendmsg+0x2a3/0x3e0
Apr 29 13:01:05 xen23 kernel: [<ffffffff81466c7b>] sock_aio_write+0x19b/0x1c0
Apr 29 13:01:05 xen23 kernel: [<ffffffff8119996a>] do_sync_write+0xfa/0x140
Apr 29 13:01:05 xen23 kernel: [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
Apr 29 13:01:05 xen23 kernel: [<ffffffff81247d9f>] ? selinux_file_permission+0xbf/0x150
Apr 29 13:01:05 xen23 kernel: [<ffffffff81014b19>] ? read_tsc+0x9/0x10
Apr 29 13:01:05 xen23 kernel: [<ffffffff8123aa66>] ? security_file_permission+0x16/0x20
Apr 29 13:01:05 xen23 kernel: [<ffffffff81199d34>] vfs_write+0x184/0x1a0
Apr 29 13:01:05 xen23 kernel: [<ffffffff8119b156>] ? fget_light_pos+0x16/0x50
Apr 29 13:01:05 xen23 kernel: [<ffffffff8119a7a1>] sys_write+0x51/0xb0
Apr 29 13:01:05 xen23 kernel: [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
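
The kernel only prints a report like this after the task has been
blocked for 120 seconds. To spot a stuck drbdsetup sooner, something
along these lines might help. It is only a sketch: both function names
are hypothetical, and it assumes a procps-style `ps` that accepts the
`wchan:32` width suffix.

```shell
#!/bin/sh
# Sketch only: both function names are made up for illustration.

# Keep only lines whose second field (process state) starts with D,
# i.e. tasks in uninterruptible sleep. The ps header line has STAT in
# that field, so it is dropped as well.
filter_blocked() {
    awk '$2 ~ /^D/'
}

# List PID, state, kernel wait channel and command name of blocked tasks.
list_blocked_tasks() {
    ps -eo pid,stat,wchan:32,comm | filter_blocked
}
```

Running `list_blocked_tasks` while a `drbdadm down` appears stuck should
show whether drbdsetup is sitting in state D, as it was in the trace
above.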

As usual, running `drbdadm down $resource` on the other side promptly
brought this one back to life; the shutdown continued and the system
rebooted OK afterwards.

This was running a slightly out of date CentOS 6 kernel and ELRepo's
kmod-drbd84-8.4.9-1.el6.elrepo.x86_64 which should be plain DRBD 8.4.9.
It was being shut down to bring it back up on 6.9 with the latest kernel.

Oh, and despite the system's name being xen23, it's not running Xen.
It's a plain CentOS 6.8 system running KVM and several VMs. It hasn't
been Xen in a long time.

Not sure if this might help to diagnose this slightly weird issue.

Trevor