[DRBD-user] DRBD bug when an inactive resource is taken down

Lars Ellenberg lars.ellenberg at linbit.com
Wed May 3 14:12:30 CEST 2017


On Sat, Apr 29, 2017 at 02:28:38PM +0100, Trevor Hemsley wrote:
> Hi
> 
> For a number of years on both DRBD 8.3 and 8.4 I have been seeing a
> weird bug that I've never reported because it's too difficult to
> recreate reliably. Today I got something similar and have a stacktrace
> so thought I'd send it in in the hope that it might point in the right
> direction to getting this fixed.
> 
> The situation goes like this:
> 1. Set up a DRBD device and get it replicating.
> 2. Leave said device in Secondary/Secondary for days/weeks/months (not
>    sure how long it takes to happen).
> 3. At some future point, run `drbdadm down $resource` on one of the two
>    systems.
> 4. This now hangs doing nothing, forever or until you switch to the
>    other system and run `drbdadm down $resource` there as well, at which
>    point both go down and all is ok.

Most likely your TCP connection got stuck.
Is there some iptables or other firewall in between,
blocking ICMP, "unknown" RSTs, or the like?

Maybe we need some additional timeout somewhere to cope with such an
unresponsive "half" connection.  I would assume that this would
have resolved itself after a full TCP timeout
(whatever that may be on your box).
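
For a rough idea of what "a full TCP timeout" can amount to on Linux,
here is a minimal Python sketch. It is an estimate under stated
assumptions, not a description of what DRBD does: it assumes the stuck
connection has unacknowledged data in flight, that the retransmission
timeout starts at 200 ms, doubles on each retry, and is capped at 120 s,
and that the retry count is the net.ipv4.tcp_retries2 sysctl. The function
name tcp_timeout_seconds is made up for this illustration.

```python
# Rough estimate of how long Linux keeps retransmitting unacknowledged data
# on an established TCP connection before giving up.
# Assumptions (not from this thread): initial RTO of 200 ms, doubling on
# each retransmission, capped at 120 s, with net.ipv4.tcp_retries2 retries.

def tcp_timeout_seconds(retries2=15, rto_min=0.2, rto_max=120.0):
    """Sum the wait before each retransmission plus the final wait."""
    total = 0.0
    rto = rto_min
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2, rto_max)  # exponential backoff with a ceiling
    return total

if __name__ == "__main__":
    for retries in (5, 8, 15):
        print(f"tcp_retries2={retries}: ~{tcp_timeout_seconds(retries):.1f} s")
```

With the usual default of tcp_retries2=15 this comes to roughly 15
minutes. The real figure depends on the measured round-trip time, and an
idle connection with nothing left to retransmit would not hit this
timeout at all.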

> In an attempt to stop this happening on one pair of systems where I have
> a permanently inactive (i.e. secondary/secondary) resource waiting for
> someone to buy me a Windows license so I can install Windows on a VM
> using that DRBD resource, I left it in primary/secondary even though
> nothing is using it. Today I went to shutdown -r the primary system and
> it got as far as stopping DRBD on the system with this resource in
> primary and it hung and I got the following stacktrace in the logs:
> 
> Apr 29 12:57:19 xen23 kernel: block drbd8: role( Primary -> Secondary )
> Apr 29 12:57:19 xen23 kernel: block drbd8: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Apr 29 13:01:05 xen23 kernel: INFO: task drbdsetup-84:81150 blocked for more than 120 seconds.
> Apr 29 13:01:05 xen23 kernel:      Not tainted 2.6.32-642.13.1.el6.x86_64 #1
> Apr 29 13:01:05 xen23 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 29 13:01:05 xen23 kernel: drbdsetup-84  D 0000000000000003     0 81150  81128 0x00000080
> Apr 29 13:01:05 xen23 kernel: ffff880c6195b8a8 0000000000000082 ffff880c6195b828 ffffffffa06e6c18
> Apr 29 13:01:05 xen23 kernel: ffff880c6042f8d0 00000000000110aa ffff880c67f6e000 0000000000000000
> Apr 29 13:01:05 xen23 kernel: 0000000000000001 00000000000110aa ffff880c6723fad8 ffff880c6195bfd8
> Apr 29 13:01:05 xen23 kernel: Call Trace:
> Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e6c18>] ? is_valid_state+0x98/0x500 [drbd]
> Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e89a7>] _conn_request_state+0x8e7/0xb00 [drbd]
> Apr 29 13:01:05 xen23 kernel: [<ffffffffa06d1942>] ? drbd_send_command+0x42/0x50 [drbd]
> Apr 29 13:01:05 xen23 kernel: [<ffffffff810a68a0>] ? autoremove_wake_function+0x0/0x40
> Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e8c0d>] conn_request_state+0x4d/0x80 [drbd]
> Apr 29 13:01:05 xen23 kernel: [<ffffffffa06d7f12>] conn_try_disconnect+0x22/0x120 [drbd]
> Apr 29 13:01:05 xen23 kernel: [<ffffffffa06e3366>] drbd_adm_down+0x106/0x240 [drbd]

> As usual, running drbdadm down $resource on the other side promptly
> brought this one back to life and the shutdown continued and it rebooted
> OK afterwards.
> 
> This was running a slightly out of date CentOS 6 kernel and ELRepo's
> kmod-drbd84-8.4.9-1.el6.elrepo.x86_64 which should be plain DRBD 8.4.9.
> It was being shut down to bring it back up on 6.9 with the latest kernel.
> 
> Oh, and despite the system's name being xen23, it's not running xen.
> It's a plain CentOS 6.8 system running kvm and several VMs. Hasn't been
> xen in a long time.
> 
> Not sure if this might help to diagnose this slightly weird issue.

Thanks,

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed


