[DRBD-user] drbdsetup (again) stuck on blocking I/O

Fri Jun 22 14:39:28 CEST 2018

Hello,
DRBD9 is really a great piece of software, but from time to time we end up stuck
in a situation with no solution other than a reboot.

For example, right now, when we run:
# drbdadm status
it displays some resources, then hangs on a specific resource and finally returns
"Command 'drbdsetup status' did not terminate within 5 seconds".

The drbdsetup process then stays stuck. drbdmanage is completely out of order on
both nodes (see below).

Running drbdsetup status under strace proceeds until the problematic resource and then displays:
> write(3, "4\0\0\0\34\0\1\3\227\251,[\330f\0\0\37\2\0\0\377\377\377\377\0\0\0\0\30\0\2\0"..., 52) = 52
> poll([{fd=1, events=POLLHUP}, {fd=3, events=POLLIN}], 2, 120000) = 1 ([{fd=3, revents=POLLIN}])
> poll([{fd=3, events=POLLIN}], 1, -1)    = 1 ([{fd=3, revents=POLLIN}])
> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=720, type=0x1c /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\37\2\0\0\244\0\0\0e\0\0\0 \0\2\0\10\0\1@\0\0\0\0\22\0\2@vm-1"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 720
> poll([{fd=3, events=POLLIN}], 1, -1)    = 1 ([{fd=3, revents=POLLIN}])
> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=720, type=0x1c /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\37\2\0\0\244\0\0\0e\0\0\0 \0\2\0\10\0\1@\0\0\0\0\22\0\2@vm-1"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 720
> poll([{fd=1, events=POLLHUP}, {fd=3, events=POLLIN}], 2, 120000) = 1 ([{fd=3, revents=POLLIN}])
> poll([{fd=3, events=POLLIN}], 1, -1)    = 1 ([{fd=3, revents=POLLIN}])
> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\0\0\0\0"}, {{len=164, type=0x65 /* NLMSG_??? */, flags=0, seq=131104, pid=65544}, "\0\0\0\0\22\0\2\0vm-145-disk-1\0\0\0\330\0\3\0(\0\1\0"...}, {{len=6433, type=0x5 /* NLMSG_??? */, flags=NLM_F_DUMP_INTR, seq=0, pid=1114117}, "\1\0\0\0\5\0\22\0\1\0\0\0\5\0\23\0\0\0\0\0\10\0\24\0\0\0\0\0\10\0\25\0"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 20
> poll([{fd=3, events=POLLIN}], 1, -1)    = 1 ([{fd=3, revents=POLLIN}])
> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\0\0\0\0"}, {{len=164, type=0x65 /* NLMSG_??? */, flags=0, seq=131104, pid=65544}, "\0\0\0\0\22\0\2\0vm-145-disk-1\0\0\0\330\0\3\0(\0\1\0"...}, {{len=6433, type=0x5 /* NLMSG_??? */, flags=NLM_F_DUMP_INTR, seq=0, pid=1114117}, "\1\0\0\0\5\0\22\0\1\0\0\0\5\0\23\0\0\0\0\0\10\0\24\0\0\0\0\0\10\0\25\0"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
> write(3, "4\0\0\0\34\0\1\3\230\251,[\330f\0\0 \2\0\0\377\377\377\377\0\0\0\0\30\0\2\0"..., 52

drbdadm runs fine on node 2.

I don't exactly see how to interpret this.
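
For what it's worth, the start of each write() payload can at least be decoded
as a netlink header followed by a generic netlink header. The following is only
a minimal sketch: it assumes a little-endian host, and the function name and
dictionary keys are made up for illustration. Only the standard nlmsghdr and
genlmsghdr layouts are decoded; the DRBD-specific part after them is left alone.

```python
import struct

# Standard netlink flag values from <linux/netlink.h>.
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300  # NLM_F_ROOT | NLM_F_MATCH

# Leading bytes of the two write(3, ...) payloads, copied from the strace
# output above (strace itself truncated the rest of each 52-byte message).
FIRST_WRITE = b"4\0\0\0\34\0\1\3\227\251,[\330f\0\0\37\2\0\0"
SECOND_WRITE = b"4\0\0\0\34\0\1\3\230\251,[\330f\0\0 \2\0\0"


def decode_netlink_request(payload):
    """Decode nlmsghdr (16 bytes) and genlmsghdr (4 bytes); hypothetical helper.

    Assumes little-endian byte order, as on x86.
    """
    length, msg_type, flags, seq, pid = struct.unpack_from("<IHHII", payload, 0)
    cmd, version, _reserved = struct.unpack_from("<BBH", payload, 16)
    return {
        "len": length,
        "type": msg_type,
        "is_request": bool(flags & NLM_F_REQUEST),
        "is_dump": (flags & NLM_F_DUMP) == NLM_F_DUMP,
        "seq": seq,
        "pid": pid,
        "genl_cmd": cmd,
        "genl_version": version,
    }


first = decode_netlink_request(FIRST_WRITE)
second = decode_netlink_request(SECOND_WRITE)
print(first)
print(second)
```

Decoded this way, the first request carries seq=1529653655 and pid=26328, which
matches the recvmsg() replies in the trace. The second request, the one that
never completes, is a dump request with the next sequence number and generic
netlink command 32 (the first one was command 31). Which DRBD operation that
number maps to is not something I have checked.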

Finally, I can see that node 1 is keeping the drbdctrl resource as primary:
something must have gone wrong on this node.

drbdtop actually runs correctly and shows, for the problematic resource:
volume 0 (/dev/drbd164): UpToDate(normal disk state) Blocked: upper
and:
Connection to node2(Unknown): NetworkFailure(lost connection to node2)

How can I debug such a situation without rebooting node1?

This is not the first time we've encountered such a situation, and rebooting each
time is really a pain: we're talking about highly available clusters.

Is there any other info I can provide?
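
For instance, the kernel stacks of the blocked tasks could be collected. The
sketch below is only one possible way to do that, under some assumptions: it
relies on the Linux procfs files /proc/<pid>/stat and /proc/<pid>/stack (the
latter normally needs root and a kernel built with stack tracing), it only
looks at main threads, and the helper names are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    pid_part, rest = line.split("(", 1)
    comm, after = rest.rsplit(")", 1)
    state = after.split()[0]
    return int(pid_part), comm, state


def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm) for processes in uninterruptible sleep (state D)."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs mounted at this path
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or unexpected format
        if state == "D":
            found.append((pid, comm))
    return found


def kernel_stack(pid, proc_root="/proc"):
    """Read the kernel stack of a task, or explain why it is unavailable."""
    try:
        with open(os.path.join(proc_root, str(pid), "stack")) as handle:
            return handle.read()
    except OSError as exc:
        return "<unavailable: %s>" % exc


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print("%d %s" % (pid, comm))
        print(kernel_stack(pid))
```

Run as root on node 1, this would print each D-state process (presumably
including the stuck drbdsetup) together with the kernel function it is
sleeping in, which I can post here if that is useful.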

Thanks a lot!

Best regards,
Julien Escario

P.S.: drbdmanage output

On node 1 (current drbdctrl primary):

# drbdmanage r
ERROR:dbus.proxies:Introspect error on :1.53:/interface:
dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not
receive a reply. Possible causes include: the remote application did not send a
reply, the message bus security policy blocked the reply, the reply timeout
expired, or the network connection was broken.

Error: Cannot connect to the drbdmanaged process using DBus
The DBus subsystem returned the following error description:
org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes
include: the remote application did not send a reply, the message bus security
policy blocked the reply, the reply timeout expired, or the network connection
was broken.

On node 2 (drbdctrl shown as secondary):
# drbdmanage r
Waiting for server: ...............
Error: Satellite could not request control volume from leader
No resources defined