[DRBD-user] drbdsetup (again) stuck on blocking I/O

Julien Escario julien.escario at altinea.fr
Wed Jun 27 15:36:53 CEST 2018


Wow, my mails finally made it to the list... please disregard this one, it is
redundant with my thread from today.

Julien

On 22/06/2018 at 14:39, Julien Escario wrote:
> Hello, DRBD9 is really a great piece of software, but from time to time we
> end up stuck in a situation with no solution other than a reboot.
> 
> For example, right now, when we run "drbdadm status", it displays some
> resources, then hangs on a specific resource and finally returns: "Command
> 'drbdsetup status' did not terminate within 5 seconds".
> 
> And the drbdsetup process hangs. drbdmanage is completely out of order on
> both nodes (see below).
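> When drbdsetup wedges like this it is usually sitting in uninterruptible
> sleep (state D) inside the kernel. Before rebooting, it can help to confirm
> that and grab the kernel-side stack trace. A minimal sketch with standard
> tools (reading /proc/<pid>/stack needs root):

```shell
# List processes in uninterruptible sleep (state D), with the kernel
# symbol each one is waiting in (wchan)
ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'

# Dump the in-kernel stack of the newest drbdsetup process, if any
PID=$(pgrep -n drbdsetup || true)
if [ -n "$PID" ]; then cat "/proc/$PID/stack"; fi
```

> The wchan column alone often tells you whether it is stuck in netlink,
> in DRBD itself, or waiting on disk I/O.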
> 
> Running drbdsetup status under strace proceeds until the problematic
> resource and displays:
>> write(3, "4\0\0\0\34\0\1\3\227\251,[\330f\0\0\37\2\0\0\377\377\377\377\0\0\0\0\30\0\2\0"..., 52) = 52
>> poll([{fd=1, events=POLLHUP}, {fd=3, events=POLLIN}], 2, 120000) = 1 ([{fd=3, revents=POLLIN}])
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=720, type=0x1c /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\37\2\0\0\244\0\0\0e\0\0\0 \0\2\0\10\0\1@\0\0\0\0\22\0\2@vm-1"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 720
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=720, type=0x1c /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\37\2\0\0\244\0\0\0e\0\0\0 \0\2\0\10\0\1@\0\0\0\0\22\0\2@vm-1"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 720
>> poll([{fd=1, events=POLLHUP}, {fd=3, events=POLLIN}], 2, 120000) = 1 ([{fd=3, revents=POLLIN}])
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\0\0\0\0"}, {{len=164, type=0x65 /* NLMSG_??? */, flags=0, seq=131104, pid=65544}, "\0\0\0\0\22\0\2\0vm-145-disk-1\0\0\0\330\0\3\0(\0\1\0"...}, {{len=6433, type=0x5 /* NLMSG_??? */, flags=NLM_F_DUMP_INTR, seq=0, pid=1114117}, "\1\0\0\0\5\0\22\0\1\0\0\0\5\0\23\0\0\0\0\0\10\0\24\0\0\0\0\0\10\0\25\0"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 20
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\0\0\0\0"}, {{len=164, type=0x65 /* NLMSG_??? */, flags=0, seq=131104, pid=65544}, "\0\0\0\0\22\0\2\0vm-145-disk-1\0\0\0\330\0\3\0(\0\1\0"...}, {{len=6433, type=0x5 /* NLMSG_??? */, flags=NLM_F_DUMP_INTR, seq=0, pid=1114117}, "\1\0\0\0\5\0\22\0\1\0\0\0\5\0\23\0\0\0\0\0\10\0\24\0\0\0\0\0\10\0\25\0"...}, {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
>> write(3, "4\0\0\0\34\0\1\3\230\251,[\330f\0\0 \2\0\0\377\377\377\377\0\0\0\0\30\0\2\0"..., 52
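> For reference, the trace above can be captured to a file with a hard time
> limit, so the terminal never hangs (a sketch; the output filename is
> arbitrary):

```shell
# Trace only the netlink-related syscalls of drbdsetup, give up after 10s
# (timeout exits with status 124 when it has to kill the command)
timeout 10 strace -tt -e trace=write,poll,recvmsg -o drbdsetup.strace \
    drbdsetup status || true

# Count how many multipart dumps completed before the hang
if [ -f drbdsetup.strace ]; then grep -c NLMSG_DONE drbdsetup.strace; fi
```

> With -tt timestamps in the file, it is easy to see whether the last
> syscall before the hang was a write (request sent, no reply) or a poll
> that never returns.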
> 
> drbdadm runs fine on node 2.
> 
> I don't exactly see how to interpret this.
> 
> Finally, I can see that node 1 is keeping the drbdctrl resource as primary:
> something must have gone wrong on this node.
> 
> drbdtop actually runs correctly and shows, for the problematic resource:
>   volume 0 (/dev/drbd164): UpToDate(normal disk state) Blocked: upper
> and:
>   Connection to node2(Unknown): NetworkFailure(lost connection to node2)
> 
> How can I debug such a situation without rebooting node1?
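> One reboot-free way to see where everything is blocked is to ask the kernel
> itself: the SysRq "w" trigger logs the stack of every D-state task to the
> kernel ring buffer. A sketch, assuming SysRq is available (both writes need
> root and are no-ops otherwise):

```shell
# Enable SysRq, then dump all blocked (D-state) tasks to the kernel log
echo 1 > /proc/sys/kernel/sysrq 2>/dev/null || true
echo w > /proc/sysrq-trigger 2>/dev/null || true

# Read the result back from the kernel ring buffer
dmesg | tail -n 100
```

> That usually shows whether drbdsetup (and any VM I/O behind it) is stuck
> in DRBD, in netlink, or on the backing device, without touching the node.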
> 
> This is not the first time we have encountered such a situation, and
> rebooting each time is really a pain; we are talking about highly
> available clusters here.
> 
> Any other info I can provide?
> 
> Thanks a lot !
> 
> Best regards, Julien Escario
> 
> P.S.: drbdmanage output
> 
> On node 1 (current drbdctrl primary):
> 
> # drbdmanage r
> ERROR:dbus.proxies:Introspect error on :1.53:/interface:
> dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not
> receive a reply. Possible causes include: the remote application did not
> send a reply, the message bus security policy blocked the reply, the reply
> timeout expired, or the network connection was broken.
> 
> Error: Cannot connect to the drbdmanaged process using DBus
> The DBus subsystem returned the following error description:
> org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible
> causes include: the remote application did not send a reply, the message
> bus security policy blocked the reply, the reply timeout expired, or the
> network connection was broken.
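> A NoReply from D-Bus only says the server side never answered; it does not
> say whether drbdmanaged is dead or merely stuck. Two quick checks (the bus
> name filter below is a guess, adjust to whatever ListNames actually shows):

```shell
# Is a drbdmanage server process still alive at all?
pgrep -a -f drbdmanaged || echo "no drbdmanaged process found"

# Is anything drbdmanage-related still registered on the system bus?
dbus-send --system --print-reply --dest=org.freedesktop.DBus \
    /org/freedesktop/DBus org.freedesktop.DBus.ListNames \
    | grep -i drbdmanage || true
```

> If the process is alive but the name is gone (or it never replies), the
> server is probably blocked on the same stuck resource as drbdsetup.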
> 
> On node 2 (showing drbdctrl as secondary):
> # drbdmanage r
> Waiting for server: ...............
> Error: Satellite could not request control volume from leader
> No resources defined
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
> 
