[PATCH] drbd: Fix IO block after network failure

zhengbing.huang zhengbing.huang at easystack.cn
Wed Feb 19 04:05:06 CET 2025


During a network failure test, I/O does not finish.
The oldest request has the following status information:

master: pending|postponed	local: in-AL|completed|ok	net[1]: queued|done : C|barr

This request still has RQ_NET_QUEUED set, so its reference count
cannot drop to zero and the request cannot complete.

Commit 8962f7c03c1
("drbd: exclude requests that are not yet queued from "seen_dagtag_sector"")
modified the __next_request_for_connection() function,
which leaves the sender thread unable to clean up all
pending requests after a network failure.

The race occurs as follows, where T is a request submission thread
and S is the sender thread:
S: process_one_request() handles r0
S: network failure; drbd_send_dblock(r0) fails, then __req_mod(r0, SEND_FAILED, ...) is called
S: mod_rq_state() clears RQ_NET_QUEUED on r0, which still has RQ_NET_PENDING
T: r1 arrives in drbd_send_and_submit(), is added to the transfer_log, and gets RQ_NET_QUEUED set
S: drbd_sender() handles the network failure, change_cstate(C_NETWORK_FAILURE)

When the sender thread transitions to stopping and tries to
clean up all currently unprocessed requests (by calling __req_mod(req, SEND_CANCELED, ...)),
it cannot find r1, because in the __next_request_for_connection() function
r0 always satisfies the first if condition and NULL is returned:
static struct drbd_request *__next_request_for_connection(...)
{
...
		if (unlikely(s & RQ_NET_PENDING && !(s & (RQ_NET_QUEUED|RQ_NET_SENT))))
			return NULL;
...
}
As a result, r1 can never be completed because it still has RQ_NET_QUEUED set.

So, in the sender's cleanup path, find every request
with RQ_NET_QUEUED set and clean it up.

Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
---
 drbd/drbd_sender.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drbd/drbd_sender.c b/drbd/drbd_sender.c
index 80badc606..e6fc751c7 100644
--- a/drbd/drbd_sender.c
+++ b/drbd/drbd_sender.c
@@ -3251,6 +3251,24 @@ static struct drbd_request *tl_next_request_for_connection(struct drbd_connectio
 	return connection->todo.req;
 }
 
+static struct drbd_request *tl_next_request_for_cleanup(struct drbd_connection *connection)
+{
+	struct drbd_request *req;
+	struct drbd_request *found_req = NULL;
+
+	list_for_each_entry_rcu(req, &connection->resource->transfer_log, tl_requests) {
+		unsigned s = req->net_rq_state[connection->peer_node_id];
+
+		if (s & RQ_NET_QUEUED) {
+			found_req = req;
+			break;
+		}
+	}
+
+	connection->todo.req = found_req;
+	return connection->todo.req;
+}
+
 static void maybe_send_state_afer_ahead(struct drbd_connection *connection)
 {
 	struct drbd_peer_device *peer_device;
@@ -3644,7 +3662,7 @@ int drbd_sender(struct drbd_thread *thi)
 	/* cleanup all currently unprocessed requests */
 	if (!connection->todo.req) {
 		rcu_read_lock();
-		tl_next_request_for_connection(connection);
+		tl_next_request_for_cleanup(connection);
 		rcu_read_unlock();
 	}
 	while (connection->todo.req) {
@@ -3660,7 +3678,7 @@ int drbd_sender(struct drbd_thread *thi)
 			complete_master_bio(device, &m);
 
 		rcu_read_lock();
-		tl_next_request_for_connection(connection);
+		tl_next_request_for_cleanup(connection);
 		rcu_read_unlock();
 	}
 
-- 
2.43.0
