[PATCH] drbd: Fix I/O blocking after network failure
zhengbing.huang
zhengbing.huang at easystack.cn
Wed Feb 19 04:05:06 CET 2025
During a network failure test, I/O does not finish.
The oldest_request shows the following status information:
master: pending|postponed local: in-AL|completed|ok net[1]: queued|done : C|barr
This request still has RQ_NET_QUEUED set, so its reference count
cannot drop to zero and the request can never complete.
Commit 8962f7c03c1
("drbd: exclude requests that are not yet queued from "seen_dagtag_sector"")
modified the __next_request_for_connection() function,
which leaves the sender thread unable to clean up all
pending requests after a network failure.
The race occurs as follows, where T is a thread submitting requests
and S is the sender thread:
S: process_one_request() handles r0
S: network failure: drbd_send_dblock(r0) fails, then __req_mod(r0, SEND_FAILED, ...) is called
S: mod_rq_state() clears RQ_NET_QUEUED on r0, which still has RQ_NET_PENDING
T: r1 arrives in drbd_send_and_submit(), is added to the transfer_log, and gets RQ_NET_QUEUED set
S: drbd_sender() handles the network failure, change_cstate(C_NETWORK_FAILURE)
When the sender thread's state changes to stop, it wants to
clean up all currently unprocessed requests by calling __req_mod(req, SEND_CANCELED, ...).
But it cannot find r1, because in the __next_request_for_connection() function
r0 always satisfies the first if condition and the function returns NULL:
static struct drbd_request *__next_request_for_connection(...)
{
	...
	if (unlikely(s & RQ_NET_PENDING && !(s & (RQ_NET_QUEUED|RQ_NET_SENT))))
		return NULL;
	...
}
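To make the effect concrete, here is a small stand-alone model of that scan
(illustrative only, not DRBD code; toy_req and toy_next_request are made-up
names, and the flag values are arbitrary for the model). With r0 in the
post-SEND_FAILED state described above, the scan returns NULL before it ever
reaches r1:

#include <stdio.h>

#define RQ_NET_PENDING	(1u << 0)
#define RQ_NET_QUEUED	(1u << 1)
#define RQ_NET_SENT	(1u << 2)

struct toy_req { const char *name; unsigned s; };

/* Same shape as __next_request_for_connection(): abort the whole scan
 * as soon as a request is pending but neither queued nor sent. */
static struct toy_req *toy_next_request(struct toy_req *tl, int n)
{
	for (int i = 0; i < n; i++) {
		unsigned s = tl[i].s;

		if (s & RQ_NET_PENDING && !(s & (RQ_NET_QUEUED | RQ_NET_SENT)))
			return NULL;
		if (s & RQ_NET_QUEUED)
			return &tl[i];
	}
	return NULL;
}

int main(void)
{
	/* State after the race above: r0 lost RQ_NET_QUEUED via SEND_FAILED,
	 * r1 was queued later and still carries RQ_NET_QUEUED. */
	struct toy_req tl[] = {
		{ "r0", RQ_NET_PENDING },
		{ "r1", RQ_NET_PENDING | RQ_NET_QUEUED },
	};
	struct toy_req *req = toy_next_request(tl, 2);

	printf("cleanup scan found: %s\n", req ? req->name : "nothing");
	return 0;
}

This prints "cleanup scan found: nothing": the stalled r0 hides the
queued r1 from the cleanup loop.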
Finally, r1 can never complete because it still has RQ_NET_QUEUED set.
So, in the sender's cleanup path, find all requests that still have
RQ_NET_QUEUED set and clean them up.
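As a quick cross-check of that approach against the toy model above
(same toy_req list; toy_next_request_for_cleanup is again a made-up name),
a scan that only looks for RQ_NET_QUEUED does find r1:

/* Mirrors the idea of the new tl_next_request_for_cleanup() below:
 * ignore everything except RQ_NET_QUEUED. */
static struct toy_req *toy_next_request_for_cleanup(struct toy_req *tl, int n)
{
	for (int i = 0; i < n; i++) {
		if (tl[i].s & RQ_NET_QUEUED)
			return &tl[i];
	}
	return NULL;
}

With the same { r0, r1 } transfer log as above this returns r1, so the
sender can issue SEND_CANCELED for it and r1 can finally complete.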
Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
---
drbd/drbd_sender.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/drbd/drbd_sender.c b/drbd/drbd_sender.c
index 80badc606..e6fc751c7 100644
--- a/drbd/drbd_sender.c
+++ b/drbd/drbd_sender.c
@@ -3251,6 +3251,24 @@ static struct drbd_request *tl_next_request_for_connection(struct drbd_connectio
return connection->todo.req;
}
+static struct drbd_request *tl_next_request_for_cleanup(struct drbd_connection *connection)
+{
+ struct drbd_request *req;
+ struct drbd_request *found_req = NULL;
+
+ list_for_each_entry_rcu(req, &connection->resource->transfer_log, tl_requests) {
+ unsigned s = req->net_rq_state[connection->peer_node_id];
+
+ if (s & RQ_NET_QUEUED) {
+ found_req = req;
+ break;
+ }
+ }
+
+ connection->todo.req = found_req;
+ return connection->todo.req;
+}
+
static void maybe_send_state_afer_ahead(struct drbd_connection *connection)
{
struct drbd_peer_device *peer_device;
@@ -3644,7 +3662,7 @@ int drbd_sender(struct drbd_thread *thi)
/* cleanup all currently unprocessed requests */
if (!connection->todo.req) {
rcu_read_lock();
- tl_next_request_for_connection(connection);
+ tl_next_request_for_cleanup(connection);
rcu_read_unlock();
}
while (connection->todo.req) {
@@ -3660,7 +3678,7 @@ int drbd_sender(struct drbd_thread *thi)
complete_master_bio(device, &m);
rcu_read_lock();
- tl_next_request_for_connection(connection);
+ tl_next_request_for_cleanup(connection);
rcu_read_unlock();
}
--
2.43.0