[PATCH] drbd: Fix IO block after network failure
Philipp Reisner
philipp.reisner at linbit.com
Thu Mar 20 07:36:44 CET 2025
Hi Zhengbing,
Yes, I verified your findings and applied this patch with tiny
modifications to make it checkpatch.pl compliant.
https://github.com/LINBIT/drbd/commit/4e28788df7f935ed78042f74b0969dd7fc0c7eb7
Thanks!
Best regards,
Philipp
On Wed, Feb 19, 2025 at 4:10 AM zhengbing.huang
<zhengbing.huang at easystack.cn> wrote:
>
> During a network failure test, I/O does not finish.
> The oldest_request has the following status information:
>
> master: pending|postponed local: in-AL|completed|ok net[1]: queued|done : C|barr
>
> This req still has RQ_NET_QUEUED set, so its reference count
> cannot drop to zero and the req cannot complete.
>
> Commit 8962f7c03c1
> drbd: exclude requests that are not yet queued from "seen_dagtag_sector"
> modified the __next_request_for_connection() function,
> which leaves the sender thread unable to clean up all
> pending reqs after a network failure.
>
> The race occurs as follows, where T is a thread submitting a req
> and S is the sender thread:
> S: process_one_request() handles r0
> S: network failure. drbd_send_dblock(r0) fails, then S calls __req_mod(r0, SEND_FAILED...)
> S: calls mod_rq_state(); r0 has RQ_NET_QUEUED cleared but still has RQ_NET_PENDING
> T: r1 arrives in drbd_send_and_submit(), is added to the transfer_log, and gets RQ_NET_QUEUED set
> S: drbd_sender() handles the network failure, change_cstate(C_NETWORK_FAILURE)
>
> When the sender thread changes state to stop, it wants to
> clean up all currently unprocessed requests (by calling __req_mod(req, SEND_CANCELED...)),
> but it cannot find r1, because in the __next_request_for_connection() function
> r0 always satisfies the first if condition and the function returns NULL.
> static struct drbd_request *__next_request_for_connection(...)
> {
> 	...
> 	if (unlikely(s & RQ_NET_PENDING && !(s & (RQ_NET_QUEUED|RQ_NET_SENT))))
> 		return NULL;
> 	...
> }
> As a result, r1 can never complete because it still has RQ_NET_QUEUED set.
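
To make the stall concrete, here is a small user-space sketch of the situation described above. It is not the DRBD code: the flag values, `struct toy_req` and `toy_next_request()` are hypothetical stand-ins, and the lookup is reduced to the one check quoted above plus an assumed "return the first queued request" step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for illustration; the real ones are defined by DRBD. */
#define RQ_NET_PENDING (1u << 0)
#define RQ_NET_QUEUED  (1u << 1)
#define RQ_NET_SENT    (1u << 2)

/* Toy request: "state" stands in for net_rq_state[peer_node_id]. */
struct toy_req {
	const char *name;
	unsigned int state;
};

/*
 * Simplified stand-in for the sender's lookup: walk the log in order and
 * give up at the first request that is pending but neither queued nor sent.
 * Returning the first queued request is an assumption of this sketch.
 */
static struct toy_req *toy_next_request(struct toy_req *log, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		unsigned int s = log[i].state;

		if ((s & RQ_NET_PENDING) && !(s & (RQ_NET_QUEUED | RQ_NET_SENT)))
			return NULL;
		if (s & RQ_NET_QUEUED)
			return &log[i];
	}
	return NULL;
}

/* Returns 1 if r1 is still queued but the lookup cannot reach it. */
static int toy_r1_is_stranded(void)
{
	struct toy_req log[] = {
		{ "r0", RQ_NET_PENDING },	/* SEND_FAILED cleared RQ_NET_QUEUED */
		{ "r1", RQ_NET_QUEUED },	/* queued by T after the failure */
	};
	struct toy_req *found = toy_next_request(log, 2);

	return found == NULL && (log[1].state & RQ_NET_QUEUED) != 0;
}
```

In this toy model `toy_r1_is_stranded()` returns 1: the walk stops at r0, so r1 keeps its queued flag and is never handed to the cleanup code.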
>
> So, in the sender's cleanup path,
> find every req with RQ_NET_QUEUED set and clean it up.
>
> Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
> ---
> drbd/drbd_sender.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/drbd/drbd_sender.c b/drbd/drbd_sender.c
> index 80badc606..e6fc751c7 100644
> --- a/drbd/drbd_sender.c
> +++ b/drbd/drbd_sender.c
> @@ -3251,6 +3251,24 @@ static struct drbd_request *tl_next_request_for_connection(struct drbd_connectio
>  	return connection->todo.req;
>  }
>
> +static struct drbd_request *tl_next_request_for_cleanup(struct drbd_connection *connection)
> +{
> +	struct drbd_request *req;
> +	struct drbd_request *found_req = NULL;
> +
> +	list_for_each_entry_rcu(req, &connection->resource->transfer_log, tl_requests) {
> +		unsigned s = req->net_rq_state[connection->peer_node_id];
> +
> +		if (s & RQ_NET_QUEUED) {
> +			found_req = req;
> +			break;
> +		}
> +	}
> +
> +	connection->todo.req = found_req;
> +	return connection->todo.req;
> +}
> +
>  static void maybe_send_state_afer_ahead(struct drbd_connection *connection)
>  {
>  	struct drbd_peer_device *peer_device;
> @@ -3644,7 +3662,7 @@ int drbd_sender(struct drbd_thread *thi)
>  	/* cleanup all currently unprocessed requests */
>  	if (!connection->todo.req) {
>  		rcu_read_lock();
> -		tl_next_request_for_connection(connection);
> +		tl_next_request_for_cleanup(connection);
>  		rcu_read_unlock();
>  	}
>  	while (connection->todo.req) {
> @@ -3660,7 +3678,7 @@ int drbd_sender(struct drbd_thread *thi)
>  		complete_master_bio(device, &m);
>
>  		rcu_read_lock();
> -		tl_next_request_for_connection(connection);
> +		tl_next_request_for_cleanup(connection);
>  		rcu_read_unlock();
>  	}
>
> --
> 2.43.0
>
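
For readers following the diff, the cleanup loop after this change can be modeled in user space roughly as below. This is only a sketch under stated assumptions: the flag values and all `toy_*` names are hypothetical, and `toy_cancel()` assumes that the SEND_CANCELED transition clears RQ_NET_QUEUED, which is what lets the loop make progress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for illustration; the real ones are defined by DRBD. */
#define RQ_NET_PENDING (1u << 0)
#define RQ_NET_QUEUED  (1u << 1)

/* Toy request: "state" stands in for net_rq_state[peer_node_id]. */
struct toy_req {
	unsigned int state;
};

/* Same scan as tl_next_request_for_cleanup(): first entry with RQ_NET_QUEUED. */
static struct toy_req *toy_next_for_cleanup(struct toy_req *log, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (log[i].state & RQ_NET_QUEUED)
			return &log[i];
	return NULL;
}

/* Stand-in for __req_mod(req, SEND_CANCELED, ...); assumed to clear the queued flag. */
static void toy_cancel(struct toy_req *req)
{
	req->state &= ~RQ_NET_QUEUED;
}

/* Cancel queued requests until none are left; returns how many were canceled. */
static size_t toy_drain(struct toy_req *log, size_t n)
{
	size_t canceled = 0;
	struct toy_req *req;

	while ((req = toy_next_for_cleanup(log, n)) != NULL) {
		toy_cancel(req);
		canceled++;
	}
	return canceled;
}

/* r0 is pending but no longer queued; two later requests are still queued. */
static size_t toy_cleanup_demo(void)
{
	struct toy_req log[] = {
		{ RQ_NET_PENDING },
		{ RQ_NET_QUEUED },
		{ RQ_NET_QUEUED },
	};

	return toy_drain(log, 3);
}
```

In this model `toy_cleanup_demo()` returns 2: the scan skips the pending-only r0 instead of stopping at it, so both queued requests are canceled and the loop ends once no entry carries the queued flag.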