[PATCH] drbd: Fix IO block after network failure

Philipp Reisner philipp.reisner at linbit.com
Thu Mar 20 07:36:44 CET 2025


Hi Zhengbing,

Yes, I verified your findings and applied this patch with tiny
modifications to make it checkpatch.pl compliant.
https://github.com/LINBIT/drbd/commit/4e28788df7f935ed78042f74b0969dd7fc0c7eb7

Thanks!

Best regards,
 Philipp


On Wed, Feb 19, 2025 at 4:10 AM zhengbing.huang
<zhengbing.huang at easystack.cn> wrote:
>
> During a network failure test, I/O does not finish.
> The oldest_request shows the following status information:
>
> master: pending|postponed       local: in-AL|completed|ok       net[1]: queued|done : C|barr
>
> This req also has RQ_NET_QUEUED set, so its reference count
> cannot drop to zero and the req cannot complete.
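>
> (Illustration only, not part of the patch: the reference in question is
> req->completion_ref. Roughly, mod_rq_state() in drbd_req.c pairs the
> flag with the reference like this, so the reference taken when
> RQ_NET_QUEUED was set is only put back once the flag is cleared:)
>
>         /* setting RQ_NET_QUEUED takes a completion reference ... */
>         if (!(s & RQ_NET_QUEUED) && (set & RQ_NET_QUEUED))
>                 atomic_inc(&req->completion_ref);
>         /* ... which is only dropped again when the flag is cleared */
>         if ((s & RQ_NET_QUEUED) && (clear & RQ_NET_QUEUED))
>                 ++c_put;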
>
> Commit 8962f7c03c1
> ("drbd: exclude requests that are not yet queued from "seen_dagtag_sector"")
> modified the __next_request_for_connection() function,
> which leaves the sender thread unable to clean up all
> pending requests after a network failure.
>
> The race occurs as follows, where T is a submitting thread
> and S is the sender thread:
> S: process_one_request() handles r0
> S: network failure; drbd_send_dblock(r0) fails, then __req_mod(r0, SEND_FAILED...) is called
> S: mod_rq_state() clears RQ_NET_QUEUED on r0, which still has RQ_NET_PENDING
> T: r1 arrives at drbd_send_and_submit(), is added to the transfer_log, and gets RQ_NET_QUEUED set
> S: drbd_sender() handles the network failure, change_cstate(C_NETWORK_FAILURE)
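>
> After this sequence, the per-peer net_rq_state bits look roughly like
> this (simplified sketch, not actual status output):
>
>         r0: RQ_NET_PENDING              /* RQ_NET_QUEUED was cleared by SEND_FAILED */
>         r1: RQ_NET_QUEUED | ...         /* queued after the failure, never picked up */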
>
> When the sender thread is told to stop, it wants to clean up all
> currently unprocessed requests (calling __req_mod(req, SEND_CANCELED...)),
> but it cannot find r1, because in __next_request_for_connection()
> r0 always satisfies the first if condition, so the function returns NULL:
> static struct drbd_request *__next_request_for_connection(...)
> {
> ...
>                 if (unlikely(s & RQ_NET_PENDING && !(s & (RQ_NET_QUEUED|RQ_NET_SENT))))
>                         return NULL;
> ...
> }
> As a result, r1 can never complete, because it still has RQ_NET_QUEUED set.
>
> So, in the sender's cleanup path, find all requests that still
> have RQ_NET_QUEUED set and clean them up.
>
> Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
> ---
>  drbd/drbd_sender.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/drbd/drbd_sender.c b/drbd/drbd_sender.c
> index 80badc606..e6fc751c7 100644
> --- a/drbd/drbd_sender.c
> +++ b/drbd/drbd_sender.c
> @@ -3251,6 +3251,24 @@ static struct drbd_request *tl_next_request_for_connection(struct drbd_connectio
>         return connection->todo.req;
>  }
>
> +static struct drbd_request *tl_next_request_for_cleanup(struct drbd_connection *connection)
> +{
> +       struct drbd_request *req;
> +       struct drbd_request *found_req = NULL;
> +
> +       list_for_each_entry_rcu(req, &connection->resource->transfer_log, tl_requests) {
> +               unsigned s = req->net_rq_state[connection->peer_node_id];
> +
> +               if (s & RQ_NET_QUEUED) {
> +                       found_req = req;
> +                       break;
> +               }
> +       }
> +
> +       connection->todo.req = found_req;
> +       return connection->todo.req;
> +}
> +
>  static void maybe_send_state_afer_ahead(struct drbd_connection *connection)
>  {
>         struct drbd_peer_device *peer_device;
> @@ -3644,7 +3662,7 @@ int drbd_sender(struct drbd_thread *thi)
>         /* cleanup all currently unprocessed requests */
>         if (!connection->todo.req) {
>                 rcu_read_lock();
> -               tl_next_request_for_connection(connection);
> +               tl_next_request_for_cleanup(connection);
>                 rcu_read_unlock();
>         }
>         while (connection->todo.req) {
> @@ -3660,7 +3678,7 @@ int drbd_sender(struct drbd_thread *thi)
>                         complete_master_bio(device, &m);
>
>                 rcu_read_lock();
> -               tl_next_request_for_connection(connection);
> +               tl_next_request_for_cleanup(connection);
>                 rcu_read_unlock();
>         }
>
> --
> 2.43.0
>

