[PATCH 3/4] rdma: When the post send fails, change the connection state to C_NETWORK_FAILURE

zhengbing.huang zhengbing.huang at easystack.cn
Sat Dec 6 09:12:26 CET 2025


In RDMA mode, the sync process gets stuck.

On the receiving side, the receive path expects the next rx_desc to carry
rx_sequence 92831, but the lowest rx_sequence among the rx_descs currently
on the rx_descs list is 92833.
Memory info:
      rx_sequence = 92831,
      rdma_transport = 0xff1616c673f3aa60
      }, {
    send_wq = {
crash> list -o dtr_rx_desc.list -s dtr_rx_desc.sequence  -H 0xff1616bf0cd80008
ff1616e0208346c0
  sequence = 92833
ff1616c4a7f78600
  sequence = 92834

Then we found the log of the failed send_wr post on the sending side:
kernel: infiniband mlx5_1: mlx5_ib_post_send:1101:(pid 43146):
kernel: drbd drbd1 rdma: ib_post_send() failed -12

The problem is that when ib_post_send() fails to post a send_wr,
flush_send_buffer() returns an error, but its caller does not check the
return value and carries on. Because tx_sequence on the sending side and
rx_sequence on the receiving side are each incremented independently, the
receiving side ends up waiting forever for an rx_desc whose send_wr was
never posted.
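
To illustrate the failure mode, here is a minimal user-space sketch. It is
not DRBD code: model_sender, model_receiver, model_send(), model_receive()
and model_run() are hypothetical names, and the model assumes the sender
consumes a sequence number even when the post fails, while the receiver
only delivers the exact sequence number it expects next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not DRBD code. */
struct model_sender { unsigned int tx_sequence; };
struct model_receiver { unsigned int rx_sequence; };

/* Consume one sequence number. The counter advances even when the post
 * fails (post_ok == false), which is the pattern described above. */
static unsigned int model_send(struct model_sender *s, bool post_ok, bool *posted)
{
	unsigned int seq = s->tx_sequence++;

	*posted = post_ok;
	return seq;
}

/* Deliver strictly in order: accept seq only if it is the expected one. */
static bool model_receive(struct model_receiver *r, unsigned int seq)
{
	if (seq != r->rx_sequence)
		return false; /* stays queued, the receiver keeps waiting */
	r->rx_sequence++;
	return true;
}

/* Send `count` descriptors, both sides starting at `start`. The descriptor
 * at index `fail_idx` fails to post. Returns how many were delivered and
 * stores the sequence number the receiver is still waiting for. */
static unsigned int model_run(unsigned int start, unsigned int count,
			      unsigned int fail_idx, unsigned int *stuck_at)
{
	struct model_sender s = { .tx_sequence = start };
	struct model_receiver r = { .rx_sequence = start };
	unsigned int delivered = 0;
	unsigned int i;

	assert(stuck_at != NULL);

	for (i = 0; i < count; i++) {
		bool posted;
		unsigned int seq = model_send(&s, i != fail_idx, &posted);

		if (posted && model_receive(&r, seq))
			delivered++;
	}
	*stuck_at = r.rx_sequence;
	return delivered;
}
```

Calling model_run(92830, 5, 1, &stuck) delivers only the first descriptor
and leaves the receiver waiting for 92831, although later descriptors were
posted successfully. The numbers are chosen only to resemble the report.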

The solution is to call drbd_control_event() with CLOSED_BY_PEER when
posting a tx_desc fails, so that the current connection is disconnected.
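
The pattern can be sketched in user space as follows. This is only an
illustration under stated assumptions: mock_transport, mock_post_send(),
mock_control_event() and mock_send_checked() are hypothetical stand-ins,
not the DRBD or verbs API. The point is that the caller checks the return
value and requests a disconnect instead of carrying on.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport state; not the DRBD struct. */
struct mock_transport {
	bool disconnect_requested;
	int posts_attempted;
};

/* Stand-in for posting a send work request. Returns simulated_err:
 * 0 on success, a negative errno value on failure. */
static int mock_post_send(struct mock_transport *t, int simulated_err)
{
	t->posts_attempted++;
	return simulated_err;
}

/* Stand-in for raising a control event that tears the connection down. */
static void mock_control_event(struct mock_transport *t)
{
	t->disconnect_requested = true;
}

/* Check the result of the post and request a disconnect on failure,
 * so neither side keeps waiting for a descriptor that was never sent. */
static int mock_send_checked(struct mock_transport *t, int simulated_err)
{
	int err;

	assert(t != NULL);

	err = mock_post_send(t, simulated_err);
	if (err)
		mock_control_event(t);
	return err;
}
```

With a simulated error of -ENOMEM (the -12 seen in the log above),
mock_send_checked() returns the error and sets disconnect_requested. A
successful post leaves the flag untouched.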

Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
---
 drbd/drbd_transport_rdma.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drbd/drbd_transport_rdma.c b/drbd/drbd_transport_rdma.c
index 5a28f58f0..ce7be2549 100644
--- a/drbd/drbd_transport_rdma.c
+++ b/drbd/drbd_transport_rdma.c
@@ -1936,8 +1936,10 @@ static int dtr_handle_tx_cq_event(struct ib_cq *cq, struct dtr_cm *cm)
 			err = dtr_repost_tx_desc(cm, tx_desc);
 			if (!err)
 				tx_desc = NULL; /* it is in the air again! Fly! */
-			else if (__ratelimit(&rdma_transport->rate_limit))
+			else if (__ratelimit(&rdma_transport->rate_limit)) {
 				tr_warn(transport, "repost of tx_desc failed! %d\n", err);
+				drbd_control_event(transport, CLOSED_BY_PEER);
+			}
 		}
 	}
 
@@ -3234,6 +3236,9 @@ static int dtr_send_page(struct drbd_transport *transport, enum drbd_stream stre
 	if (err) {
 		put_page(page);
 		kfree(tx_desc);
+
+		tr_err(transport, "dtr_post_tx_desc() failed %d\n", err);
+		drbd_control_event(transport, CLOSED_BY_PEER);
 	}
 
 	if (stream == DATA_STREAM)
@@ -3320,6 +3325,9 @@ static int dtr_send_bio_part(struct dtr_transport *rdma_transport,
 			}
 			kfree(tx_desc);
 		}
+
+		tr_err(transport, "dtr_post_tx_desc() failed %d\n", err);
+		drbd_control_event(transport, CLOSED_BY_PEER);
 	}
 
 	return err;
-- 
2.43.0


