[PATCH 3/4] rdma: When the post send fails, change the connection state to C_NETWORK_FAILURE
zhengbing.huang
zhengbing.huang at easystack.cn
Sat Dec 6 09:12:26 CET 2025
In RDMA mode, the sync process gets stuck.
We found that on the receive side, the flow expects the next rx_desc
to have rx_sequence 92831, but the smallest rx_sequence among the
rx_descs on the current rx_descs list is 92833.
memory info:
rx_sequence = 92831,
rdma_transport = 0xff1616c673f3aa60
}, {
send_wq = {
crash> list -o dtr_rx_desc.list -s dtr_rx_desc.sequence -H 0xff1616bf0cd80008
ff1616e0208346c0
sequence = 92833
ff1616c4a7f78600
sequence = 92834
Then we found the log of the failed send_wr post on the send side:
kernel: infiniband mlx5_1: mlx5_ib_post_send:1101:(pid 43146):
kernel: drbd drbd1 rdma: ib_post_send() failed -12
The problem is that when ib_post_send() fails to post a send_wr,
flush_send_buffer() returns an error, but its caller does not check the
return value and continues to run. Because tx_sequence and rx_sequence are
incremented independently on the send and receive sides, the receive side
ends up waiting forever for an rx_desc whose post has already failed.
The solution is to use drbd_control_event() to disconnect the current connection.
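
To illustrate the failure mode described above, here is a minimal user-space
sketch. It is a toy model, not DRBD code: toy_sender, toy_receiver, toy_send(),
toy_receive() and toy_run() are hypothetical names, the "connected" flag only
stands in for what drbd_control_event() achieves, and the receiver simply counts
descriptors it cannot consume instead of keeping them on a list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model: every descriptor carries a sequence number and
 * the receiver only consumes the exact next number it expects. */
struct toy_sender {
	unsigned int tx_sequence;
	bool connected;
};

struct toy_receiver {
	unsigned int rx_sequence;
};

/* The sequence number is consumed before the post is attempted, so a
 * failed post leaves a hole in the stream seen by the receiver. */
static unsigned int toy_send(struct toy_sender *s, bool post_ok,
			     bool check_error, bool *delivered)
{
	unsigned int seq = s->tx_sequence++;

	*delivered = post_ok;
	if (!post_ok && check_error)
		s->connected = false;	/* stand-in for tearing down the connection */
	return seq;
}

/* Returns true if the descriptor was the expected one and was consumed. */
static bool toy_receive(struct toy_receiver *r, unsigned int seq)
{
	if (seq != r->rx_sequence)
		return false;		/* gap: the receiver keeps waiting */
	r->rx_sequence++;
	return true;
}

/* Sends n descriptors and fails the post at index fail_at. Returns how
 * many delivered descriptors the receiver could not consume. */
static unsigned int toy_run(unsigned int n, unsigned int fail_at,
			    bool check_error, bool *still_connected)
{
	struct toy_sender s = { .tx_sequence = 0, .connected = true };
	struct toy_receiver r = { .rx_sequence = 0 };
	unsigned int stuck = 0;

	assert(still_connected != NULL);

	for (unsigned int i = 0; i < n && s.connected; i++) {
		bool delivered;
		unsigned int seq = toy_send(&s, i != fail_at, check_error,
					    &delivered);

		if (delivered && !toy_receive(&r, seq))
			stuck++;
	}

	*still_connected = s.connected;
	return stuck;
}
```

With check_error set to false, every descriptor sent after the failed post is
delivered but never consumed, and the connection still looks healthy. With
check_error set to true, the sender stops at the first failed post, so the
receiver is never left waiting on a sequence number that will not arrive.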
Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
---
drbd/drbd_transport_rdma.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drbd/drbd_transport_rdma.c b/drbd/drbd_transport_rdma.c
index 5a28f58f0..ce7be2549 100644
--- a/drbd/drbd_transport_rdma.c
+++ b/drbd/drbd_transport_rdma.c
@@ -1936,8 +1936,10 @@ static int dtr_handle_tx_cq_event(struct ib_cq *cq, struct dtr_cm *cm)
err = dtr_repost_tx_desc(cm, tx_desc);
if (!err)
tx_desc = NULL; /* it is in the air again! Fly! */
- else if (__ratelimit(&rdma_transport->rate_limit))
+ else if (__ratelimit(&rdma_transport->rate_limit)) {
tr_warn(transport, "repost of tx_desc failed! %d\n", err);
+ drbd_control_event(transport, CLOSED_BY_PEER);
+ }
}
}
@@ -3234,6 +3236,9 @@ static int dtr_send_page(struct drbd_transport *transport, enum drbd_stream stre
if (err) {
put_page(page);
kfree(tx_desc);
+
+ tr_err(transport, "dtr_post_tx_desc() failed %d\n", err);
+ drbd_control_event(transport, CLOSED_BY_PEER);
}
if (stream == DATA_STREAM)
@@ -3320,6 +3325,9 @@ static int dtr_send_bio_part(struct dtr_transport *rdma_transport,
}
kfree(tx_desc);
}
+
+ tr_err(transport, "dtr_post_tx_desc() failed %d\n", err);
+ drbd_control_event(transport, CLOSED_BY_PEER);
}
return err;
--
2.43.0