From zhengbing.huang at easystack.cn  Thu Apr 17 08:08:19 2025
From: zhengbing.huang at easystack.cn (zhengbing.huang)
Date: Thu, 17 Apr 2025 14:08:19 +0800
Subject: [PATCH] rdma: Fix cm leaks in some abnormal scenarios
Message-ID: <20250417060819.2157347-1-zhengbing.huang@easystack.cn>

In dtr_create_rx_desc(), if ib_dma_map_single() returns an error,
the function jumps to the error path, which does not decrement
the reference count of the cm.

The retry branch in dtr_post_tx_desc() has a similar issue.
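
As an illustration only, the leak pattern described above can be sketched
in userspace C. Everything named demo_* is hypothetical: struct demo_cm
stands in for struct dtr_cm, a plain counter replaces the kref, and the
map_fails flag simulates ib_dma_mapping_error() reporting a failure. The
point is simply that a reference taken before the failing step has to be
dropped on the error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dtr_cm; a plain counter replaces the kref. */
struct demo_cm {
	int refs;
	bool destroyed;
};

static void demo_cm_get(struct demo_cm *cm)
{
	cm->refs++;
}

static void demo_cm_put(struct demo_cm *cm)
{
	assert(cm->refs > 0);
	/* The driver would use kref_put(&cm->kref, dtr_destroy_cm) here. */
	if (--cm->refs == 0)
		cm->destroyed = true;
}

/*
 * Allocates a buffer, takes a reference on the cm, then "maps" the buffer.
 * On success the reference stays with the descriptor and 0 is returned.
 * On a mapping failure the reference is dropped again before returning -1.
 */
static int demo_create_rx_desc(struct demo_cm *cm, bool map_fails, void **out)
{
	void *buf = malloc(64);

	if (!buf)
		return -1;	/* no reference taken yet, nothing to put */

	demo_cm_get(cm);
	if (map_fails)
		goto out_put;

	*out = buf;
	return 0;

out_put:
	demo_cm_put(cm);	/* without this put, the cm would leak */
	free(buf);
	return -1;
}
```

With an initial count of 1, a failed call leaves the count at 1, while a
successful call leaves it at 2 until the descriptor is freed.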

Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
---
 drbd/drbd-headers          |  2 +-
 drbd/drbd_transport_rdma.c | 14 ++++++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drbd/drbd-headers b/drbd/drbd-headers
index 94f447251..9188ee14f 160000
--- a/drbd/drbd-headers
+++ b/drbd/drbd-headers
@@ -1 +1 @@
-Subproject commit 94f4472513f351efba5788f783feba6ac6efe9fc
+Subproject commit 9188ee14f6de582a493d260c091db0c655b30d50
diff --git a/drbd/drbd_transport_rdma.c b/drbd/drbd_transport_rdma.c
index 9ce15a0ce..be919a926 100644
--- a/drbd/drbd_transport_rdma.c
+++ b/drbd/drbd_transport_rdma.c
@@ -2080,8 +2080,10 @@ static int dtr_create_rx_desc(struct dtr_flow *flow, gfp_t gfp_mask)
 	rx_desc->sge.addr = ib_dma_map_single(cm->id->device, page_address(page), alloc_size,
 					      DMA_FROM_DEVICE);
 	err = ib_dma_mapping_error(cm->id->device, rx_desc->sge.addr);
-	if (err)
-		goto out;
+	if (err) {
+		tr_err(transport, "ib_dma_map_single() failed %d\n", err);
+		goto out_put;
+	}
 	rx_desc->sge.length = alloc_size;
 
 	atomic_inc(&flow->rx_descs_allocated);
@@ -2094,6 +2096,9 @@ static int dtr_create_rx_desc(struct dtr_flow *flow, gfp_t gfp_mask)
 		dtr_free_rx_desc(rx_desc);
 	}
 	return err;
+
+out_put:
+	kref_put(&cm->kref, dtr_destroy_cm);
 out:
 	kfree(rx_desc);
 	drbd_free_pages(transport, page, 0);
@@ -2396,9 +2401,10 @@ retry:
 		return -EINTR;
 
 	flow = &cm->path->flow[stream];
-	if (atomic_dec_if_positive(&flow->peer_rx_descs) < 0)
+	if (atomic_dec_if_positive(&flow->peer_rx_descs) < 0) {
+		kref_put(&cm->kref, dtr_destroy_cm);
 		goto retry;
-
+	}
 	device = cm->id->device;
 	switch (tx_desc->type) {
 	case SEND_PAGE:
-- 
2.43.0


From zhengbing.huang at easystack.cn  Fri Apr 25 12:24:21 2025
From: zhengbing.huang at easystack.cn (zhengbing.huang)
Date: Fri, 25 Apr 2025 18:24:21 +0800
Subject: [PATCH] rdma: Fix cm leak
Message-ID: <20250425102421.1673048-1-zhengbing.huang@easystack.cn>

We found that when all DRBD resources are down, the reference count
of the drbd_transport_rdma module is still 1.

[root@node-4 ~]# drbdadm status
No currently configured DRBD found.
[root@node-4 ~]# lsmod | grep drbd
drbd_transport_rdma   262144  1

We then found an unreleased cm structure and discovered
that its state is DSB_CONNECT_REQ + DSB_ERROR.

crash> struct dtr_cm ffff57e515da9400
struct dtr_cm {
  kref = {
    refcount = {
      refs = {
        counter = 1
...
state = 9,
...
}

The scenario leading to this problem is likely as follows:
dtr_cma_event_handler() gets an RDMA_CM_EVENT_CONNECT_REQUEST event and
calls dtr_cma_accept() to allocate a cm and set cm->state = DSM_CONNECT_REQ.
At this point the cm->kref count is 2.
Then dtr_cma_event_handler() gets an xxx_CONNECT_ERROR/xxx_UNREACHABLE/xxx_REJECTED
event and calls set_bit(DSB_ERROR, &cm->state).
dtr_cma_retry_connect() removes the cm from the path and puts one reference.
Because cm->state does not have the DSB_CONNECTING flag set, the handler
returns 0 and keeps the remaining reference.
Now the cm->kref count is 1 and the state is DSB_CONNECT_REQ + DSB_ERROR.

Therefore, wherever we test the DSB_CONNECTING flag,
we should also test the DSB_CONNECT_REQ flag to avoid leaking the cm.
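
The difference between the old and the new check can be sketched in
userspace C as follows. This is only an illustration under assumptions:
the bit positions (CONNECT_REQ = 0, CONNECTING = 1, ERROR = 3) are not
shown in this thread and are chosen because they are consistent with the
observed state = 9, and demo_test_and_clear_bit() is a non-atomic
stand-in for the kernel's test_and_clear_bit(). All demo_* and DEMO_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit positions; consistent with state = 9 (bit 0 + bit 3). */
enum demo_state_bits {
	DEMO_CONNECT_REQ = 0,
	DEMO_CONNECTING  = 1,
	DEMO_ERROR       = 3,
};

/* Non-atomic stand-in for the kernel's test_and_clear_bit(). */
static bool demo_test_and_clear_bit(int nr, unsigned long *state)
{
	unsigned long mask = 1UL << nr;
	bool was_set = (*state & mask) != 0;

	*state &= ~mask;
	return was_set;
}

/* Old check: only a cm in CONNECTING state gets its last reference put. */
static bool demo_old_should_put(unsigned long *state)
{
	return demo_test_and_clear_bit(DEMO_CONNECTING, state);
}

/* Patched check: a cm still in CONNECT_REQ state is handled as well. */
static bool demo_new_should_put(unsigned long *state)
{
	return demo_test_and_clear_bit(DEMO_CONNECTING, state) ||
	       demo_test_and_clear_bit(DEMO_CONNECT_REQ, state);
}
```

For a state of 9, the old check reports false and leaves the state
untouched, so nobody drops the last reference. The patched check reports
true once and clears the CONNECT_REQ bit, so a second caller racing on the
same cm sees false and does not put the reference twice.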

Signed-off-by: zhengbing.huang <zhengbing.huang at easystack.cn>
---
 drbd/drbd_transport_rdma.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drbd/drbd_transport_rdma.c b/drbd/drbd_transport_rdma.c
index be919a926..f24440580 100644
--- a/drbd/drbd_transport_rdma.c
+++ b/drbd/drbd_transport_rdma.c
@@ -1307,9 +1307,10 @@ static int dtr_cma_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event
 		set_bit(DSB_ERROR, &cm->state);
 
 		dtr_cma_retry_connect(cm->path, cm);
-		if (!test_and_clear_bit(DSB_CONNECTING, &cm->state))
-			return 0; /* keep ref; __dtr_disconnect_path() won */
-		break;
+		if (test_and_clear_bit(DSB_CONNECTING, &cm->state) ||
+			test_and_clear_bit(DSB_CONNECT_REQ, &cm->state))
+			break;
+		return 0; /* keep ref; __dtr_disconnect_path() won */
 
 	case RDMA_CM_EVENT_DISCONNECTED:
 		// pr_info("%s: RDMA_CM_EVENT_DISCONNECTED\n", cm->name);
@@ -2787,7 +2788,8 @@ static void __dtr_disconnect_path(struct dtr_path *path)
 	 * events. Destroy the cm and cm_id to avoid leaking it.
 	 * This is racing with the event delivery, which drops a reference.
 	 */
-	if (test_and_clear_bit(DSB_CONNECTING, &cm->state))
+	if (test_and_clear_bit(DSB_CONNECTING, &cm->state) ||
+		test_and_clear_bit(DSB_CONNECT_REQ, &cm->state))
 		kref_put(&cm->kref, dtr_destroy_cm);
 
 	kref_put(&cm->kref, dtr_destroy_cm);
-- 
2.43.0


From splc.regional.east at gmail.com  Thu Apr 24 17:22:52 2025
From: splc.regional.east at gmail.com (Reginald Cirque)
Date: Thu, 24 Apr 2025 11:22:52 -0400
Subject: Possible memory leak in DRBD 8.4.11
Message-ID: <CANA-72DMWg-UwGBVbM9y-p9zUJu_4LqZk3V9qOEZCM0nSHzq=Q@mail.gmail.com>

Good day,
I was syncing a 300 GB LVM volume from a DRBD primary to a newly built
secondary and noticed that the sending host (primary) held 300 GB of
"untracked" used memory (not visible in slab or cache, not associated
with any application, and simply shown as "kernel dynamic memory" in
"smem -twk" output) for many hours after the sync had completed.
This suggests that the DRBD buffers/page pool were not reclaimed.

When I ran "drbdsetup down" to disconnect the secondary, I observed a
kernel log message:
"block drbd3: net_ee not empty, killed 291226 entries", which further
suggests to me that DRBD buffers are not being properly reclaimed.

The memory was returned to the system almost instantly after
disconnecting the secondary.

I am running Linux kernel 6.1.128-1.el8.x86_64 and patching-in the
8.4.11 DRBD module in-tree.