[PATCH 05/11] drbd_transport_rdma: dont break in dtr_tx_cq_event_handler if (cm->state != DSM_CONNECTED)

Dongsheng Yang dongsheng.yang at easystack.cn
Mon Jul 1 04:23:14 CEST 2024



On Fri, 2024/6/28 at 8:07 PM, Philipp Reisner wrote:
> Hello Dongsheng,
> 
> It appears that you are trying to fix a leak of cm structures. Is that correct?

Yes. In our network failure testing, we found that the drbdadm down command 
hangs in dtr_free(), waiting at 
wait_event(rdma_transport->cm_count_wait, !atomic_read(&rdma_transport->cm_count));

We located the leaked cm in memory and found that its tx_descs_posted was 
not 0. After more debugging we traced the problem to the code changed in [05/11].

Consider this case:

a) Two tx_descs are posted, so tx_descs_posted is 2.

b) The first tx_desc completes; dtr_tx_cq_event_handler() is called and 
enters dtr_handle_tx_cq_event().

c) The network fails and dtr_tx_timeout_work_fn() clears CONNECTED.

d) dtr_handle_tx_cq_event() returns. By this time the second tx_desc has 
already completed, so we expect rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP 
| IB_CQ_REPORT_MISSED_EVENTS); to return 1 and the handler to call 
dtr_handle_tx_cq_event() again in the next iteration of the while loop.

e) Instead, the handler sees that cm->state is not CONNECTED and breaks out 
of the outer while loop, so the second tx_desc is never handled.
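
To make steps a) to e) concrete, here is a minimal user-space sketch of the 
handler loop. It is only a model, not the driver code: struct mock_cm, 
mock_handle_tx_cq_event(), mock_req_notify() and run_handler() are 
hypothetical stand-ins, and the race in step c) is simulated by letting the 
failure and the second completion appear at the moment the CQ is found empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real cm state and counters. */
enum mock_state { MOCK_CONNECTED, MOCK_DISCONNECTED };

struct mock_cm {
	enum mock_state state;
	int tx_descs_posted;  /* posted, not yet reaped */
	int cq_ready;         /* completions currently in the CQ */
	int late_completions; /* completions arriving together with the failure */
};

/* Reap one completion. Returns 0 on success, -1 if the CQ is empty. */
static int mock_handle_tx_cq_event(struct mock_cm *cm)
{
	if (cm->cq_ready == 0) {
		/* Simulated race (steps c and d): the timeout work clears
		 * CONNECTED and the second completion lands in the CQ. */
		if (cm->late_completions) {
			cm->state = MOCK_DISCONNECTED;
			cm->cq_ready += cm->late_completions;
			cm->late_completions = 0;
		}
		return -1;
	}
	cm->cq_ready--;
	cm->tx_descs_posted--;
	return 0;
}

/* Models re-arming with missed-event reporting: 1 if completions are pending. */
static int mock_req_notify(const struct mock_cm *cm)
{
	return cm->cq_ready > 0 ? 1 : 0;
}

/* Runs the handler loop once; returns the tx_descs still counted as posted. */
static int run_handler(bool break_when_disconnected)
{
	struct mock_cm cm = { MOCK_CONNECTED, 2, 1, 1 };
	int err, rc = 0;

	do {
		do {
			err = mock_handle_tx_cq_event(&cm);
		} while (!err);

		/* The check removed by this patch. */
		if (break_when_disconnected && cm.state != MOCK_CONNECTED)
			break;

		rc = mock_req_notify(&cm);
	} while (rc > 0);

	return cm.tx_descs_posted;
}

/* Prints and checks both variants of the loop. */
static void demo(void)
{
	int with_check = run_handler(true);
	int without_check = run_handler(false);

	printf("with state check: %d left\n", with_check);
	printf("without state check: %d left\n", without_check);
	assert(with_check == 1);    /* second tx_desc never handled */
	assert(without_check == 0); /* all tx_descs drained */
}
```

Calling demo() from a main() shows one tx_desc left over when the state 
check is kept and none when it is removed, which matches the nonzero 
tx_descs_posted we saw on the leaked cm.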

> Do you the reference on cm that is held because of the timer?
> Please describe what the problem is, and how you are improving the situation.
> 
> In case this approach is the right solution, the patch should also change the
> dtr_handle_tx_cq_event() function to type void.
> 
> best regards,
>   Philipp
> 
> On Mon, Jun 24, 2024 at 8:22 AM zhengbing.huang
> <zhengbing.huang at easystack.cn> wrote:
>>
>> From: Dongsheng Yang <dongsheng.yang at easystack.cn>
>>
>> We need to drain all tx in disconnect to put all kref for cm
>>
>> Signed-off-by: Dongsheng Yang <dongsheng.yang at easystack.cn>
>> ---
>>   drbd/drbd_transport_rdma.c | 3 ---
>>   1 file changed, 3 deletions(-)
>>
>> diff --git a/drbd/drbd_transport_rdma.c b/drbd/drbd_transport_rdma.c
>> index b7ccb15d4..9a6d15b78 100644
>> --- a/drbd/drbd_transport_rdma.c
>> +++ b/drbd/drbd_transport_rdma.c
>> @@ -1956,9 +1956,6 @@ static void dtr_tx_cq_event_handler(struct ib_cq *cq, void *ctx)
>>                          err = dtr_handle_tx_cq_event(cq, cm);
>>                  } while (!err);
>>
>> -               if (cm->state != DSM_CONNECTED)
>> -                       break;
>> -
>>                  rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
>>                  if (unlikely(rc < 0)) {
>>                          struct drbd_transport *transport = cm->path->path.transport;
>> --
>> 2.27.0
>>
> .
> 

