[Drbd-dev] [PATCH] drbd: retry the IO when connection lost

rui.xu rui.xu at easystack.cn
Thu Aug 18 13:47:29 CEST 2022


Hi Joel,
     I have sent a new patch for it. For requests that have completed, but for which no
barrier ack has yet been received, we can simply mark the corresponding blocks as out of sync.
Those blocks will then be resynced when the connection is re-established.
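Roughly, the idea looks like the sketch below (an illustration, not the actual patch).
The helper and field names beyond those already quoted in this thread
(resource->transfer_log, tl_requests, conn_peer_device(), drbd_set_out_of_sync(),
RQ_NET_DONE, peer_node_id) are my reading of the drbd 9.1 tree and may not match the
code exactly:

#include "drbd_int.h"

/*
 * Sketch: on connection loss, walk the transfer log and mark the blocks of
 * every request that completed towards the application but is still waiting
 * for a barrier ack as out of sync towards the lost peer, so that they are
 * covered by the resync once the connection is re-established.
 */
static void tl_mark_unacked_out_of_sync(struct drbd_connection *connection)
{
	struct drbd_resource *resource = connection->resource;
	struct drbd_request *req;
	int idx = connection->peer_node_id;

	list_for_each_entry(req, &resource->transfer_log, tl_requests) {
		struct drbd_peer_device *peer_device;

		/* skip requests that never completed, or whose barrier ack
		 * has already been processed */
		if (!(req->net_rq_state[idx] & RQ_NET_OK) ||
		    (req->net_rq_state[idx] & RQ_NET_DONE))
			continue;

		peer_device = conn_peer_device(connection, req->device->vnr);
		drbd_set_out_of_sync(peer_device, req->i.sector, req->i.size);
	}
}

Marking the blocks out of sync is the conservative choice: if the peer only saw a
network outage it merely costs a small unnecessary resync, and if the peer crashed
it guarantees the affected blocks are resynced.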
Best regards,
Xu

From: Joel Colledge <joel.colledge at linbit.com>
Date: 2022-08-13 00:33:24
To:  Rui Xu <rui.xu at easystack.cn>
Cc:  Philipp Reisner <philipp.reisner at linbit.com>, drbd-dev at lists.linbit.com, dongsheng.yang at easystack.cn
Subject: Re: [PATCH] drbd: retry the IO when connection lost

>Hi Xu,
>
>> And it's simpler than the current mechanism.
>
>It certainly is.
>
>Unfortunately it breaks other things. I will not comment on details of
>the code, but rather on the core architectural concern.
>
>I believe requests that were not yet completed could be retried as you propose.
>
>The difficult requests are those that are completed, but for which no
>barrier ack has yet been received. These requests may not yet have
>been persisted on the peer, even with protocol C. Only once the
>barrier ack has been received do we know that the write has been
>persisted. Until then the peer might lose the write if it crashes.
>
>Until we regain quorum, we do not know what to do with such requests.
>There are 2 possibilities:
>a) It may be that only a network outage occurred. In this case we want
>to resume without a resync.
>b) It may be that the peer crashed. In this case we need to perform a
>resync including the blocks corresponding to these requests.
>
>We keep the requests in the transfer log until we regain quorum, so
>that we know whether we are in situation a) or b).
>
>Your patch assumes that requests marked "OK" have already been persisted
>on the peer:
>+	} else if (req->net_rq_state[idx] & RQ_NET_OK) {
>+		goto barrier_acked;
>
>That is, the patch assumes that situation a) will occur. If b)
>actually occurred, then the necessary blocks will not be resynced and
>this could cause data corruption.
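(To make the distinction concrete: in my reading of the drbd 9 request state
flags, RQ_NET_OK only records that the peer acknowledged the write, while a
second flag, which I take to be RQ_NET_DONE, is set once the barrier ack has
been processed. A check for "persisted on the peer" would then need both bits;
the helper below is an illustration built on that assumption, not code from
the tree.)

/*
 * Illustration only: a request is only known to be persisted on the peer
 * once its barrier ack has been processed.  RQ_NET_DONE is assumed to be
 * the flag recording that; RQ_NET_OK alone only says the peer acknowledged
 * the write, which is exactly the window described above.
 */
static bool req_persisted_on_peer(const struct drbd_request *req, int idx)
{
	return (req->net_rq_state[idx] & RQ_NET_OK) &&
	       (req->net_rq_state[idx] & RQ_NET_DONE);
}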
>
>I am very ready to believe that there is a simpler way of dealing with
>suspended requests, but it must handle these different possibilities.
>
>> My test also hit a problem introduced by commit 33600a4632f2.
>> I have three nodes running drbd 9.1 (node-1, node-2 and node-3);
>> node-1 is primary and the other nodes are secondary. Both quorum and
>> quorum-minimum-redundancy are set to 2.
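(For reference, that setup corresponds to a resource configuration along the
following lines; the resource name and the per-node sections are placeholders:)

resource r0 {
    options {
        quorum 2;
        quorum-minimum-redundancy 2;
    }
    # ... the usual "on node-1", "on node-2" and "on node-3" sections
    #     with addresses and volumes go here ...
}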
>
>Indeed, the quorum-minimum-redundancy implementation is now stricter.
>Previously it allowed requests to complete which should not have been
>allowed to. The stricter implementation introduces some tricky corner
>cases which make it hard to use. I recommend that you do not use it
>unless you are really certain that you need it. There may have been
>some confusing recommendations in the past; I recommended it myself for
>a while. In general, using quorum-minimum-redundancy is no longer
>recommended.
>
>Best regards,
>Joel



