<div style="line-height:1.7;color:#000000;font-size:14px;font-family:Arial"><div>Hi Joel,</div><div>&nbsp;&nbsp;&nbsp;&nbsp; I have sent a new patch that addresses this. For requests that are completed but for which no</div><div>barrier ack has yet been received, we can simply mark the corresponding blocks as out of sync.</div><div>Those blocks will then be resynced when the connection is re-established.</div><div>Best regards,</div><div>Xu<br></div><br><pre><br>From: Joel Colledge &lt;joel.colledge@linbit.com&gt;

Date: 2022-08-13 00:33:24

To:  Rui Xu &lt;rui.xu@easystack.cn&gt;

Cc:  Philipp Reisner &lt;philipp.reisner@linbit.com&gt;,drbd-dev@lists.linbit.com,dongsheng.yang@easystack.cn

Subject: Re: [PATCH] drbd: retry the IO when connection lost

&gt;Hi Xu,

&gt;

&gt;&gt; And it's simpler than the current mechanism.

&gt;

&gt;It certainly is.

&gt;

&gt;Unfortunately it breaks other things. I will not comment on details of

&gt;the code, but rather on the core architectural concern.

&gt;

&gt;I believe requests that were not yet completed could be retried as you propose.

&gt;

&gt;The difficult requests are those that are completed, but for which no

&gt;barrier ack has yet been received. These requests may not yet have

&gt;been persisted on the peer, even with protocol C. Only once the

&gt;barrier ack has been received do we know that the write has been

&gt;persisted. Until then the peer might lose the write if it crashes.

&gt;

&gt;Until we regain quorum, we do not know what to do with such requests.

&gt;There are 2 possibilities:

&gt;a) It may be that only a network outage occurred. In this case we want

&gt;to resume without a resync.

&gt;b) It may be that the peer crashed. In this case we need to perform a

&gt;resync including the blocks corresponding to these requests.

&gt;

&gt;We keep the requests in the transfer log until we regain quorum, so

&gt;that we know whether we are in situation a) or b).
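
To make the two cases concrete, here is a minimal sketch in plain C of the
decision described above. It is not DRBD code: the enums, the struct and the
function name are hypothetical, and the real transfer log tracks far more
state than this model does.

```c
/* Hypothetical model of the decision made once quorum is regained.
 * None of these names exist in DRBD; they are illustrative only. */
enum peer_outcome { NETWORK_OUTAGE_ONLY, PEER_CRASHED };

enum tl_action { ACT_DONE, ACT_RETRY, ACT_MARK_OUT_OF_SYNC };

struct tl_request {
    unsigned long sector; /* first sector covered by the write */
    int completed;        /* write was already reported as completed */
    int barrier_acked;    /* peer confirmed the write is persisted */
};

/* Decide what to do with one request kept in the transfer log. */
enum tl_action decide_action(struct tl_request req, enum peer_outcome outcome)
{
    if (!req.completed)
        return ACT_RETRY;            /* never completed: safe to retry */
    if (req.barrier_acked)
        return ACT_DONE;             /* known to be persisted on the peer */
    if (outcome == PEER_CRASHED)
        return ACT_MARK_OUT_OF_SYNC; /* case b): the peer may have lost it */
    return ACT_DONE;                 /* case a): resume without a resync */
}
```

In this model, a completed request without a barrier ack is the only kind
whose handling depends on the outcome, which is why it has to stay in the
transfer log until the outcome is known.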

&gt;

&gt;Your patch assumes that "OK" requests can be assumed to have been

&gt;persisted on the peer:

&gt;+ } else if (req-&gt;net_rq_state[idx] &amp; RQ_NET_OK) {

&gt;+ goto barrier_acked;

&gt;

&gt;That is, the patch assumes that situation a) will occur. If b)

&gt;actually occurred, then the necessary blocks will not be resynced and

&gt;this could cause data corruption.

&gt;

&gt;I am very ready to believe that there is a simpler way of dealing with

&gt;suspended requests, but it must handle these different possibilities.

&gt;

&gt;&gt; My test also hit a problem introduced by commit 33600a4632f2.

&gt;&gt; I have three nodes running DRBD 9.1 (node-1, node-2 and node-3),

&gt;&gt; node-1 is primary and other nodes are secondary. Both quorum and

&gt;&gt; quorum-minimum-redundancy are set to 2.

&gt;

&gt;Indeed, the quorum-minimum-redundancy implementation is now stricter.

&gt;Previously it allowed requests to complete which should not have been

&gt;allowed to. The stricter implementation introduces some tricky corner

&gt;cases which make it hard to use. I recommend that you do not use it

&gt;unless you are really certain that you need it. There may have been

&gt;some confusing recommendations in the past. I recommended it for a

&gt;while. Now it is not recommended to use quorum-minimum-redundancy in

&gt;general.

&gt;

&gt;Best regards,

&gt;Joel

</pre></div><br>