[Drbd-dev] [PATCH 2/2] drbd: delay resync start unless source has transferred to L_SYNC_SOURCE
Zhang Duan
duan.zhang at easystack.cn
Wed Nov 18 09:46:21 CET 2020
drbd_start_resync() may be rescheduled when down_trylock() fails. This leaves
the source in L_WF_BITMAP_S while the target is already in L_SYNC_TARGET and
has already sent its resync requests. If resync then proceeds while the source
is still in L_WF_BITMAP_S, data can be lost, as the following sequence shows:
L_WF_BITMAP_S (source)          L_SYNC_TARGET (target)
                                resync request(sector A)
reply old data(A)               read & write old data(A)
new IO(A)
send oos(A)                     set oos(A)
A is at new version             resync write A done
                                set in sync(A), but A is at old version
Signed-off-by: ZhangDuan <duan.zhang at easystack.cn>
---
drbd/drbd_receiver.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git drbd/drbd_receiver.c drbd/drbd_receiver.c
index a31e44b2..7a9ce4d0 100644
--- drbd/drbd_receiver.c
+++ drbd/drbd_receiver.c
@@ -3301,6 +3301,15 @@ static int receive_DataRequest(struct drbd_connection *connection, struct packet
 		return ignore_remaining_packet(connection, pi->size);
 	}
 
+	/* Tell the target to retry later, once the rescheduled
+	 * drbd_start_resync() has completed. Otherwise sending oos
+	 * concurrently with resync may lead to data loss. */
+	if ((pi->cmd == P_RS_DATA_REQUEST || pi->cmd == P_CSUM_RS_REQUEST) &&
+	    peer_device->repl_state[NOW] == L_WF_BITMAP_S) {
+		drbd_send_ack_rp(peer_device, P_RS_CANCEL, p);
+		return ignore_remaining_packet(connection, pi->size);
+	}
+
 	peer_req = drbd_alloc_peer_req(peer_device, GFP_TRY);
 	err = -ENOMEM;
 	if (!peer_req)
--
2.24.0.windows.2
--
Sincerely yours,
Zhang Duan