[Drbd-dev] DRBD8: Resync stalled at 100% due to race condition
Philipp Reisner
philipp.reisner at linbit.com
Mon Jun 4 11:08:59 CEST 2007
On Thursday 24 May 2007 14:41:53 Montrose, Ernest wrote:
> Hi all,
> We are seeing a problem where a resync hangs on the SyncSource at the
> end. The SyncTarget finished OK and shows Connected. The signature on
> the SyncSource is:
>
[...]
I think you are right in saying that receive_state() is wrong, but
I have another interpretation of the logs.
> What I think is happening is that there is a race condition where
> drbd_resync_finished() races receive_state() in this manner:
> 1) The resync finished on drbd0 and we enter drbd_resync_finished().
> But before it can set the state to Connected, drbd15, which is a higher
> priority device, starts syncing. This puts drbd0 in PausedSyncS from
> SyncSource.
Right.
> 2) drbd_resync_finished() for drbd0 now tries to go to Connected from
> PausedSyncS. The logs below print this transition, but the transition
> was not actually committed since we print before we actually assign the
> new values.
No, the log line "conn (PausedSyncS -> Connected)" is printed by
_drbd_set_state() (with the PSC macros), which runs under the req_lock
and also performs the assignment "mdev->state.i = ns.i;".
The log is okay.
I think we have a race in receive_state(): it sets nconn to the
connection state PausedSyncS, then the resync finishes (setting
mdev->state.conn = Connected) before receive_state() reaches the call
to spin_lock(). When receive_state() finally continues, it assigns
the now obsolete value of nconn to mdev->state.conn again by calling
_drbd_set_state().
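
The interleaving described above can be reproduced in a small,
single-threaded model. This is only an illustrative sketch, not the
attached patch and not DRBD code: struct device, lock(), unlock(), the
"window" hook and all function names here are hypothetical stand-ins,
and a real receive_state() derives nconn from more than the local
state. It shows the stale write and one generic mitigation (re-check
the state under the lock and retry if it changed).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for DRBD connection states. */
enum conn_state { CONNECTED, SYNC_SOURCE, PAUSED_SYNC_S };

struct device {
    enum conn_state conn; /* models mdev->state.conn */
    bool locked;          /* models req_lock in this single-threaded sketch */
};

/* Hook that models "another thread runs in this window". */
typedef void (*hook_fn)(struct device *);

static void lock(struct device *d)   { assert(!d->locked); d->locked = true; }
static void unlock(struct device *d) { assert(d->locked);  d->locked = false; }

/* Models the end of a resync: the state goes to CONNECTED under the lock. */
static void resync_finished(struct device *d)
{
    lock(d);
    d->conn = CONNECTED;
    unlock(d);
}

/* Racy pattern: nconn is chosen before the lock is taken and written
 * back later without checking whether the state changed meanwhile. */
static void receive_state_racy(struct device *d, hook_fn window)
{
    enum conn_state nconn = d->conn; /* read without the lock */

    if (window)
        window(d);                   /* e.g. the resync finishes here */

    lock(d);
    d->conn = nconn;                 /* may overwrite a newer state */
    unlock(d);
}

/* Checked pattern: remember the state the decision was based on and
 * apply nconn only if that state is still current under the lock. */
static bool receive_state_checked(struct device *d, hook_fn window)
{
    enum conn_state seen = d->conn;
    enum conn_state nconn = seen;    /* simplification: keep the seen state */
    bool applied;

    if (window)
        window(d);

    lock(d);
    applied = (d->conn == seen);
    if (applied)
        d->conn = nconn;
    unlock(d);
    return applied;
}

/* Simple retry loop around the checked variant. */
static enum conn_state receive_state_retry(struct device *d, hook_fn window)
{
    while (!receive_state_checked(d, window))
        window = NULL;               /* the modelled race fires only once */
    return d->conn;
}
```

Starting from PAUSED_SYNC_S and passing resync_finished as the window
hook, receive_state_racy() leaves the device in PAUSED_SYNC_S although
the resync has completed, while receive_state_retry() ends in
CONNECTED.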
> I include a patch that may at least help illustrate the issue, if not fix
> it, as I am not sure the req_lock can be held this early without causing
> a deadlock or other performance issues.
>
I considered the approach of making drbd_sync_handshake() run
under the req_lock and having a simple retry mechanism in receive_state().
Since 8.0.x is not the stable branch I decided to go with that small
patch. (I did not do any testing of this, since I guess it is rather
hard to hit this exact timing...)
Ernest, thanks for pointing this out!
As soon as you agree, I will commit this patch...
-phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_rsync_hangs.diff
Type: text/x-diff
Size: 2121 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20070604/6b164887/fix_rsync_hangs.bin