[Drbd-dev] DRBD8: Resync stalled at 100% due to race condition

Philipp Reisner philipp.reisner at linbit.com
Mon Jun 4 11:08:59 CEST 2007


On Thursday 24 May 2007 14:41:53 Montrose, Ernest wrote:
> Hi all,
> We are seeing a problem where a resync hangs on the SyncSource at the
> end.  The SyncTarget finished OK and shows Connected. The signature on
> the SyncSource is:
>
[...]

I think you are right in saying that receive_state() is wrong, but
I have another interpretation of the logs.

> What I think is happening is that there is a race condition where
> drbd_resync_finished() races receive_state() in this manner:
> 1)  The resync finished on drbd0 and we enter drbd_resync_finished().
> But before it can set the state to Connected, drbd15, which is a
> higher priority device, starts syncing. This puts drbd0 in PausedSyncS
> from SyncSource.

Right.

> 2) Drbd_resync_finished() for drbd0 now tries to go to Connected from
> PausedSyncS.  The logs below print this transition but the transition
> was not actually committed since we print before we actually assign the
> new values.

No, the log line "conn (PausedSyncS -> Connected)" is printed by
_drbd_set_state() (with the PSC macros), which runs under the req_lock,
and the assignment "mdev->state.i = ns.i;" happens there as well.
The log is okay.

I think we have a race between receive_state() and the end of the
resync: receive_state() sets nconn=PausedSyncS, then the resync
finishes (mdev->state.conn = Connected) before receive_state() reaches
its call to spin_lock(). When receive_state() finally continues, it
assigns the now obsolete value of nconn to mdev->state.conn again by
calling _drbd_set_state().
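
To make the interleaving concrete, here is a minimal user-space sketch
that replays it in a single thread. It is not DRBD code and not the
attached patch: struct mdev_model, pick_nconn(), set_state() and
replay_race() are made-up names, and the state enum is only an
illustrative subset.

```c
#include <assert.h>

/* Simplified stand-ins for the DRBD connection states (illustrative subset). */
enum conn_state { CONNECTED, SYNC_SOURCE, PAUSED_SYNC_S };

/* Hypothetical model of the device; the real mdev holds far more state. */
struct mdev_model {
    enum conn_state conn;
};

/* receive_state(), first half: pick nconn before taking any lock. */
enum conn_state pick_nconn(const struct mdev_model *m)
{
    return m->conn;
}

/* The resync finishes in the window and moves the device to Connected. */
void resync_finished(struct mdev_model *m)
{
    m->conn = CONNECTED;
}

/* receive_state(), second half: write nconn without re-checking it. */
void set_state(struct mdev_model *m, enum conn_state nconn)
{
    m->conn = nconn;
}

/* Replays the interleaving and returns the final connection state. */
enum conn_state replay_race(void)
{
    struct mdev_model m = { PAUSED_SYNC_S };
    enum conn_state nconn = pick_nconn(&m);

    assert(nconn == PAUSED_SYNC_S);
    resync_finished(&m);    /* PausedSyncS -> Connected */
    set_state(&m, nconn);   /* Connected -> PausedSyncS, stale value */
    return m.conn;
}
```

In this model the device ends up in PausedSyncS although the resync
has already finished, which would match a SyncSource that never
leaves the paused state.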

> I include a patch that may at least help illustrate the issue if not fix
> it as I am not sure the req_lock can be held this early without causing
> a deadlock or other performance issues.
>

I considered the approach of making drbd_sync_handshake() run
under the req_lock and having a simple retry mechanism in receive_state().
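
For illustration only, a retry mechanism of that kind could look
roughly like the user-space sketch below. It is not the attached patch
and not DRBD code: a pthread mutex stands in for the req_lock, and
struct mdev_model, read_conn(), try_commit(), update_with_retry(),
keep_state() and finish_resync() are hypothetical names. The idea is to
decide on the new state from a snapshot, then commit it under the lock
only if the state still matches that snapshot, and start over otherwise.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the DRBD connection states (illustrative subset). */
enum conn_state { CONNECTED, SYNC_SOURCE, PAUSED_SYNC_S };

/* Hypothetical device model; the mutex plays the role of the req_lock. */
struct mdev_model {
    pthread_mutex_t req_lock;
    enum conn_state conn;
};

/* Take a snapshot of the connection state under the lock. */
enum conn_state read_conn(struct mdev_model *m)
{
    pthread_mutex_lock(&m->req_lock);
    enum conn_state c = m->conn;
    pthread_mutex_unlock(&m->req_lock);
    return c;
}

/* Example decision function: keep whatever state was seen. */
enum conn_state keep_state(enum conn_state seen)
{
    return seen;
}

/* Example interference: the resync finishes and sets Connected. */
void finish_resync(struct mdev_model *m)
{
    pthread_mutex_lock(&m->req_lock);
    m->conn = CONNECTED;
    pthread_mutex_unlock(&m->req_lock);
}

/* Commit nconn only if the state still equals the snapshot. */
bool try_commit(struct mdev_model *m, enum conn_state seen,
                enum conn_state nconn)
{
    pthread_mutex_lock(&m->req_lock);
    bool ok = (m->conn == seen);
    if (ok)
        m->conn = nconn;
    pthread_mutex_unlock(&m->req_lock);
    return ok;
}

/*
 * Decide outside the lock, commit under it, retry when the snapshot went
 * stale. 'interfere' is a demo hook that runs once, inside the race window.
 * Returns the number of attempts used, or -1 if max_tries was exhausted.
 */
int update_with_retry(struct mdev_model *m,
                      enum conn_state (*decide)(enum conn_state),
                      void (*interfere)(struct mdev_model *),
                      int max_tries)
{
    assert(max_tries > 0);
    for (int tries = 1; tries <= max_tries; tries++) {
        enum conn_state seen = read_conn(m);
        enum conn_state nconn = decide(seen);

        if (interfere != NULL) {
            interfere(m);       /* simulate the concurrent state change */
            interfere = NULL;   /* only disturb the first attempt */
        }
        if (try_commit(m, seen, nconn))
            return tries;
    }
    return -1;
}
```

Starting from PausedSyncS with finish_resync() as the interfering
hook, the first attempt is rejected because the snapshot is stale, and
the second attempt commits Connected instead of writing PausedSyncS
back.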

Since 8.0.x is not the stable branch I decided to go with that small
patch. (I did not do any testing of this, since I guess it is rather
hard to hit this exact timing...)

Ernest, thanks for pointing this out!
As soon as you agree, I will commit this patch...

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_rsync_hangs.diff
Type: text/x-diff
Size: 2121 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20070604/6b164887/fix_rsync_hangs.bin

