[Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent

Montrose, Ernest Ernest.Montrose at stratus.com
Mon Nov 26 15:43:11 CET 2007

Well, as it turned out, my last idea did not completely fix the issue.
So it may be that my description and staging of it were not complete. I
will try to describe the problem completely and then test your patch. I
will get back to you with the results as soon as I have them.

FYI, your last idea also introduced an issue where a sync would stall
forever if paused and resumed quickly. If you checked it in somewhere,
you might want to back it out.



-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner at linbit.com] 
Sent: Monday, November 26, 2007 9:32 AM
To: drbd-dev at linbit.com
Cc: Ernest Montrose; Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: incorrect state transition Connected
->WFBitMapS and UpToDate->Inconsistent

On Friday 16 November 2007 03:36:19 Ernest Montrose wrote:
> Phil,
> I tested the patch and unfortunately it does not fix the race
> though I believe it fixes the ASSERT issues.
> Essentially, when the resync is done and we are in
> if we pause the device then we send the state to the peer.  The peer
> done syncing at that point.  He does a sync_hanshake() that sends its
> to WBItMapS and Pdsk=Inconsistent (since the target has not changed
> state to connected and UptoDate yet. When resync_finished is done we
> Uptodate and connected and we're stuck.
> I tested yet another idea which seems to close the racy window.  I
> drbd_resync_finished into two parts.  A cleanup part that the worker
> schedule to do the clean up and a done part that
> changes the state right away when the resync is done.  I include an
> untested patch to illustrate that idea.
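
The split described in the quoted message can be sketched roughly as
follows. This is a standalone userspace illustration, not DRBD code and
not the patch from this thread: the struct, the enums, and the function
names (toy_dev, resync_done_part, resync_cleanup_part,
state_reported_to_peer) are all hypothetical, and locking is left out
entirely. It only shows the ordering idea: the state change happens at
once, while the slower cleanup is deferred to a worker.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified states; the names only echo the log output. */
enum conn_state { SYNC_TARGET, CONNECTED };
enum disk_state { INCONSISTENT, UP_TO_DATE };

/* Hypothetical stand-in for a device; not the real DRBD structure. */
struct toy_dev {
    enum conn_state conn;
    enum disk_state disk;
    bool cleanup_pending;   /* set when the deferred part still has to run */
};

/*
 * "Done" part: change the state right away when the resync completes,
 * so that any state sent to the peer afterwards is already final.
 */
static void resync_done_part(struct toy_dev *d)
{
    d->conn = CONNECTED;
    d->disk = UP_TO_DATE;
    d->cleanup_pending = true;
}

/*
 * "Cleanup" part: run later by a worker. It performs the slow
 * housekeeping and must not touch the connection or disk state.
 */
static void resync_cleanup_part(struct toy_dev *d)
{
    assert(d->cleanup_pending);
    d->cleanup_pending = false;
}

/* The disk state a pause request arriving now would report to the peer. */
static enum disk_state state_reported_to_peer(const struct toy_dev *d)
{
    return d->disk;
}
```

In this toy model, a pause request that arrives between the two parts
already sees UP_TO_DATE, which is the window the split is meant to close.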

Hi Ernest,

Finally the attached patch made it into the GIT repository
(see 3a57119417c46c51dd4bc720ab7dbf14228f05bb.git.txt)
It is slightly different from the patch I suggested at first.

As you could not confirm that the bug is closed for you, I tried
to reproduce it here now (with the attached patch) and some
instrumentation code that makes the call to drbd_resync_finished()
last 10 seconds.

I tested pausing and continuing on the SyncTarget and on the
SyncSource side. I could not find any issues:

[42949590.920000] drbd0: conn( StartingSyncT -> WFSyncUUID )
[42949590.920000] drbd0: conn( WFSyncUUID -> SyncTarget )
[42949590.920000] drbd0: Began resync as SyncTarget (will sync 262244 KB
[65561 bits set]).
[42949590.920000] drbd0: Writing meta data super block now.
[42949669.270000] drbd0: Warn long sleep start
[42949672.120000] drbd0: conn( SyncTarget -> PausedSyncT ) user_isp( 0
-> 1 )
[42949672.120000] drbd0: Resync suspended
[42949678.610000] drbd0: conn( PausedSyncT -> SyncTarget ) user_isp( 1
-> 0 )
[42949678.610000] drbd0: Syncer continues.
[42949679.280000] drbd0: Warn long sleep stop
[42949679.280000] drbd0: Resync done (total 87 sec; paused 6 sec; 3236
[42949679.280000] drbd0: conn( SyncTarget -> Connected ) disk(
Inconsistent -> UpToDate )
[42949679.280000] drbd0: Writing meta data super block now.

Ernest, can you please confirm that this issue is solved for you with
this patch, or provide logfile output of a failing test?


: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
