[Drbd-dev] DRBD8: incorrect state transition Connected ->WFBitMapS and UpToDate->Inconsistent

Montrose, Ernest Ernest.Montrose at stratus.com
Fri Nov 30 01:01:49 CET 2007


Phil,
Sorry it took me a while to get to this, but I am still able to reproduce
the problem.  It's either:
1) I am using older code, as we are unable to get the latest code base for
now, OR
2) Your testing is a tad different than mine.  I use your two patches,
but I wonder if you actually do the "drbdsetup dev0 pause-sync" and
"resume-sync" exactly between the time you get the first and last sleep
message. Also be aware that you have to have "syncer" set with
--after=[-1,0,1,2,...] for drbd0, 1, 2, 3, etc. You would then do an
"invalidate" on, say, drbd10, then a "pause" and "resume" on drbd0.

In my case, I do:
drbdsetup /dev/drbd27 invalidate
Then I wait for the message from drbd_resync_finished() and quickly do:
drbdsetup /dev/drbd0 pause-sync
usleep 1000
drbdsetup /dev/drbd0 resume-sync
usleep 1000
drbdsetup /dev/drbd0 pause-sync
usleep 1000
drbdsetup /dev/drbd0 resume-sync

I actually have a script that does this. 
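
A minimal sketch of what such a script could look like follows. It is not the
actual script; DRBDSETUP, PAUSE_SECS and toggle_sync are illustrative names,
and "sleep 0.001" stands in for "usleep 1000" on the assumption that the local
sleep accepts fractional seconds.

```shell
#!/bin/sh
# Sketch only: DRBDSETUP, PAUSE_SECS and toggle_sync are illustrative names.
DRBDSETUP="${DRBDSETUP:-drbdsetup}"
PAUSE_SECS="${PAUSE_SECS:-0.001}"   # roughly "usleep 1000" (1000 microseconds)

# Run pause-sync/resume-sync on device $1, $2 times (default 2),
# sleeping between commands as in the manual sequence above.
toggle_sync() {
    ts_dev="$1"
    ts_rounds="${2:-2}"
    ts_i=1
    while [ "$ts_i" -le "$ts_rounds" ]; do
        "$DRBDSETUP" "$ts_dev" pause-sync
        sleep "$PAUSE_SECS"
        "$DRBDSETUP" "$ts_dev" resume-sync
        # No sleep after the final resume-sync.
        if [ "$ts_i" -lt "$ts_rounds" ]; then
            sleep "$PAUSE_SECS"
        fi
        ts_i=$((ts_i + 1))
    done
}
```

After "drbdsetup /dev/drbd27 invalidate" and the message from
drbd_resync_finished(), the sequence would be run as "toggle_sync /dev/drbd0".
Setting DRBDSETUP=echo prints the commands instead of running them.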

EM--


-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner at linbit.com] 
Sent: Monday, November 26, 2007 9:32 AM
To: drbd-dev at linbit.com
Cc: Ernest Montrose; Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: incorrect state transition Connected
->WFBitMapS and UpToDate->Inconsistent

On Friday 16 November 2007 03:36:19 Ernest Montrose wrote:
> Phil,
> I tested the patch and unfortunately it does not fix the race condition,
> though I believe it fixes the ASSERT issues.
> Essentially, when the resync is done and we are in drbd_resync_finished(),
> if we pause the device then we send the state to the peer.  The peer is
> done syncing at that point.  He does a sync_handshake() that sets its state
> to WFBitMapS and Pdsk=Inconsistent (since the target has not changed its
> state to Connected and UpToDate yet). When resync_finished is done we go
> UpToDate and Connected and we're stuck.
>
> I tested yet another idea which seems to close the racy window.  I turned
> drbd_resync_finished into two parts: a cleanup part that the worker can
> schedule to do the cleanup, and a done part that
> changes the state right away when the resync is done.  I include an
> untested patch to illustrate that idea.
>

Hi Ernest,

Finally the attached patch made it into the GIT repository
(see 3a57119417c46c51dd4bc720ab7dbf14228f05bb.git.txt).
It is slightly different from the patch I suggested at first.

As you could not confirm that the bug is closed for you, I tried
to reproduce it here now (with the attached patch) and some
instrumentation code that makes the call to drbd_resync_finished()
last 10 seconds.

I tested pausing and continuing on the SyncTarget and on the
SyncSource side. I could not find any issues:

[42949590.920000] drbd0: conn( StartingSyncT -> WFSyncUUID )
[42949590.920000] drbd0: conn( WFSyncUUID -> SyncTarget )
[42949590.920000] drbd0: Began resync as SyncTarget (will sync 262244 KB [65561 bits set]).
[42949590.920000] drbd0: Writing meta data super block now.
[42949669.270000] drbd0: Warn long sleep start
[42949672.120000] drbd0: conn( SyncTarget -> PausedSyncT ) user_isp( 0 -> 1 )
[42949672.120000] drbd0: Resync suspended
[42949678.610000] drbd0: conn( PausedSyncT -> SyncTarget ) user_isp( 1 -> 0 )
[42949678.610000] drbd0: Syncer continues.
[42949679.280000] drbd0: Warn long sleep stop
[42949679.280000] drbd0: Resync done (total 87 sec; paused 6 sec; 3236 K/sec)
[42949679.280000] drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
[42949679.280000] drbd0: Writing meta data super block now.
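
To compare a run against the output above, a small log check along these
lines might help. This is only a sketch: check_resync_log is an illustrative
helper, not part of DRBD, and it assumes the bad transition from the subject
line would show up in the kernel log as "Connected -> WFBitMapS".

```shell
#!/bin/sh
# Sketch only: check_resync_log is an illustrative helper, not a DRBD tool.
# Reads kernel log text on stdin and looks at lines for device $1 (e.g. drbd0).
# Succeeds only if the SyncTarget reached Connected/UpToDate and the log
# shows no "Connected -> WFBitMapS" transition for that device.
check_resync_log() {
    crl_log=$(grep -F "$1: " || true)
    if printf '%s\n' "$crl_log" | grep -qF 'Connected -> WFBitMapS'; then
        return 1
    fi
    printf '%s\n' "$crl_log" |
        grep -qF 'conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )'
}
```

It could be used as "dmesg | check_resync_log drbd0", with a non-zero exit
status pointing at a log worth sending along.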

Ernest, can you please confirm that this issue is solved for you with
that patch, or provide logfile output of a failing test?

Thanks!

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

