[Drbd-dev] Stuck in WFBitMapS again

Montrose, Ernest Ernest.Montrose at stratus.com
Mon Mar 31 19:12:04 CEST 2008


Phil,
It's been a while here too.  I had put this on the back burner but I am
glad
you have some time to take a look at this. This has been a real sticky
problem for us.

We are seeing a couple of instances of this race condition.  In one case
we go WFBitMapS/WFBitMapT  and stay there.  In the other case we go
Connected/WFBitMapS. 

I include two sets of logs here.  In one log I had some instrumentations
when the problem happened so don't be confused by that. I hope these
help.

Thanks again.

EM--
-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner at linbit.com] 
Sent: Monday, March 31, 2008 10:44 AM
To: Montrose, Ernest
Cc: drbd-dev at linbit.com
Subject: Re: [Drbd-dev] Stuck in WFBitMapS again

Am Mittwoch, 13. Februar 2008 00:55:51 schrieb Montrose, Ernest:
> It looks like we still have a race condition while finishing syncing
and
> pausing/resuming syncing at the same time.
>
> What's happening is that we finish syncing.
> The secondary side goes to Connected at the end of
> drbd_resync_finished()
>
> The Primary side is still in drbd_resync_finished().  While still
there,
>
> the syncer isp flag gets set and unset due to a higher priority device
> syncer getting paused/resumed.
>
> Now the Primary side sends "Inconsistent" state to the Secondary since
> we are still in drbd_resync_finished()
> and have not gone Connected and UptoDate yet. But the other side is
> connected and UpToDate though.
>
> The secondary goes WFBitMapS via drbd_sync_handshake()
>
> In the mean time the Primary finally goes Connected and UptoDate.
>
> The Secondary is stuck in WFBitMapS.
>
> I include a patch that may be the fix for this.  Any other ideas?
>
> Thanks.
> EM--


Hi Ernest,

I know it is quite some time now, but I had to remove this patch.
It breaks manual split brain resolution.
Commit ab7facf4e17d98558d2569ccf954f2c64eb7b650

That was the wrong way of fixing this.

Do you have something with logfiles of that event

-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
-------------- next part --------------
DRBD device gets stuck in WFBitMapS one side and WFBitMapT the other
====================================================================
Feb 24 23:48:34 node0 kernel: drbd27: conn( Unconnected -> WFConnection )
Feb 24 23:48:35 node0 kernel: drbd27: conn( WFConnection -> WFReportParams )
Feb 24 23:48:35 node0 kernel: drbd27: Handshake successful: DRBD Network Protocol version 86
Feb 24 23:48:35 node0 kernel: drbd27: Starting asender thread (from drbd27_receiver [25427])
Feb 24 23:48:35 node0 kernel: drbd27: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
Feb 24 23:48:35 node0 kernel: drbd27: Writing meta data super block now.
Feb 24 23:48:36 node0 kernel: drbd27: role( Secondary -> Primary )
Feb 24 23:48:36 node0 kernel: drbd27: Writing meta data super block now.
Feb 24 23:59:07 node0 kernel: drbd27: PingAck did not arrive in time.
Feb 24 23:59:07 node0 kernel: drbd27: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Feb 24 23:59:07 node0 kernel: drbd27: asender terminated
Feb 24 23:59:07 node0 kernel: drbd27: Terminating asender thread
Feb 24 23:59:07 node0 kernel: drbd27: Creating new current UUID
Feb 24 23:59:07 node0 kernel: drbd27: short read expecting header on sock: r=-512
Feb 24 23:59:07 node0 kernel: drbd27: tl_clear()
Feb 24 23:59:07 node0 kernel: drbd27: Connection closed
Feb 24 23:59:07 node0 kernel: drbd27: Writing meta data super block now.
Feb 24 23:59:07 node0 kernel: drbd27: conn( NetworkFailure -> Unconnected )
Feb 24 23:59:07 node0 kernel: drbd27: receiver terminated
Feb 24 23:59:07 node0 kernel: drbd27: receiver (re)started
Feb 24 23:59:07 node0 kernel: drbd27: conn( Unconnected -> WFConnection )
Feb 25 00:15:33 node0 kernel: drbd27: conn( WFConnection -> WFReportParams )
Feb 25 00:15:33 node0 kernel: drbd27: aftr_isp( 0 -> 1 )
Feb 25 00:15:33 node0 kernel: drbd27: aftr_isp( 1 -> 0 )
Feb 25 00:15:33 node0 kernel: drbd27: Handshake successful: DRBD Network Protocol version 86
Feb 25 00:15:33 node0 kernel: drbd27: Starting asender thread (from drbd27_receiver [25427])
Feb 25 00:15:33 node0 kernel: drbd27: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Feb 25 00:15:33 node0 kernel: drbd27: Writing meta data super block now.
Feb 25 00:15:43 node0 kernel: drbd27: aftr_isp( 0 -> 1 )
Feb 25 00:15:43 node0 kernel: drbd27: peer_isp( 0 -> 1 )
Feb 25 00:15:43 node0 kernel: drbd27: aftr_isp( 1 -> 0 )
Feb 25 00:15:43 node0 kernel: drbd27: peer_isp( 1 -> 0 )

NODE1:
Feb 25 00:15:24 node1 kernel: drbd27: disk( Diskless -> Attaching )
Feb 25 00:15:24 node1 kernel: drbd27: Starting worker thread (from cqueue/0 [7665])
Feb 25 00:15:24 node1 kernel: drbd27: No usable activity log found.
Feb 25 00:15:24 node1 kernel: drbd27: max_segment_size ( = BIO size ) = 32768
Feb 25 00:15:24 node1 kernel: drbd27: drbd_bm_resize called with capacity == 12582456
Feb 25 00:15:24 node1 kernel: drbd27: resync bitmap: bits=1572807 words=49152
Feb 25 00:15:24 node1 kernel: drbd27: size = 6143 MB (6291228 KB)
Feb 25 00:15:24 node1 kernel: drbd27: reading of bitmap took 1 jiffies
Feb 25 00:15:24 node1 kernel: drbd27: recounting of set bits took additional 0 jiffies
Feb 25 00:15:24 node1 kernel: drbd27: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Feb 25 00:15:24 node1 kernel: drbd27: disk( Attaching -> UpToDate )
Feb 25 00:15:24 node1 kernel: drbd27: Writing meta data super block now.
Feb 25 00:15:24 node1 kernel: drbd27: conn( StandAlone -> Unconnected )
Feb 25 00:15:24 node1 kernel: drbd27: Starting receiver thread (from drbd27_worker [16484])
Feb 25 00:15:24 node1 kernel: drbd27: receiver (re)started
Feb 25 00:15:24 node1 kernel: drbd27: conn( Unconnected -> WFConnection )
Feb 25 00:15:34 node1 kernel: drbd27: aftr_isp( 0 -> 1 )
Feb 25 00:15:34 node1 kernel: drbd27: aftr_isp( 1 -> 0 )
Feb 25 00:15:34 node1 kernel: drbd27: Handshake successful: DRBD Network Protocol version 86
Feb 25 00:15:34 node1 kernel: drbd27: conn( WFConnection -> WFReportParams )
Feb 25 00:15:34 node1 kernel: drbd27: Starting asender thread (from drbd27_receiver [16491])
Feb 25 00:15:34 node1 kernel: drbd27: peer( Unknown -> Primary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
Feb 25 00:15:34 node1 kernel: drbd27: Writing meta data super block now.
Feb 25 00:15:34 node1 kernel: drbd27: peer_isp( 1 -> 0 )
Feb 25 00:15:35 node1 kernel: drbd27: unexpected cstate (Connected) in receive_bitmap
Feb 25 00:15:35 node1 kernel: drbd27: conn( Connected -> WFBitMapT )
Feb 25 00:15:44 node1 kernel: drbd27: aftr_isp( 0 -> 1 )
Feb 25 00:15:44 node1 kernel: drbd27: peer_isp( 0 -> 1 )
Feb 25 00:15:45 node1 kernel: drbd27: aftr_isp( 1 -> 0 )
Feb 25 00:15:45 node1 kernel: drbd27: peer_isp( 1 -> 0 )




Still  another case getting Connected one side and WFBitMapS the other.
===========================================================================
NODE0:

Mar  5 22:31:59 node0 kernel: drbd5: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Mar  5 22:31:59 node0 kernel: drbd5: Began resync as SyncTarget (will sync 17628 KB [4407 bits set]).
Mar  5 22:31:59 node0 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:03 node0 kernel: drbd5: conn( SyncTarget -> PausedSyncT ) aftr_isp( 0 -> 1 )
Mar  5 22:32:03 node0 kernel: drbd5: Resync suspended
Mar  5 22:32:03 node0 kernel: drbd5: receive_state:EM--===== oconn(PausedSyncT) peer_state.conn(Connected) peer_state.disk(UpToDate) state.disk(Inconsistent)===
Mar  5 22:32:03 node0 kernel: drbd5: peer_isp( 0 -> 1 )
Mar  5 22:32:03 node0 kernel: drbd5: Resync done (total 1 sec; paused 0 sec; 17628 K/sec)
Mar  5 22:32:03 node0 kernel: drbd5:  uuid[Bitmap] now 98B2279CF244A665
Mar  5 22:32:03 node0 kernel: drbd5:  uuid[History_start] now F29410AEC0882498
Mar  5 22:32:03 node0 kernel: drbd5:  uuid[History_end] now 19FFEC1E2A6CCAAB
Mar  5 22:32:04 node0 kernel: drbd5:  uuid[History_start] now 98B2279CF244A665
Mar  5 22:32:04 node0 kernel: drbd5:  uuid[Bitmap] now 71E86A3E2A741DB6
Mar  5 22:32:04 node0 kernel: drbd5:  uuid[Current] now B392489356D183BC
Mar  5 22:32:04 node0 kernel: drbd5:  uuid[History_start] now 71E86A3E2A741DB6
Mar  5 22:32:04 node0 kernel: drbd5:  uuid[Bitmap] now 0000000000000000
Mar  5 22:32:04 node0 kernel: drbd5: conn( PausedSyncT -> Connected ) disk( Inconsistent -> UpToDate )
Mar  5 22:32:04 node0 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:05 node0 kernel: drbd5: receive_state:EM--===== oconn(Connected) peer_state.conn(WFBitMapS) peer_state.disk(UpToDate) state.disk(UpToDate)===
Mar  5 22:32:05 node0 kernel: drbd5: rct = 0 in /sandbox/emontros/devel/r1-ipod/platform/drbd/src/dist/BUILD/drbd-8.0.11/drbd/drbd_receiver.c:1937
Mar  5 22:32:05 node0 kernel: drbd5: drbd_sync_handshake:
Mar  5 22:32:05 node0 kernel: drbd5: self B392489356D183BC:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:05 node0 kernel: drbd5: peer B392489356D183BD:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:05 node0 kernel: drbd5: uuid_compare()=0 by rule 4
Mar  5 22:32:05 node0 kernel: drbd5: unexpected cstate (Connected) in receive_bitmap
Mar  5 22:32:06 node0 kernel: drbd5: receive_state:EM--===== oconn(Connected) peer_state.conn(WFBitMapS) peer_state.disk(UpToDate) state.disk(UpToDate)===
Mar  5 22:32:06 node0 kernel: drbd5: rct = 0 in /sandbox/emontros/devel/r1-ipod/platform/drbd/src/dist/BUILD/drbd-8.0.11/drbd/drbd_receiver.c:1937
Mar  5 22:32:06 node0 kernel: drbd5: drbd_sync_handshake:
Mar  5 22:32:06 node0 kernel: drbd5: self B392489356D183BC:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:06 node0 kernel: drbd5: peer B392489356D183BD:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:06 node0 kernel: drbd5: uuid_compare()=0 by rule 4
Mar  5 22:32:06 node0 kernel: drbd5: No resync, but 262127 bits in bitmap!
Mar  5 22:32:06 node0 kernel: drbd5: peer_isp( 1 -> 0 )
Mar  5 22:32:06 node0 kernel: drbd5: aftr_isp( 1 -> 0 )
Mar  5 22:32:07 node0 kernel: drbd5: receive_state:EM--===== oconn(Connected) peer_state.conn(WFBitMapS) peer_state.disk(UpToDate) state.disk(UpToDate)===
Mar  5 22:32:07 node0 kernel: drbd5: rct = 0 in /sandbox/emontros/devel/r1-ipod/platform/drbd/src/dist/BUILD/drbd-8.0.11/drbd/drbd_receiver.c:1937
Mar  5 22:32:07 node0 kernel: drbd5: drbd_sync_handshake:
Mar  5 22:32:07 node0 kernel: drbd5: self B392489356D183BC:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:07 node0 kernel: drbd5: peer B392489356D183BD:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:07 node0 kernel: drbd5: uuid_compare()=0 by rule 4
Mar  5 22:32:08 node0 kernel: drbd5: No resync, but 262127 bits in bitmap!



NODE1:

Mar  5 22:32:09 node1 kernel: drbd5: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Mar  5 22:32:10 node1 kernel: drbd5: Began resync as SyncSource (will sync 17628 KB [4407 bits set]).
Mar  5 22:32:10 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:16 node1 kernel: drbd5: Resync done (total 1 sec; paused 0 sec; 17628 K/sec)
Mar  5 22:32:16 node1 kernel: drbd5:  uuid[History_start] now 71E86A3E2A741DB6
Mar  5 22:32:16 node1 kernel: drbd5:  uuid[Bitmap] now 0000000000000000
Mar  5 22:32:16 node1 kernel: drbd5: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Mar  5 22:32:16 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:16 node1 kernel: drbd5: aftr_isp( 0 -> 1 )
Mar  5 22:32:17 node1 kernel: drbd5: receive_state:EM--===== oconn(Connected) peer_state.conn(PausedSyncT) peer_state.disk(Inconsistent) state.disk(UpToDate)===
Mar  5 22:32:17 node1 kernel: drbd5: drbd_sync_handshake:
Mar  5 22:32:17 node1 kernel: drbd5: self B392489356D183BD:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:17 node1 kernel: drbd5: peer 71E86A3E2A741DB6:0000000000000000:F29410AEC0882499:19FFEC1E2A6CCAAB
Mar  5 22:32:17 node1 kernel: drbd5: uuid_compare()=2 by rule 8
Mar  5 22:32:17 node1 kernel: drbd5: Becoming sync source due to disk states.
Mar  5 22:32:17 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:17 node1 kernel: drbd5: writing of bitmap took 0 jiffies
Mar  5 22:32:17 node1 kernel: drbd5: 1023 MB (262127 bits) marked out-of-sync by on disk bit-map.
Mar  5 22:32:17 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:17 node1 kernel: drbd5: conn( Connected -> WFBitMapS ) pdsk( UpToDate -> Inconsistent ) peer_isp( 0 -> 1 )
Mar  5 22:32:17 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:17 node1 kernel: drbd5: receive_state:EM--===== oconn(WFBitMapS) peer_state.conn(Connected) peer_state.disk(UpToDate) state.disk(UpToDate)===
Mar  5 22:32:17 node1 kernel: drbd5: drbd_sync_handshake:
Mar  5 22:32:17 node1 kernel: drbd5: self B392489356D183BD:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:17 node1 kernel: drbd5: peer 71E86A3E2A741DB6:0000000000000000:F29410AEC0882499:19FFEC1E2A6CCAAB
Mar  5 22:32:18 node1 kernel: drbd5: uuid_compare()=2 by rule 8
Mar  5 22:32:18 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:18 node1 kernel: drbd5: writing of bitmap took 1 jiffies
Mar  5 22:32:18 node1 kernel: drbd5: 1023 MB (262127 bits) marked out-of-sync by on disk bit-map.
Mar  5 22:32:18 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:18 node1 kernel: drbd5: pdsk( Inconsistent -> UpToDate )
Mar  5 22:32:18 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:18 node1 kernel: drbd5: aftr_isp( 1 -> 0 )
Mar  5 22:32:18 node1 kernel: drbd5: receive_state:EM--===== oconn(WFBitMapS) peer_state.conn(Connected) peer_state.disk(UpToDate) state.disk(UpToDate)===
Mar  5 22:32:18 node1 kernel: drbd5: drbd_sync_handshake:
Mar  5 22:32:18 node1 kernel: drbd5: self B392489356D183BD:0000000000000000:71E86A3E2A741DB6:98B2279CF244A665
Mar  5 22:32:18 node1 kernel: drbd5: peer 71E86A3E2A741DB6:0000000000000000:F29410AEC0882499:19FFEC1E2A6CCAAB
Mar  5 22:32:18 node1 kernel: drbd5: uuid_compare()=2 by rule 8
Mar  5 22:32:18 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:18 node1 kernel: drbd5: writing of bitmap took 0 jiffies
Mar  5 22:32:18 node1 kernel: drbd5: 1023 MB (262127 bits) marked out-of-sync by on disk bit-map.
Mar  5 22:32:18 node1 kernel: drbd5: Writing meta data super block now.
Mar  5 22:32:18 node1 kernel: drbd5: peer_isp( 1 -> 0 )





More information about the drbd-dev mailing list