Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed, May 06, 2009 at 04:22:35PM -0700, Mike Sweetser - Adhost wrote:
> Hello:
>
> I'm doing some testing of resizing an LVM-based DRBD partition. I've
> successfully resized the LVM volume, and then resized the DRBD device
> via drbdadm (the resource is named part3):
>
>   drbdadm resize part3
>
> I see the following on the Primary server, and everything is OK:
>
> May 6 16:15:19 SERVER1 kernel: drbd4: drbd_bm_resize called with capacity == 31457280
> May 6 16:15:19 SERVER1 kernel: drbd4: resync bitmap: bits=3932160 words=122880
> May 6 16:15:19 SERVER1 kernel: drbd4: size = 15 GB (15728640 KB)
> May 6 16:15:42 SERVER1 kernel: drbd4: Writing the whole bitmap, size changed
>
> However, I see this on the Secondary server, and it's stuck in
> WFSyncUUID:
>
> May 6 16:15:18 SERVER2 kernel: drbd4: drbd_bm_resize called with capacity == 31457280
> May 6 16:15:18 SERVER2 kernel: drbd4: resync bitmap: bits=3932160 words=122880
> May 6 16:15:18 SERVER2 kernel: drbd4: size = 15 GB (15728640 KB)
> May 6 16:15:18 SERVER2 kernel: drbd4: Writing the whole bitmap, size changed
> May 6 16:15:19 SERVER2 kernel: drbd4: writing of bitmap took 1376 jiffies
> May 6 16:15:19 SERVER2 kernel: drbd4: 10 GB (2621440 bits) marked out-of-sync by on disk bit-map.
> May 6 16:15:19 SERVER2 kernel: drbd4: Writing meta data super block now.
> May 6 16:15:19 SERVER2 kernel: drbd4: No resync, but 2621440 bits in bitmap!
> May 6 16:15:19 SERVER2 kernel: drbd4: bm_set was 2621440, corrected to 2621472. /usr/local/src/drbd-8.2.6/drbd/drbd_receiver.c:2144
> May 6 16:15:19 SERVER2 kernel: drbd4: Resync of new storage after online grow
> May 6 16:15:19 SERVER2 kernel: drbd4: conn( Connected -> WFSyncUUID )
>
> Seven minutes later, it's still in WFSyncUUID on the Secondary.
>
> Am I missing a step? Is something possibly configured wrong on my end?
> Help? :)

There has been an unlikely but possible "wait forever" condition in some
versions of DRBD if the connection or resync handshake happens while
there is IO in flight.

To get out of WFSync*: try drbdadm disconnect, then reconnect. If the
disconnect does not work, cut the TCP connection by other means
(e.g. an iptables reject, or ifdown).

Workaround to make the race impossible:
  drbdadm suspend-io
  do-the-interesting-stuff-here
  drbdadm resume-io

Fix: upgrade to 8.3 ;)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
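For concreteness, a minimal sketch of the recovery and the workaround as
they might apply to the resize above. The resource name part3 is taken
from the post; the replication port 7789 and the 5-second pause are
illustrative assumptions, so substitute the port from your own resource
config:

  # Recovery: kick the resource out of WFSyncUUID with a
  # disconnect/reconnect cycle.
  drbdadm disconnect part3
  drbdadm connect part3

  # If the disconnect itself hangs, break the TCP connection from outside
  # (7789 is an assumed replication port -- check your drbd.conf),
  # then reconnect.
  iptables -I INPUT -p tcp --dport 7789 -j REJECT
  sleep 5        # illustrative pause so DRBD notices the dead link
  iptables -D INPUT -p tcp --dport 7789 -j REJECT
  drbdadm connect part3

  # Workaround for future online grows: quiesce IO around the racy
  # resync handshake.
  drbdadm suspend-io part3
  drbdadm resize part3
  drbdadm resume-io part3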