[DRBD-user] DRBD sync starts over again and again

Lars Ellenberg lars.ellenberg at linbit.com
Sat Dec 29 15:18:32 CET 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Sat, Dec 29, 2007 at 10:44:17AM +0100, Harald Rinker wrote:
> Hello Lars,

please avoid html mail (at least on mailing lists).
and top posting is not a good habbit either.

> I use Debian Etch Standard Kernel k713 2.6.18-4-amd64 and k641 2.6.18-5-686
> and first i used drbd8 from backports now 8.0.8 from linbit website compiled as
> module with the kernel headers

in that case, 
 it could be a bug in drbd
   somebody else should have noticed, there are lots of people using
   that kernel, including us, with varying configurations.
 it could be bad communication (broken nic/driver/ram)
   what nic? driver?
     lspci ; ethtool -i 
   offload settings?
     ethtool -k eth0 # or whatever eth you use for drbd traffic
   I suggest to switch off any offloading, then try again:
     ethtool -K eth0 tso off rx off tx off
   when was your last memtest?

> It is possible to use different kernel architecture so far amd64 and i386?

yes. no known problem with connecting i386 and amd64,
we have such clusters in operation ourselves.

> and there you trigger some assertions. "interessting" :)  I don´t understand
> 
> k713 is the primary one and is in production as nfs server
> on k641 i stopped the drbd /etc/init.d/drbd stop
> and restart it with /etc/init.d/drbd start
> 
> the resulting syslog follows
> 
> k641
> Dec 29 10:27:07 k641 kernel: drbd: initialised. Version: 8.0.8 (api:86/proto:86)
> Dec 29 10:27:07 k641 kernel: drbd: GIT-hash: bd3e2c922f95c4fa0dca57a4f8c24bf8b249cc02 build by root at k641, 2007-12-28 01:
> 07:54
> Dec 29 10:27:07 k641 kernel: drbd: registered as block device major 147
> Dec 29 10:27:07 k641 kernel: drbd: minor_table @ 0xc2363740
> Dec 29 10:27:07 k641 kernel: drbd0: disk( Diskless -> Attaching )
> Dec 29 10:27:07 k641 kernel: klogd 1.4.1, ---------- state change ----------
> Dec 29 10:27:07 k641 kernel: drbd0: No usable activity log found.
> Dec 29 10:27:07 k641 kernel: drbd0: max_segment_size ( = BIO size ) = 32768


> Dec 29 10:27:07 k641 kernel: drbd0: drbd_bm_resize called with capacity == 3145631928
> Dec 29 10:27:07 k641 kernel: drbd0: resync bitmap: bits=393203991 words=12287626
> Dec 29 10:27:07 k641 kernel: drbd0: size = 1499 GB (1572815964 KB)

just to verify, please find these three lines
for the other node as well.


> Dec 29 10:27:07 k641 kernel: drbd0: reading of bitmap took 46 jiffies
> Dec 29 10:27:07 k641 kernel: drbd0: recounting of set bits took additional 19 jiffies
> Dec 29 10:27:07 k641 kernel: drbd0: 735 GB (192889111 bits) marked out-of-sync by on disk bit-map.
> Dec 29 10:27:07 k641 kernel: drbd0: disk( Attaching -> Inconsistent )
> Dec 29 10:27:07 k641 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:07 k641 kernel: drbd0: conn( StandAlone -> Unconnected )
> Dec 29 10:27:07 k641 kernel: drbd0: receiver (re)started
> Dec 29 10:27:07 k641 kernel: drbd0: conn( Unconnected -> WFConnection )
> Dec 29 10:27:07 k641 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> Dec 29 10:27:07 k641 kernel: drbd0: Peer authenticated using 32 bytes of 'sha256' HMAC
> Dec 29 10:27:07 k641 kernel: drbd0: conn( WFConnection -> WFReportParams )
> Dec 29 10:27:07 k641 kernel: drbd0: Becoming sync target due to disk states.
> Dec 29 10:27:07 k641 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpT
> oDate )
> Dec 29 10:27:07 k641 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:08 k641 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
> Dec 29 10:27:08 k641 kernel: drbd0: conn( WFSyncUUID -> SyncTarget )
> Dec 29 10:27:08 k641 kernel: drbd0: Began resync as SyncTarget (will sync 812948580 KB [203237145 bits set]).
> Dec 29 10:27:08 k641 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:09 k641 kernel: drbd0: magic?? on data m: 0x0 c: 0 l: 0


> Dec 29 10:27:09 k641 kernel: drbd0: ASSERT( mdev->state.conn < Connected ) in /usr/src/drbd-8.0.8/drbd/drbd_receiver.c:2
> 703
> Dec 29 10:27:09 k641 kernel: drbd0: ASSERT( mdev->state.conn < Connected ) in /usr/src/drbd-8.0.8/drbd/drbd_receiver.c:3
> 436
> Dec 29 10:27:09 k641 kernel: drbd0: asender terminated
> Dec 29 10:27:10 k641 kernel: drbd0: logic bug? rs_left=203236788 > rs_total=0 (rs_failed 0)
> Dec 29 10:27:10 k641 last message repeated 7 times
> Dec 29 10:27:10 k641 kernel: drbd0: resync inactive, but callback triggered??

whoho, we don't like to be interrupted during resync that early.
but these are all harmless, though we should improve on the
cleanup timing, and make it less noisy.

and we aparently even made some progress!
started with 203237145 bits set, and have 203236788 left now,
i.e. we aparently successfully synced 1428 kB.  according to the next
"Began resync" line, "in sync" was successfully communicated back for at
least 1092 kB.

but then we received some packet we consider invalid,
because of "unexpected" content.
your last post contained "all ones",
this syslog excerpt contains "all zeros" were drbd expects
a standard "magic ; command nr ; payload length" header.
this suggest data corruption during transmission.

you definetly should check your network cards for known trouble,
(firmware updates, eeprom hacks etc.).
and you should memcheck, just in case.

> Dec 29 10:27:10 k641 kernel: drbd0: tl_clear()
> Dec 29 10:27:10 k641 kernel: drbd0: Connection closed
> Dec 29 10:27:10 k641 kernel: drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> Unconnected ) pdsk( UpToDate -> DUnkn
> own )
> Dec 29 10:27:10 k641 kernel: drbd0: drbdd_init: (mdev->receiver.t_state != Restarting) in /usr/src/drbd-8.0.8/drbd/drbd_
> receiver.c:3117
> Dec 29 10:27:10 k641 kernel: drbd0: receiver terminated
> Dec 29 10:27:10 k641 kernel: drbd0: receiver (re)started
> Dec 29 10:27:10 k641 kernel: drbd0: conn( Unconnected -> WFConnection )
> Dec 29 10:27:10 k641 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> Dec 29 10:27:10 k641 kernel: drbd0: Peer authenticated using 32 bytes of 'sha256' HMAC
> Dec 29 10:27:10 k641 kernel: drbd0: conn( WFConnection -> WFReportParams )
> Dec 29 10:27:10 k641 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:10 k641 kernel: drbd0: Becoming sync target due to disk states.
> Dec 29 10:27:10 k641 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpT
> oDate )
> Dec 29 10:27:10 k641 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:11 k641 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
> Dec 29 10:27:11 k641 kernel: drbd0: conn( WFSyncUUID -> SyncTarget )
> Dec 29 10:27:11 k641 kernel: drbd0: Began resync as SyncTarget (will sync 812947488 KB [203236872 bits set]).
> Dec 29 10:27:11 k641 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:14 k641 kernel: drbd0: magic?? on data m: 0x0 c: 0 l: 0
> Dec 29 10:27:14 k641 kernel: drbd0: ASSERT( mdev->state.conn < Connected ) in /usr/src/drbd-8.0.8/drbd/drbd_receiver.c:2
> 703
> Dec 29 10:27:14 k641 kernel: drbd0: ASSERT( mdev->state.conn < Connected ) in /usr/src/drbd-8.0.8/drbd/drbd_receiver.c:3
> 436
> Dec 29 10:27:14 k641 kernel: drbd0: asender terminated
> Dec 29 10:27:14 k641 kernel: drbd0: logic bug? rs_left=203233107 > rs_total=0 (rs_failed 0)
> Dec 29 10:27:14 k641 last message repeated 4 times
> Dec 29 10:27:14 k641 kernel: drbd0: resync inactive, but callback triggered??
> Dec 29 10:27:14 k641 kernel: drbd0: tl_clear()
> Dec 29 10:27:14 k641 kernel: drbd0: Connection closed
> Dec 29 10:27:14 k641 kernel: drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> Unconnected ) pdsk( UpToDate -> DUnkn
> own )
> Dec 29 10:27:14 k641 kernel: drbd0: drbdd_init: (mdev->receiver.t_state != Restarting) in /usr/src/drbd-8.0.8/drbd/drbd_
> receiver.c:3117
> 
> 
> 
> 
> k713
> Dec 29 10:27:06 k713 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> Dec 29 10:27:06 k713 kernel: drbd0: Peer authenticated using 32 bytes of 'sha256' HMAC
> Dec 29 10:27:06 k713 kernel: drbd0: conn( WFConnection -> WFReportParams )
> Dec 29 10:27:06 k713 kernel: drbd0: Becoming sync source due to disk states.
> Dec 29 10:27:06 k713 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
> Dec 29 10:27:06 k713 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:07 k713 kernel: drbd0: conn( WFBitMapS -> SyncSource )
> Dec 29 10:27:07 k713 kernel: drbd0: Began resync as SyncSource (will sync 812948580 KB [203237145 bits set]).
> Dec 29 10:27:07 k713 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:08 k713 kernel: drbd0: meta connection shut down by peer.
> Dec 29 10:27:08 k713 kernel: drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
> Dec 29 10:27:08 k713 kernel: drbd0: asender terminated
> Dec 29 10:27:08 k713 kernel: drbd0: sock was reset by peer
> Dec 29 10:27:08 k713 kernel: drbd0: short read expecting header on sock: r=-104
> Dec 29 10:27:08 k713 kernel: drbd0: tl_clear()
> Dec 29 10:27:08 k713 kernel: drbd0: Connection closed
> Dec 29 10:27:08 k713 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:08 k713 kernel: drbd0: conn( NetworkFailure -> Unconnected )
> Dec 29 10:27:08 k713 kernel: drbd0: receiver terminated
> Dec 29 10:27:08 k713 kernel: drbd0: receiver (re)started
> Dec 29 10:27:08 k713 kernel: drbd0: conn( Unconnected -> WFConnection )
> Dec 29 10:27:08 k713 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
> Dec 29 10:27:08 k713 kernel: drbd0: Peer authenticated using 32 bytes of 'sha256' HMAC
> Dec 29 10:27:08 k713 kernel: drbd0: conn( WFConnection -> WFReportParams )
> Dec 29 10:27:08 k713 kernel: drbd0: Becoming sync source due to disk states.
> Dec 29 10:27:08 k713 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
> Dec 29 10:27:09 k713 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:10 k713 kernel: drbd0: conn( WFBitMapS -> SyncSource )
> Dec 29 10:27:10 k713 kernel: drbd0: Began resync as SyncSource (will sync 812947488 KB [203236872 bits set]).
> Dec 29 10:27:10 k713 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:12 k713 kernel: drbd0: meta connection shut down by peer.
> Dec 29 10:27:12 k713 kernel: drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
> Dec 29 10:27:12 k713 kernel: drbd0: asender terminated
> Dec 29 10:27:12 k713 kernel: drbd0: sock was reset by peer
> Dec 29 10:27:12 k713 kernel: drbd0: short read expecting header on sock: r=-104
> Dec 29 10:27:12 k713 kernel: drbd0: tl_clear()
> Dec 29 10:27:12 k713 kernel: drbd0: Connection closed
> Dec 29 10:27:12 k713 kernel: drbd0: Writing meta data super block now.
> Dec 29 10:27:13 k713 kernel: drbd0: conn( NetworkFailure -> Unconnected )
> Dec 29 10:27:13 k713 kernel: drbd0: receiver terminated


-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :



More information about the drbd-user mailing list