[DRBD-user] one of drbd disk need full resync after reboot

Lars Ellenberg lars.ellenberg at linbit.com
Tue Jun 2 16:15:17 CEST 2009



On Sat, May 30, 2009 at 11:39:53AM +0700, chatchai jantaraprim wrote:
> Hi,
>      Our servers run debian lenny. We recently upgraded the
> kernel on our servers, and with it the drbd module.
> The current kernel version is linux 2.6.26-2-686. When we restarted one
> of the servers for maintenance, one of the drbd disks
> did a full resync. We have seen this twice. For the last one we have this in the syslog
> 
> May 28 13:21:06 nnksvr01 kernel: [  145.948274] drbd: initialised.
> Version: 8.0.14 (api:86/proto:86)

this may or may not be fixed in 8.0.16,
in 8.0.git, or in 8.3(.git).

as you apparently are able to reproduce this,
could you check whether it still happens with those versions?

if you can still reproduce it,
would you be so kind as to give the minimal procedure to reproduce?

> May 28 13:21:06 nnksvr01 kernel: [  145.948274] drbd: GIT-hash:
> bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by phil at fat-tyre,
> 2008-11-12 16:40:33
> May 28 13:21:06 nnksvr01 kernel: [  145.948274] drbd: registered as
> block device major 147
> May 28 13:21:06 nnksvr01 kernel: [  145.948274] drbd: minor_table @ 0xf796e3c0
> May 28 13:21:07 nnksvr01 kernel: [  147.176534] drbd6: disk( Diskless
> -> Attaching )

try to avoid line breaks when pasting log files.
they are horrible to read this way.
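
if the wrapping has already happened, something like the following
could rejoin the records before reading them. this is only a rough
sketch: the function name unwrap_log and the "Mon DD HH:MM:SS"
record-start pattern are assumptions for illustration, not part of
any drbd tooling.

```python
import re

# Assumption: every syslog record starts with "Mon DD HH:MM:SS ".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ")
# Optional mail quote marker (">") plus surrounding whitespace.
QUOTE_PREFIX = re.compile(r"^\s*>?\s*")


def unwrap_log(text):
    """Rejoin syslog records that a mail client wrapped across lines."""
    records = []
    for raw in text.splitlines():
        line = QUOTE_PREFIX.sub("", raw).rstrip()
        if not line:
            continue  # skip empty (or quote-only) lines
        if RECORD_START.match(line) or not records:
            records.append(line)          # a new record begins here
        else:
            records[-1] += " " + line     # continuation of the previous one
    return records


if __name__ == "__main__":
    sample = (
        "> May 28 13:21:07 nnksvr01 kernel: [  147.176534] drbd6: disk( Diskless\n"
        "> -> Attaching )\n"
        "> May 28 13:21:07 nnksvr01 kernel: [  147.176534] drbd6: Starting worker\n"
        "> thread (from cqueue [2609])\n"
    )
    for record in unwrap_log(sample):
        print(record)
```

a continuation line that itself happens to begin with a timestamp
would be split wrongly, so treat the result as a reading aid only.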

> May 28 13:21:07 nnksvr01 kernel: [  147.176534] drbd6: Starting worker
> thread (from cqueue [2609])
> May 28 13:21:07 nnksvr01 kernel: [  147.265232] drbd6: Found 6
> transactions (234 active extents) in activity log.
> May 28 13:21:07 nnksvr01 kernel: [  147.265238] drbd6: Backing
> device's merge_bvec_fn() = f8c9c069
> May 28 13:21:07 nnksvr01 kernel: [  147.265241] drbd6:
> max_segment_size ( = BIO size ) = 4096
> May 28 13:21:07 nnksvr01 kernel: [  147.265244] drbd6: Adjusting my
> ra_pages to backing device's (32 -> 64)
> May 28 13:21:07 nnksvr01 kernel: [  147.265248] drbd6: drbd_bm_resize
> called with capacity == 356181824
> May 28 13:21:07 nnksvr01 kernel: [  147.266925] drbd6: resync bitmap:
> bits=44522728 words=1391336
> May 28 13:21:07 nnksvr01 kernel: [  147.266930] drbd6: size = 170 GB
> (178090912 KB)
> May 28 13:21:07 nnksvr01 kernel: [  147.868100] drbd6: recounting of
> set bits took additional 2 jiffies
> May 28 13:21:07 nnksvr01 kernel: [  147.868107] drbd6: 0 KB (0 bits)
> marked out-of-sync by on disk bit-map.
> May 28 13:21:07 nnksvr01 kernel: [  147.868112] drbd6: disk( Attaching
> -> UpToDate )
> May 28 13:21:07 nnksvr01 kernel: [  147.998236] drbd6: conn(
> StandAlone -> Unconnected )
> May 28 13:21:07 nnksvr01 kernel: [  148.002321] drbd6: Starting
> receiver thread (from drbd6_worker [2667])
> May 28 13:21:07 nnksvr01 kernel: [  148.002362] drbd6: receiver (re)started
> May 28 13:21:07 nnksvr01 kernel: [  148.002362] drbd6: conn(
> Unconnected -> WFConnection )
> May 28 13:21:08 nnksvr01 kernel: [  148.169608] drbd6: Handshake
> successful: DRBD Network Protocol version 86
> May 28 13:21:08 nnksvr01 kernel: [  148.169608] drbd6: conn(
> WFConnection -> WFReportParams )
> May 28 13:21:08 nnksvr01 kernel: [  148.169608] drbd6: Starting
> asender thread (from drbd6_receiver [2693])
> May 28 13:21:08 nnksvr01 kernel: [  148.185007] drbd6: drbd_sync_handshake:
> May 28 13:21:08 nnksvr01 kernel: [  148.185007] drbd6: self
> F7668094BF050518:0000000000000000:E56A0B0B69CA4098:B4B5CF85BFA1686C
> May 28 13:21:08 nnksvr01 kernel: [  148.185007] drbd6: peer
> DA26B785F0FD01C9:F7668094BF050519:E56A0B0B69CA4099:B4B5CF85BFA1686C
> May 28 13:21:08 nnksvr01 kernel: [  148.185007] drbd6:
> uuid_compare()=-1 by rule 5
> May 28 13:21:08 nnksvr01 kernel: [  148.194847] drbd6: peer( Unknown
> -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown ->
> UpToDate )
> May 28 13:21:08 nnksvr01 kernel: [  148.860198] drbd6: conn( WFBitMapT
> -> WFSyncUUID )
> May 28 13:21:08 nnksvr01 kernel: [  148.888532] drbd6: conn(
> WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
> May 28 13:21:08 nnksvr01 kernel: [  148.888542] drbd6: Began resync as
> SyncTarget (will sync 4 KB [1 bits set]).
> May 28 13:21:08 nnksvr01 kernel: [  149.340302] drbd6: conn(
> SyncTarget -> PausedSyncT ) peer_isp( 0 -> 1 )
> May 28 13:21:08 nnksvr01 kernel: [  149.340302] drbd6: Resync suspended
> May 28 13:21:08 nnksvr01 kernel: [  149.346065] drbd6: aftr_isp( 0 -> 1 )
> May 28 13:21:08 nnksvr01 kernel: [  149.408303] drbd6: Resync done
> (total 1 sec; paused 0 sec; 4 K/sec)
> May 28 13:21:08 nnksvr01 kernel: [  149.408303] drbd6: conn(
> PausedSyncT -> Connected ) disk( Inconsistent -> UpToDate )
> May 28 13:21:09 nnksvr01 kernel: [  150.488580] drbd6: unexpected
> cstate (Connected) in receive_bitmap

this may be a race condition in drbd.
we will double-check.

> May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6: drbd_sync_handshake:
> May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6: self
> DA26B785F0FD01C8:0000000000000000:2951B01A84735B7A:F7668094BF050519
> May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6: peer
> DA26B785F0FD01C9:0000000000000000:2951B01A84735B7B:F7668094BF050519
> May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6:
> uuid_compare()=0 by rule 4
> May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6: No resync, but
> 44522728 bits in bitmap!

there.
why do you have that many bits (in fact: all bits) set?
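
to spell out the "all bits" part, here is the arithmetic as a small
sketch, using only the numbers from the log quoted above. the
4 KB-per-bit granularity is an assumption read off the line
"will sync 4 KB [1 bits set]", and the variable names are made up
for illustration.

```python
# All figures are copied from the quoted syslog output.
KB_PER_BIT = 4       # assumption: inferred from "will sync 4 KB [1 bits set]"
SECTOR_BYTES = 512   # capacity is reported in 512-byte sectors

capacity_sectors = 356181824  # "drbd_bm_resize called with capacity == 356181824"
size_kb = 178090912           # "size = 170 GB (178090912 KB)"
bitmap_bits = 44522728        # "resync bitmap: bits=44522728"
flagged_bits = 44522728       # "No resync, but 44522728 bits in bitmap!"


def out_of_sync_fraction(set_bits, total_bits):
    """Fraction of the device that the bitmap marks as out of sync."""
    return set_bits / total_bits


# The reported capacity, size and bitmap length agree with each other.
assert capacity_sectors * SECTOR_BYTES // 1024 == size_kb
assert size_kb // KB_PER_BIT == bitmap_bits

print(out_of_sync_fraction(flagged_bits, bitmap_bits))  # 1.0
print(flagged_bits * KB_PER_BIT)                        # 178090912 (KB)
```

so the number in the warning is exactly the size of the whole bitmap,
which would explain a full resync on the next connect.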

> 
> For some reason we got these
> 
>       May 28 13:21:09 nnksvr01 kernel: [  150.488580] drbd6:
> unexpected cstate (Connected) in receive_bitmap
>       May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6:
> drbd_sync_handshake:
>       May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6: self
> DA26B785F0FD01C8:0000000000000000:2951B01A84735B7A:F7668094BF050519
>       May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6: peer
> DA26B785F0FD01C9:0000000000000000:2951B01A84735B7B:F7668094BF050519
>       May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6:
> uuid_compare()=0 by rule 4
>       May 28 13:21:09 nnksvr01 kernel: [  150.504885] drbd6: No
> resync, but 44522728 bits in bitmap!
> 
> and then when this server rebooted, this drbd needed a full resync.
> The configuration for this drbd6 part is
> ---------------------------------------------------------------------------
> ...
> resource r6 {
>   protocol C;
> 
>   handlers {
>     pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
>     pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
>     local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>     outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>   }
> 
>   startup {
>     degr-wfc-timeout 120;
>   }
> 
>   disk {
>     on-io-error detach;
>     no-disk-flushes;
>     no-md-flushes;
>   }
> 
>   net {
>   }
> 
>   syncer {
>     rate 100M;
>     after "r5";
>     al-extents 257;
>   }
> 
>   on nnksvr01 {
>     device     /dev/drbd6;
>     disk       /dev/md7;
>     address    10.0.0.1:7794;
>     meta-disk  internal;
>   }
> 
>   on nnksvr02 {
>     device    /dev/drbd6;
>     disk      /dev/md7;
>     address   10.0.0.2:7794;
>     meta-disk internal;
>   }
> }

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


