[DRBD-user] Unable to fix out of sync sectors

Wed Jun 11 20:05:51 CEST 2008

On Wed, Jun 11, 2008 at 06:14:08PM +0200, Norbert Tretkowski wrote:
> On Wednesday, 11.06.2008, at 14:48 +0200, Lars Ellenberg wrote:
> > On Wed, Jun 11, 2008 at 12:57:11PM +0200, Norbert Tretkowski wrote:
> > > What's confusing me here is the second line, which tells me there's an
> > > out of sync block, and the last line, which tells me there were 0 KB
> > > marked out of sync.
> > 
> > which exact version of drbd?
> 
> # cat /proc/drbd 
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by phil at fat-tyre, 2008-05-30 12:59:17
>  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>     ns:373458740 nr:0 dw:23941389 dr:1465870696 al:212652 bm:32715 lo:0 pe:0 ua:0 ap:0 oos:0
> 
> > do you start verify on secondary or on primary?
> 
> On the primary.

then just to get an additional data point,
start it from the secondary.

> > context:
> >     there has been a race in the verify code where it could report
> >     in-flight io as out-of-sync.  once the in-flight io hits the other
> >     node, it gets marked as in sync again. it should be fixed in 8.2.6
> 
> Since I'm using 8.2.6, it shouldn't be my problem.

if you reproducibly get out-of-sync reports for blocks that are in sync
again once the verify is through
  when you start it from the primary,
  but not when you start it from the secondary,
then maybe there is still another race, or the fix for that race
was incomplete or wrong. that is just a possibility.

it may also be a known side effect of the file system.
there are file systems (e.g. reiserfs) that can produce (harmless)
out-of-sync blocks, because they modify in-flight buffers and resubmit
them later. that is very bad practice, but legal as long as the file
system knows what it is doing (i.e. it no longer cares what gets written
to the original location, since it has already relocated that data
somewhere else).
in that case it is likely that drbd online verify finds out-of-sync
blocks, because the local block device and the network replication have
different latencies, so the two nodes can end up writing different
versions of the same buffer.
if these blocks happen to live in a busy part of the device, it is also
likely that the same blocks get written to again later, so they will be
marked as in sync once the verify is done and writes the bitmap.

as an experiment, you may want to do an online verify on a completely
idle device (unmounted, not touched by anything), and compare (using dd,
xxd, whatever) by hand any block reported as out-of-sync.

then see what is different: single bits,
a few 4-byte chunks, a few 8-byte chunks, the whole block?
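
the by-hand comparison could be sketched roughly like this in python.
this is only an illustration, not part of drbd: the helper names
(read_range, diff_summary) are made up here, and it assumes you have
already copied the same byte range from both nodes to one machine
(e.g. as dump files written with dd), or can read both backing devices
from where the script runs.

```python
def read_range(path, offset, length):
    """Read `length` bytes starting at byte `offset` from a file or device.

    `path` may be a dump file or a block device node; both are
    assumptions about how the data was collected.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def diff_summary(a, b):
    """Summarize how two equally sized byte strings differ.

    Reports the number of differing bytes and bits, and which aligned
    4-byte and 8-byte chunks contain at least one differing byte.
    """
    if len(a) != len(b):
        raise ValueError("both inputs must have the same length")

    # byte offsets at which the two buffers differ
    diff_offsets = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

    # total number of flipped bits across the whole range
    diff_bits = sum(bin(x ^ y).count("1") for x, y in zip(a, b))

    def touched_chunks(size):
        # indices of aligned `size`-byte chunks that contain a difference
        return sorted({off // size for off in diff_offsets})

    return {
        "length": len(a),
        "diff_bytes": len(diff_offsets),
        "diff_bits": diff_bits,
        "chunks4": touched_chunks(4),
        "chunks8": touched_chunks(8),
        "first_diff": diff_offsets[0] if diff_offsets else None,
        "last_diff": diff_offsets[-1] if diff_offsets else None,
    }
```

given two dumps of the same range, something like
diff_summary(read_range("node-a.img", 0, 4096), read_range("node-b.img", 0, 4096))
(file names are placeholders) shows at a glance whether only a few bits,
a few 4-byte or 8-byte chunks, or nearly the whole block differ.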

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please don't Cc me, but send to list -- I'm subscribed