[DRBD-user] tons of out-of-sync sectors detected

Lars Ellenberg lars.ellenberg at linbit.com
Wed Jul 30 17:31:30 CEST 2008



On Wed, Jul 30, 2008 at 04:22:02PM +0200, GAUTIER Hervé wrote:
>
> Hi !
>
> Well, can you explain in which context you use the drbdadm verify
> command? As you have created it, you certainly have a way to use it.

where we use the verify feature, e.g. from a cron job during "off hours",
it did not produce any false positives yet.  the (very few) differences
were indeed bit flips, which probably happened on the way from main
memory to storage controller.

did you verify the data by hand, yet?

I mean, when DRBD complains that there are sectors out of sync,
did you read (with O_DIRECT from the lower level storage) and hexdump them,
and compare the results on both nodes?

(how) do they differ?
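
a minimal sketch of that manual check, in Python, assuming Linux and a
512-byte logical sector size (both are assumptions, check your device).
the function names and the device path in the usage note are hypothetical.
the idea is to run the read on each node against the lower level device
for the sectors DRBD reported, then compare the two results byte by byte.
a single bit flip shows up as an XOR mask with exactly one bit set.

```python
import mmap
import os

SECTOR = 512  # assumed logical sector size; O_DIRECT needs aligned offsets


def read_sectors_direct(path, start_sector, count):
    """Read `count` sectors from `path` with O_DIRECT (bypasses the page cache)."""
    length = count * SECTOR
    buf = mmap.mmap(-1, length)  # anonymous mmap is page aligned, as O_DIRECT requires
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.preadv(fd, [buf], start_sector * SECTOR)
    finally:
        os.close(fd)
    return bytes(buf[:got])


def hexdump(data, base=0):
    """Format bytes as 16 per line, each line prefixed with its offset in hex."""
    lines = []
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        lines.append(f"{base + off:08x}  {chunk.hex(' ')}")
    return "\n".join(lines)


def differing_bytes(a, b):
    """Return (offset, xor_mask) for every byte position where a and b differ."""
    return [(i, x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

usage would look roughly like `read_sectors_direct("/dev/sdX", start, count)`
on each node (the path is a placeholder for your backing device), followed
by `differing_bytes(local, remote)` once both dumps are in one place.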

> If a file system may "re-use" an in-flight data buffer even before the
> write has been completed,

don't just jump to conclusions.
that was one possible explanation.
you have to verify any hypothesis,
and support/confirm/refute it, right?

"normal" file systems don't.

as far as I know neither ext3 nor xfs do,
or at least are not supposed to do that.

we don't use any other, currently.

swap code can do that sometimes.

(you can have the disk image of the swap partition of a virtual machine,
e.g. Xen DomU, on DRBD. that is when you notice these.  Don't put your
host swap on DRBD, that is a very bad idea for obvious reasons.)

> is there any file system option to prevent this?

in case your file system is "broken enough" to not care what actually
gets on disk, it is probably "broken enough" to not provide a means of 
switching off that "feature".
("broken enough": well, that is my personal feeling of brokenness...
 does not mean I am right with this judgement...)

I would consider it a bug for any file system to modify in-flight buffers.
It is just that it could happen, apparently.
If they do it on purpose (like the swap code), that is ok, though "unexpected".

If they do it "accidentally", then that is a bug in that file system
or in some part of the generic write out path.
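
To illustrate why a modified in-flight buffer can leave the two nodes
different, here is a toy model in Python. It is not DRBD code: it only
assumes that the local write and the copy for the peer read the same
buffer at two different moments, and `replicate` is a hypothetical name.

```python
def replicate(buf, modify_in_flight=False):
    """Toy model: one write reaches the local disk and the peer at different moments."""
    local_disk = bytes(buf)   # the local submission reads the buffer now
    if modify_in_flight:
        buf[0] ^= 0xFF        # upper layer rewrites the buffer before completion
    peer_disk = bytes(buf)    # the copy for the peer is read a moment later
    return local_disk, peer_disk


clean = replicate(bytearray(b"page contents"))
racy = replicate(bytearray(b"page contents"), modify_in_flight=True)
print(clean[0] == clean[1])  # True: both nodes hold the same data
print(racy[0] == racy[1])    # False: a later verify would flag this block
```

In this model neither copy is "wrong"; they are just snapshots of a
buffer that changed between the two reads.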

Of course it is also possible that you hit some DRBD bug.
There have been races in the early verify code.
There may be more problems.

But as long as _we_ cannot reproduce it, and it is not a substantial
part of the userbase that is complaining, I think it is likely that
there is something "wrong" in your specific setup, which includes the
kernel version used, the file system, and the hardware components.

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please don't Cc me, but send to list -- I'm subscribed


