Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Lars,

thanks for your replies. After your first message, I pushed our
supplier to replace the complete hardware again (except the disks
themselves), but without any improvement - except that they probably
hate me now and think in a PEBKAC direction :-I.

Lars Ellenberg <lars.ellenberg at ...> writes:

> Anyways, nothing you can "tune away" in DRBD.
> Data _is_ changing.

I am wondering why only a few sectors change every time, even when I
do loads of write operations (> 100 GByte) before starting the verify
again.

> if it is changing when it already reached the secondary,
> but is not yet on local disk,

Could this happen when the secondary is under IO pressure?

> or happens to fool the crc,

I tried all of the csum algorithms, same symptom.

> -> "silent" data diversion, detected on next verify run.

Just to get the impact straight: at the moment when I switch roles
and shut down the secondary afterwards, I will most likely have some
corrupted data somewhere on the disk, right? *urgh*

> This is expected: the "digest" is calculated over the data packets,
> which naturally flow from primary to secondary.

What does the "l:" number mean? Are those bytes?

[  4514.958814] block drbd1: Digest integrity check FAILED.
[  4514.958814] block drbd1: error receiving Data, l: 20508!
[ 37004.824665] block drbd1: Digest integrity check FAILED.
[ 37004.824722] block drbd1: error receiving Data, l: 4124!
[116754.075758] block drbd1: Digest integrity check FAILED.
[116754.075811] block drbd1: error receiving Data, l: 4136!

> not block- but byte-level changes, random bit patterns,
> no obvious pattern or grouping

Those devices are ext3 volumes used by Xen domUs. Would it help to
find out which files the sectors belong to? Maybe this gives a hint.
(See the debugfs sketch at the end of this mail for how I would try
that.)

> to detect things that fooled the data-integrity-check, you should use a
> different alg for verify. to detect things that fooled ("collided") the
> csums alg, you should best have integrity, verify, and csums all
> different.

Did that:

    data-integrity-alg md5;
    verify-alg sha1;
    csums-alg crc32c;

Same behavior. BTW: the current rate of digest messages is about one
every 3 hours. Shouldn't this be much higher if this were a
hardware-related problem?

> What is the usage pattern?

domUs (Debian inside, basically LAMP stacks with some extra stuff)
with two devices each (one for the system, one for data). The digest
and oos errors in combination only appear on the system devices. I
tried to force the digest error on the data devices by doing heavy IO
(tiobench, dd, ...) but failed.

> If that's all in the _swap_ of the xen domU's, this may even be "legal".

Since I don't want to do live migration, swap partitions are not on
DRBD devices. The only purpose of this setup is to reduce recovery
time in case of a complete system loss (fatal hardware error etc.).
Because of the large amount of data (approx. 2 TB), restoring the
complete set from an offsite backup connected via fibre would take
almost 32 hours.

Thanks for your help and ideas.

Regards,
Henning
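PS: For the archives, this is the verify cycle I run to re-check the
devices. The resource name "r0" and the minor number are just
examples; adapt them to your setup:

    # kick off an online verify of resource r0 (example name)
    drbdadm verify r0

    # watch the oos counter in /proc/drbd while the verify runs
    # (the line after " 1:" carries ns/nr/dw/dr/.../oos)
    watch -n 60 'grep -A1 "^ 1:" /proc/drbd'

    # a verify only marks out-of-sync sectors, it does not fix them;
    # disconnecting and reconnecting triggers the resync of the
    # marked sectors
    drbdadm disconnect r0
    drbdadm connect r0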
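PPS: And this is how I would try to map the sector offsets that the
kernel logs during verify back to files on the ext3 volume. Sector
number, block size, and device name below are made up for the
example; you also have to subtract any partition offset if the
filesystem does not start at sector 0 of the DRBD device:

    # DRBD reports 512-byte sectors; assuming a 4k ext3 block size,
    # divide the sector number by 8 to get the filesystem block
    # (example: sector 1048576 -> block 131072)

    # map the block to an inode (read-only, safe on a mounted fs)
    debugfs -R "icheck 131072" /dev/vg0/domu1-system

    # map the inode (say 12345) back to a path
    debugfs -R "ncheck 12345" /dev/vg0/domu1-system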