Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Lars,

thanks for your replies. After your first message, I pushed our
supplier to replace the complete hardware again (except the disks
themselves), but without any improvement - except that they probably
hate me now and think in a PEBKAC direction :-I.

Lars Ellenberg <lars.ellenberg at ...> writes:

> Anyways, nothing you can "tune away" in DRBD.
> Data _is_ changing.

I am wondering why only a few sectors change every time, even when I
do loads of write operations (> 100 GByte) before starting the verify
again.

> if it is changing when it already reached the secondary,
> but is not yet on local disk,

Could this happen when the secondary is under IO pressure?

> or happens to fool the crc,

I tried all of the csum algorithms, same symptom.

> -> "silent" data diversion, detected on next verify run.

Just to get the impact straight: at the moment when I switch roles
and shut down the secondary afterwards, I will most likely have some
corrupted data somewhere on the disk, right? *urgh*

> This is expected: the "digest" is calculated over the data packets,
> which naturally flow from primary to secondary.

What does the "l:" number mean? Are those bytes?

[  4514.958814] block drbd1: Digest integrity check FAILED.
[  4514.958814] block drbd1: error receiving Data, l: 20508!
[ 37004.824665] block drbd1: Digest integrity check FAILED.
[ 37004.824722] block drbd1: error receiving Data, l: 4124!
[116754.075758] block drbd1: Digest integrity check FAILED.
[116754.075811] block drbd1: error receiving Data, l: 4136!

> not block- but byte-level changes, random bit patterns,
> no obvious pattern or grouping

Those devices are ext3 volumes used by Xen domUs. Would it help to
find out which files the sectors belong to? Maybe this gives a hint.
(See the debugfs sketch at the end of this mail for how I would try
that.)

> to detect things that fooled the data-integrity-check, you should use a
> different alg for verify. to detect things that fooled ("collided") the
> csums alg, you should best have integrity, verify, and csums all
> different.

Did that:

    data-integrity-alg md5;
    verify-alg sha1;
    csums-alg crc32c;

Same behavior. BTW: the current rate of digest messages is about one
every 3 hours. Shouldn't this be much higher if this were a
hardware-related problem?

> What is the usage pattern?

domUs (Debian inside, basically LAMP stacks with some extra stuff)
with two devices each (one for the system, one for data). The digest
and oos errors in combination only appear on the system devices. I
tried to force the digest error on the data devices by doing heavy IO
(tiobench, dd, ...) but failed.

> If that's all in the _swap_ of the xen domU's, this may even be "legal".

Since I don't want to do live migration, swap partitions are not on
DRBD devices. The only purpose of this setup is to reduce recovery
time in case of a complete system loss (fatal hardware error etc.).
Because of the large amount of data (approx. 2 TB), restoring the
complete set from an offsite backup connected via fibre would take
almost 32 hours.

Thanks for your help and ideas.

Regards,
Henning
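PS: For the archives, this is the verify cycle I run to re-check the
devices. The resource name "r0" and the minor number are just
examples; adapt them to your setup:

    # kick off an online verify of resource r0 (example name)
    drbdadm verify r0

    # watch the oos counter in /proc/drbd while the verify runs
    # (the line after " 1:" carries ns/nr/dw/dr/.../oos)
    watch -n 60 'grep -A1 "^ 1:" /proc/drbd'

    # a verify only marks out-of-sync sectors, it does not fix them;
    # disconnecting and reconnecting triggers the resync of the
    # marked sectors
    drbdadm disconnect r0
    drbdadm connect r0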
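PPS: And this is how I would try to map the sector offsets that the
kernel logs during verify back to files on the ext3 volume. Sector
number, block size, and device name below are made up for the
example; you also have to subtract any partition offset if the
filesystem does not start at sector 0 of the DRBD device:

    # DRBD reports 512-byte sectors; assuming a 4k ext3 block size,
    # divide the sector number by 8 to get the filesystem block
    # (example: sector 1048576 -> block 131072)

    # map the block to an inode (read-only, safe on a mounted fs)
    debugfs -R "icheck 131072" /dev/vg0/domu1-system

    # map the inode (say 12345) back to a path
    debugfs -R "ncheck 12345" /dev/vg0/domu1-system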