Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,

I've got a problem in my environment. I set up my primary server (Pacemaker + DRBD), which ran alone for a while, and then added the second server (currently DRBD only). Both servers can see each other and /proc/drbd reports "UpToDate/UpToDate".

If I run a verify on that resource (right after the full resync), it reports some blocks out of sync (generally from 100 to 1500 on my 80 GB LVM partition). So I disconnect/connect the secondary, and oos then reports 0 blocks. I run a verify again and some blocks are still out of sync. What I've noticed is that it is almost always the same blocks that are out of sync (the commands for this cycle are sketched at the end of this mail). I tried a full resync multiple times but got the same result. I also tried replacing the physical secondary server with a virtual machine (to check whether the issue came from the secondary server), but had the same issue.

I then activated "data-integrity-alg crc32c" and got a couple of "Digest mismatch, buffer modified by upper layers during write: 167134312s +4096" messages in the primary's log. I tried a different network card but got the same errors.

My full configuration file:

    protocol C;
    meta-disk internal;
    device /dev/drbd0;
    disk /dev/sysvg/drbd;
    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh xxx at xxx";
        out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh xxx at xxx";
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    net {
        cram-hmac-alg "sha1";
        shared-secret "drbd";
        sndbuf-size 512k;
        max-buffers 8000;
        max-epoch-size 8000;
        verify-alg md5;
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        data-integrity-alg crc32c;
    }
    disk {
        al-extents 3389;
        fencing resource-only;
    }
    syncer {
        rate 90M;
    }
    on host1 {
        address 10.110.1.71:7799;
    }
    on host2 {
        address 10.110.1.72:7799;
    }
    }

My OS: Red Hat 6, kernel 2.6.32-431.20.3.el6.x86_64
DRBD version: drbd84-8.4.4-1

    ethtool -k eth0
    Features for eth0:
    rx-checksumming: on
    tx-checksumming: on
    scatter-gather: on
    tcp-segmentation-offload: on
    udp-fragmentation-offload: off
    generic-segmentation-offload: on
    generic-receive-offload: off
    large-receive-offload: off
    ntuple-filters: off
    receive-hashing: off

The secondary server is currently not in the HA cluster (Pacemaker), but I don't think that is the problem. I have another HA pair on 2 physical hosts with the exact same configuration and DRBD/OS versions (but not the same server model), and everything is OK there.

As the primary server is in production, I can't stop the application (a database) to check whether the alerts are false positives.

Would you have any advice? Could it be the primary server that has corrupted blocks or bad metadata?

Regards,
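P.S. For reference, this is roughly the verify / disconnect / connect cycle described above, as run with drbdadm on DRBD 8.4. The resource name "r0" is only a placeholder; substitute the real resource name.

    # Placeholder resource name (the actual name is not shown above)
    RES=r0

    # Start an online verify; progress and the oos: counter show up in /proc/drbd
    drbdadm verify $RES
    cat /proc/drbd

    # After the verify finishes, disconnect/reconnect so DRBD resyncs
    # the blocks the verify marked as out of sync
    drbdadm disconnect $RES
    drbdadm connect $RES

    # Run the verify again; in my case some blocks come back out of sync
    drbdadm verify $RES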