Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,
I've got a problem in my environment.
I set up my primary server (Pacemaker + DRBD), which ran alone for a while,
and then I added the second server (currently only DRBD).
Both servers can see each other and /proc/drbd reports "UpToDate/UpToDate".
If I run a verify on that resource (right after the full resync), it
reports some blocks out of sync (generally from 100 to 1500 on my 80 GB LVM
partition).
So I disconnect/connect the secondary and oos reports 0 blocks.
I run a verify again and some blocks are still out of sync. What I've
noticed is that it is almost always the same blocks that are reported out
of sync.
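For reference, here is roughly the cycle I'm running (just a sketch; I'm
writing "r0" as the resource name here, the real name differs):

  # run an online verify and check the out-of-sync counter
  drbdadm verify r0
  grep oos: /proc/drbd

  # disconnect/reconnect so the blocks flagged by verify get resynced
  drbdadm disconnect r0
  drbdadm connect r0
  grep oos: /proc/drbd   # goes back to oos:0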
I tried to do a full resync multiple times but had the same issue.
I also tried replacing the physical secondary server with a virtual machine
(to check whether the issue came from the secondary server), but had the
same issue.
I then activated "data-integrity-alg crc32c" and got a couple of "Digest
mismatch, buffer modified by upper layers during write: 167134312s +4096"
entries in the primary's log.
I tried a different network card but got the same errors.
My full configuration file:
protocol C;
meta-disk internal;
device /dev/drbd0;
disk /dev/sysvg/drbd;
handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh xxx at xxx";
    out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh xxx at xxx";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
net {
    cram-hmac-alg "sha1";
    shared-secret "drbd";
    sndbuf-size 512k;
    max-buffers 8000;
    max-epoch-size 8000;
    verify-alg md5;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    data-integrity-alg crc32c;
}
disk {
    al-extents 3389;
    fencing resource-only;
}
syncer {
    rate 90M;
}
on host1 {
    address 10.110.1.71:7799;
}
on host2 {
    address 10.110.1.72:7799;
}
}
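When I change the net section (e.g. when I added data-integrity-alg), I
apply it with something like the following on both nodes (sketch, resource
name again assumed to be r0):

  # re-read the config and apply the changed options to the running resource
  drbdadm adjust r0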
My OS: Red Hat 6, kernel 2.6.32-431.20.3.el6.x86_64
DRBD version: drbd84-8.4.4-1
ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: off
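If it helps, I could also try disabling the NIC offloads on the replication
interface. A sketch of what I would run on both nodes (assuming eth0
carries the DRBD traffic and the driver accepts these feature names):

  # temporarily turn off checksum/segmentation offloads
  ethtool -K eth0 tx off sg off tso off gso off
  ethtool -k eth0   # confirm the new settings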
The secondary server is currently not in the HA cluster (Pacemaker), but I
don't think this is the problem.
I have another HA pair on two physical hosts with the exact same
configuration and DRBD/OS versions (but not the same server model), and
everything is OK there.
As the primary server is in production, I can't stop the application (a
database) to check whether the alerts are false positives.
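The closest I can get online is to spot-check one of the reported offsets
on both nodes, roughly like this (a sketch: 167134312 is the sector from
the log message, /dev/sysvg/drbd is the backing device, and with internal
metadata the data offsets on the backing device should match the DRBD
device; the block can of course legitimately change between the two reads
since the database is live):

  # primary: read the 4 KiB block starting at sector 167134312 from the DRBD device
  dd if=/dev/drbd0 bs=512 skip=167134312 count=8 iflag=direct 2>/dev/null | md5sum

  # secondary: read the same offset from the backing device
  dd if=/dev/sysvg/drbd bs=512 skip=167134312 count=8 iflag=direct 2>/dev/null | md5sum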
Would you have any advice?
Could it be the primary server that has corrupted blocks or wrong metadata?
Regards,