Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi all,
I've got a problem in my environment.
I set up my primary server (Pacemaker + DRBD), which ran alone for a while,
and then I added the second server (currently only DRBD).
Both servers can see each other and /proc/drbd reports "UpToDate/UpToDate".
If I run a verify on that resource (right after the full resync), it
reports some blocks out of sync (generally from 100 to 1500 on my 80 GB LVM
partition).
So I disconnect/connect the secondary and oos reports 0 blocks.
I run a verify again and some blocks are still out of sync. What I've
noticed is that it is almost always the same blocks that are reported out
of sync.
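For reference, here is roughly the cycle I'm running (just a sketch; I'm
writing "r0" as the resource name here, the real name differs):

  # run an online verify and check the out-of-sync counter
  drbdadm verify r0
  grep oos: /proc/drbd

  # disconnect/reconnect so the blocks flagged by verify get resynced
  drbdadm disconnect r0
  drbdadm connect r0
  grep oos: /proc/drbd   # goes back to oos:0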
I tried to do a full resync multiple times but had the same issue.
I also tried replacing the physical secondary server with a virtual machine
(to check whether the issue came from the secondary server), but had the
same issue.
I then activated "data-integrity-alg crc32c" and got a couple of "Digest
mismatch, buffer modified by upper layers during write: 167134312s +4096"
entries in the primary's log.
I tried a different network card but got the same errors.
My full configuration file:
protocol C;
meta-disk internal;
device /dev/drbd0;
disk /dev/sysvg/drbd;
handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh xxx at xxx";
    out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh xxx at xxx";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}
net {
    cram-hmac-alg "sha1";
    shared-secret "drbd";
    sndbuf-size 512k;
    max-buffers 8000;
    max-epoch-size 8000;
    verify-alg md5;
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    data-integrity-alg crc32c;
}
disk {
    al-extents 3389;
    fencing resource-only;
}
syncer {
    rate 90M;
}
on host1 {
    address 10.110.1.71:7799;
}
on host2 {
    address 10.110.1.72:7799;
}
}
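When I change the net section (e.g. when I added data-integrity-alg), I
apply it with something like the following on both nodes (sketch, resource
name again assumed to be r0):

  # re-read the config and apply the changed options to the running resource
  drbdadm adjust r0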
My OS: Red Hat 6, kernel 2.6.32-431.20.3.el6.x86_64
DRBD version: drbd84-8.4.4-1
ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: off
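If it helps, I could also try disabling the NIC offloads on the replication
interface. A sketch of what I would run on both nodes (assuming eth0
carries the DRBD traffic and the driver accepts these feature names):

  # temporarily turn off checksum/segmentation offloads
  ethtool -K eth0 tx off sg off tso off gso off
  ethtool -k eth0   # confirm the new settings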
The secondary server is currently not in the HA cluster (Pacemaker), but I
don't think this is the problem.
I have another HA pair on two physical hosts with the exact same
configuration and DRBD/OS versions (but not the same server model), and
everything is OK there.
As the primary server is in production, I can't stop the application (a
database) to check whether the alerts are false positives.
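The closest I can get online is to spot-check one of the reported offsets
on both nodes, roughly like this (a sketch: 167134312 is the sector from
the log message, /dev/sysvg/drbd is the backing device, and with internal
metadata the data offsets on the backing device should match the DRBD
device; the block can of course legitimately change between the two reads
since the database is live):

  # primary: read the 4 KiB block starting at sector 167134312 from the DRBD device
  dd if=/dev/drbd0 bs=512 skip=167134312 count=8 iflag=direct 2>/dev/null | md5sum

  # secondary: read the same offset from the backing device
  dd if=/dev/sysvg/drbd bs=512 skip=167134312 count=8 iflag=direct 2>/dev/null | md5sum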
Would you have any advice?
Could it be the primary server that has corrupted blocks or wrong metadata?
Regards,