<div dir="ltr">Hi all,<div><br></div><div>I&#39;ve got a problem in my environment.</div><div>I set up my primary server (pacemaker + DRBD), which ran alone for a while, and then I added the second server (currently DRBD only).</div><div>Both servers can see each other and /proc/drbd reports &quot;uptodate/uptodate&quot;.</div><div>If I run a verify on that resource (right after the full resync), it reports some blocks out of sync (generally 100 to 1500 on my 80 GB LVM partition).</div><div>So I disconnect and reconnect the secondary, and oos then reports 0 blocks.</div><div>I run a verify again and some blocks are still out of sync. What I&#39;ve noticed is that it is almost always the same blocks that are out of sync.</div><div>I tried a full resync multiple times but had the same issue.</div><div>I also tried replacing the physical secondary server with a virtual machine (to check whether the issue came from the secondary server) but had the same issue.</div><div><br></div><div>I then activated &quot;data-integrity-alg crc32c&quot; and got a couple of &quot;Digest mismatch, buffer modified by upper layers during write: 167134312s +4096&quot; messages in the primary log.</div><div><br></div><div>I tried a different network card but got the same errors.</div><div><br></div><div>My full configuration file:</div><div><br></div><div>  protocol C;</div><div>  meta-disk internal;</div><div>  device /dev/drbd0;</div><div>  disk /dev/sysvg/drbd;</div><div><br></div><div>  handlers {</div><div>         split-brain &quot;/usr/lib/drbd/notify-split-brain.sh xxx@xxx&quot;;</div><div>         out-of-sync &quot;/usr/lib/drbd/notify-out-of-sync.sh xxx@xxx&quot;;</div><div>         fence-peer &quot;/usr/lib/drbd/crm-fence-peer.sh&quot;;</div><div>         after-resync-target &quot;/usr/lib/drbd/crm-unfence-peer.sh&quot;;</div><div>  }</div><div><br></div><div>  net {</div><div>         cram-hmac-alg &quot;sha1&quot;;</div><div>         shared-secret &quot;drbd&quot;;</div><div>   
      sndbuf-size 512k;</div><div>         max-buffers 8000;</div><div>         max-epoch-size 8000;</div><div>         verify-alg md5;</div><div>         after-sb-0pri disconnect;</div><div>         after-sb-1pri disconnect;</div><div>         after-sb-2pri disconnect;</div><div>         data-integrity-alg crc32c;</div><div>  }</div><div><br></div><div>  disk {</div><div>        al-extents 3389;</div><div>        fencing resource-only;</div><div>  }</div><div><br></div><div>  syncer {</div><div>        rate 90M;</div><div>  }</div><div>  on host1 {</div><div>        address 10.110.1.71:7799;</div><div>  }</div><div>  on host2 {</div><div>        address 10.110.1.72:7799;</div><div>  }</div><div>}</div><div><br></div><div>My OS: Redhat6 2.6.32-431.20.3.el6.x86_64</div><div>DRBD version: drbd84-8.4.4-1</div><div><br></div><div><div>ethtool -k eth0</div><div>Features for eth0:</div><div>rx-checksumming: on</div><div>tx-checksumming: on</div><div>scatter-gather: on</div><div>tcp-segmentation-offload: on</div><div>udp-fragmentation-offload: off</div><div>generic-segmentation-offload: on</div><div>generic-receive-offload: off</div><div>large-receive-offload: off</div><div>ntuple-filters: off</div><div>receive-hashing: off</div></div><div><br></div><div><br></div><div>The secondary server is currently not part of the HA cluster (pacemaker), but I don&#39;t think this is the problem.</div><div>I have another HA cluster on 2 physical hosts with the exact same configuration and DRBD/OS versions (but not the same server model), and everything&#39;s OK there.</div><div><br></div><div>As the primary server is in production, I can&#39;t stop the application (a database) to check whether the alerts are false positives.</div><div><br></div><div>Would you have any advice?</div><div>Could it be that the primary server has corrupted blocks or wrong metadata?</div><div><br></div><div>Regards,</div><div><br></div></div>