[DRBD-user] Block out of sync right after full resync

aurelien panizza karboxylose at gmail.com
Sun Oct 12 23:43:58 CEST 2014


Hi,

Thank you for your reply.
I'm not using a virtual machine (I used a VM only to check whether I got the
same issue, then went back to the physical server).
Do you suggest I should disable swap on my server? I'm using EXT4.
How can I check that my primary and secondary are synchronized, using a
command like fsck? (online)
I found explanations about data getting modified in flight, but no
workaround or anything on how I can avoid out-of-sync blocks.
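There is no fsck-style online equality check for the two replicas, but after an online `drbdadm verify <resource>` completes, the `oos:` counter in /proc/drbd reports how much data (in KiB) differs between the nodes. A minimal sketch of reading that counter; the sample status line below is an assumption modeled on typical DRBD 8.4 output, not output from this setup:

```shell
#!/bin/sh
# Hypothetical sample of a /proc/drbd statistics line after a verify run.
# On a real node you would read /proc/drbd itself instead of this string.
sample='ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1500'

# Pull the number after "oos:"; a non-zero value means blocks differ.
oos=$(printf '%s\n' "$sample" | sed -n 's/.*oos:\([0-9]*\).*/\1/p')
echo "out-of-sync KiB: $oos"
```

On the live node, the same extraction would run against `cat /proc/drbd`; the verify itself only reads and compares blocks, so it is safe to run while the resource is in use.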



2014-10-10 18:44 GMT+11:00 Lionel Sausin <ls at numerigraphe.com>:

>  "Buffer modified by upper layers during write" means whatever sits on
> top of DRBD changes data "in flight".
> Please search the list archives; this is a FAQ.
> Swap and some file systems do that - usually it's some kind of
> optimization. I suspect VMware VMs hosted on ext4 do that too.
> There's probably nothing wrong, but DRBD can't know. You should do your
> data integrity checking at some higher level (fsck, for example).
> Lionel.
>
> On 10/10/2014 01:45, aurelien panizza wrote:
>
> Hi all,
>
>  I've got a problem in my environment.
> I set up my primary server (Pacemaker + DRBD), which ran alone for a while,
> and then I added the second server (currently DRBD only).
> Both servers can see each other and /proc/drbd reports "UpToDate/UpToDate".
> If I run a verify on that resource (right after the full resync), it
> reports some blocks out of sync (generally from 100 to 1500 on my 80 GB LVM
> partition).
> So I disconnect/reconnect the secondary and oos reports 0 blocks.
> I run the verify again and some blocks are still out of sync. What I've
> noticed is that it is almost always the same blocks that are out of
> sync.
> I tried to do a full resync multiple times but had the same issue.
> I also tried replacing the physical secondary server with a virtual
> machine (in order to check whether the issue came from the secondary
> server) but had the same issue.
>
>  I then activated "data-integrity-alg crc32c" and got a couple of "Digest
> mismatch, buffer modified by upper layers during write: 167134312s +4096"
> messages in the primary's log.
>
>  I tried on a different network card but got the same errors.
>
>  My full configuration file:
>
>    protocol C;
>   meta-disk internal;
>   device /dev/drbd0;
>   disk /dev/sysvg/drbd;
>
>    handlers {
>          split-brain "/usr/lib/drbd/notify-split-brain.sh xxx at xxx";
>          out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh xxx at xxx";
>          fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>          after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>   }
>
>    net {
>          cram-hmac-alg "sha1";
>          shared-secret "drbd";
>          sndbuf-size 512k;
>          max-buffers 8000;
>          max-epoch-size 8000;
>          verify-alg md5;
>          after-sb-0pri disconnect;
>          after-sb-1pri disconnect;
>          after-sb-2pri disconnect;
>          data-integrity-alg crc32c;
>   }
>
>    disk {
>         al-extents 3389;
>         fencing resource-only;
>   }
>
>    syncer {
>         rate 90M;
>   }
>   on host1 {
>         address 10.110.1.71:7799;
>   }
>   on host2 {
>         address 10.110.1.72:7799;
>   }
> }
>
>  My OS: Red Hat 6, kernel 2.6.32-431.20.3.el6.x86_64
> DRBD version: drbd84-8.4.4-1
>
>  ethtool -k eth0
> Features for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: off
> large-receive-offload: off
> ntuple-filters: off
> receive-hashing: off
>
>
>  The secondary server is currently not in the HA cluster (Pacemaker), but
> I don't think this is the problem.
> I have another HA pair on 2 physical hosts with the exact same
> configuration and DRBD/OS versions (but not the same server model) and
> everything's OK.
>
>  As the primary server is in production, I can't stop the application
> (a database) to check whether the alerts are false positives.
>
>  Would you have any advice?
> Could it be the primary server that has corrupted blocks or wrong
> metadata?
>
>  Regards,
>
>
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>
>
>
