[DRBD-user] data-integrity-alg in dual-Primary setup

Fri Sep 17 14:40:20 CEST 2010

On Fri, Sep 17, 2010 at 11:22:53AM +0200, Fabrice Charlier wrote:
> Hi,
> 
> Three months ago we deployed a "web cluster" for LAMP hosting. We
> based this solution on DRBD in active/active mode combined with
> OCFS2. This solution meets our needs, but about once a month a
> problem appears: we have enabled the "data-integrity-alg" option to
> (try to) avoid silent data corruption, and several times this
> feature has detected that data was altered in transit between the
> two nodes. As we run active/active nodes and use the automatic
> split-brain recovery policies proposed in the official
> documentation, the two members of the mirror get disconnected and
> we have to resync them manually to continue normal operation. We
> have already disabled all TCP offloading capabilities on all NICs,
> without success.
> 
> Is it possible to ask DRBD to retry sending the block until it
> succeeds in this kind of situation?

No.
It is likely one of those cases where in-flight buffers are changed.
The problem has been known for a long time, but recently it has drawn
some more attention again:
http://lwn.net/Articles/399148/
http://www.spinics.net/lists/linux-scsi/msg44074.html
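
To illustrate what "in-flight buffers are changed" means for a
checksum, here is a minimal sketch in Python. It is not DRBD code: the
function names are hypothetical, and CRC32 merely stands in for
whatever algorithm data-integrity-alg is set to. The assumption is
that the digest is computed over the buffer first and the payload is
copied to the wire later, so a writer that touches the buffer in
between produces a mismatch on the peer even though the network
delivered every byte correctly.

```python
import zlib


def send_with_digest(buf, mutate_in_flight=False):
    """Hypothetical sender: digest first, payload copied to the wire later."""
    digest = zlib.crc32(bytes(buf))
    if mutate_in_flight:
        # An upper layer rewrites the buffer while the write is in flight.
        buf[0] ^= 0xFF
    payload = bytes(buf)  # what actually reaches the peer
    return digest, payload


def receiver_accepts(digest, payload):
    """Hypothetical receiver: recompute the digest over what arrived."""
    return zlib.crc32(payload) == digest


stable = bytearray(b"block contents")
racy = bytearray(b"block contents")

ok_stable = receiver_accepts(*send_with_digest(stable))
ok_racy = receiver_accepts(*send_with_digest(racy, mutate_in_flight=True))

print("stable buffer accepted:", ok_stable)
print("modified buffer accepted:", ok_racy)
```

In this sketch the payload is intact on the wire in both cases; only
the digest is stale in the second one, which would be consistent with
disabling NIC offloading making no difference.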

Maybe disable the DRBD-level checksum (data-integrity-alg), and trust
the TCP checksum instead.

> If not, are you planning to implement this feature?

Maybe.
But it does not have any particular priority.
We may rather wait for the VM and VFS layers to fix it for us.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed