[DRBD-user] What causes nodes to become out-of-sync?

Lars Ellenberg lars.ellenberg at linbit.com
Thu Jul 24 15:16:30 CEST 2008



On Wed, Jul 23, 2008 at 12:13:35PM -0700, Jeffrey Froman wrote:
> On Tuesday 22 July 2008 10:54:32 am Lars Ellenberg wrote:
> > > > do you use tcp checksum offloading?
> > > > any other tcp offloading?
> <snip>
> > ethtool -k ethX
> 
> "ethtool -k eth2" output is as follows:
> 
> Offload parameters for eth2:
> Cannot get device udp large send offload settings: Operation not supported
> Cannot get device generic segmentation offload settings: Operation not supported
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: on
> udp fragmentation offload: off
> generic segmentation offload: off
> 
> So it looks like we are using some offloading, and that it's possible 
> to adjust these settings via ethtool.
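A small sketch for checking which offloads `ethtool -k` reports as enabled. It assumes the `name: on|off` layout shown above; newer ethtool versions may append annotations such as `[fixed]`, which this ignores. The helper name `parse_offloads` is hypothetical, not part of ethtool or drbd.

```python
# Hypothetical helper: list which offloads `ethtool -k` reports as enabled.
# Assumes the "name: on|off" layout shown in the message above; trailing
# annotations such as "[fixed]" on newer ethtool versions are ignored.

SAMPLE = """\
Offload parameters for eth2:
Cannot get device udp large send offload settings: Operation not supported
Cannot get device generic segmentation offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
"""


def parse_offloads(text: str) -> dict[str, bool]:
    """Map each offload feature name to True (on) or False (off).

    The header line and "Cannot get device ..." errors carry no on/off
    state after the last colon, so they are skipped.
    """
    features: dict[str, bool] = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        words = value.split()
        state = words[0] if words else ""
        if sep and state in ("on", "off"):
            features[name.strip()] = state == "on"
    return features


if __name__ == "__main__":
    enabled = [name for name, on in parse_offloads(SAMPLE).items() if on]
    print("enabled offloads:", ", ".join(enabled))
```

Anything reported as on can then be changed with `ethtool -K` (capital K), e.g. `ethtool -K eth2 rx off tx off` for the checksum offloads, provided the driver supports toggling them.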
> 
> > the better "end-to-end" your checksums are,
> > the more likely they will detect transfer errors.
> >
> > so we have that data-integrity-alg in drbd, which puts a
> > drbd-to-drbd checksum on the data
> 
> Thanks for the clear and detailed explanation. Do I understand 
> correctly that using data-integrity-alg at the drbd layer is still 
> the most reliable checksum, regardless of tcp offloading? In other 
> words, is there any reason to adjust offloading settings if 
> data-integrity-alg is enabled?
> 
> Also, do I understand correctly that failing hash values cause a 
> packet (or block?) to be invisibly re-sent? Or is some other action 
> required in the event that a hash comparison fails?

whenever drbd "data-integrity-alg" comparison fails,
drbd disconnects, reconnects, and does a resync.

if you get that frequently, you basically have a degraded cluster.
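To make the behaviour above concrete, here is a purely illustrative model of an end-to-end digest check. It is not drbd code: the function names are hypothetical, and `sha1` is only an example digest. The sender digests the data in its own memory, the receiver recomputes over what arrived, and a mismatch leads to the disconnect/reconnect/resync reaction described above (only reported here, not performed).

```python
import hashlib


class IntegrityError(Exception):
    """Raised when the receiver's digest does not match the sender's."""


def send_block(data: bytes, alg: str = "sha1") -> tuple[bytes, bytes]:
    # Sender side: digest the data as it sits in the sender's memory.
    return data, hashlib.new(alg, data).digest()


def receive_block(data: bytes, digest: bytes, alg: str = "sha1") -> bytes:
    # Receiver side: recompute the digest over the data as it arrived.
    if hashlib.new(alg, data).digest() != digest:
        raise IntegrityError("data-integrity check failed")
    return data


def replicate(data: bytes, flip_bit: bool = False) -> str:
    payload, digest = send_block(data)
    if flip_bit:
        # Simulate corruption between the two memories (NIC, bus, driver)
        # that no lower-level checksum caught.
        payload = bytes([payload[0] ^ 0x01]) + payload[1:]
    try:
        receive_block(payload, digest)
        return "ok"
    except IntegrityError:
        # The reaction described in the post: disconnect, reconnect and
        # resync. This model only reports that outcome.
        return "disconnect, reconnect, resync"


if __name__ == "__main__":
    block = b"\x00" * 4096
    print(replicate(block))
    print(replicate(block, flip_bit=True))
```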

if you turn tcp checksum offloading off, the tcp checksum covers the
whole path from in-kernel-memory to in-kernel-memory. any transfer error
in between that the tcp checksum catches (it may be too weak to catch
all of them, but then you have a different problem anyway) causes a
tcp retransmit, which is completely transparent to drbd, so drbd does
not have to disconnect and resync because of it.
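As an illustration of why the tcp checksum "may be too weak", here is a sketch of the 16-bit ones' complement checksum from RFC 1071 that TCP uses, simplified by leaving out the TCP pseudo-header. Because it is a plain sum, reordering 16-bit words does not change it, while a cryptographic digest (SHA-1 here, just as an example) does change.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum (RFC 1071), as used by TCP.

    Simplified: the TCP pseudo-header is left out.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd" + b"payload!"
# Same bytes with the first two 16-bit words swapped: the sum is
# order-independent, so the checksum cannot tell the difference.
reordered = b"\xab\xcd\x12\x34" + b"payload!"

if __name__ == "__main__":
    print(hex(internet_checksum(original)), hex(internet_checksum(reordered)))
    print(hashlib.sha1(original).hexdigest() == hashlib.sha1(reordered).hexdigest())
```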

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please don't Cc me, but send to list -- I'm subscribed
