[DRBD-user] Digest integrity check FAILED. Broken NICs? (DRBD 8.2.4)

Lars Ellenberg lars.ellenberg at linbit.com
Tue Jan 22 20:19:56 CET 2008



On Tue, Jan 22, 2008 at 05:56:55PM +0000, Paul Court wrote:
> Hello,
> 
> I'll give a bit of background info first, so please forgive my ramblings.
> 
> I am testing a DRBD/MySQL database setup on a pair of Dell PowerEdge 2850s.
> 
> A few weeks ago I set them up with Ubuntu 7.10 server (32bit), MySQL,
> DRBD, and Heartbeat. However, I had an "OOM Killer" problem which I cannot
> identify so I stripped the machines down and have reinstalled both with 
> Ubuntu 7.10 (64bit), MySQL and DRBD(8.2.4) (No heartbeat yet - just 
> adding one thing at a time).
> 
> This afternoon I noticed my mysqld_safe process was hogging the CPU so I 
> killed it off and restarted it, but while I was digging around in the 
> logs I found that DRBD has disconnected itself twice today! The Primary 
> just logs the disconnection, but on the Secondary the disconnection 
> seems to start with these two entries:
> 
> ---
> [435735.425149] drbd0: Digest integrity check FAILED. Broken NICs?
> [435735.425190] drbd0: error receiving Data, l: 4140!
> ---
> 
> The servers are connected with a Gigabit cable directly between the two 
> nics (not cross-over, I understand the gigabit spec includes auto 
> crossover "magic"!). The cable seems in good physical health.
> 
> I have attached some files: the /var/log/messages and syslog from both
> Primary and Secondary; my drbd.conf; and a few repetitions of /proc/drbd
> after I manually disconnected, generated a 1GB file on Primary and then
> reconnected, to test the sync speed. I work it out to be 88 MB per second,
> which I think is quite good, and which suggests the network is otherwise
> working OK.
> 
> It seems to take less than a second for the whole
> error/disconnect/reconnect/resync cycle. Nonetheless, this still seems
> like a bug to me. Can anyone help me in tracking down the problem?

you asked drbd to enable the "data-integrity feature",
which prepends to each data block its digest (you configured sha1,
which is overkill here; md5 or even crc32 would do fine) before
sending it over the wire.
the receiving side then calculates a digest of that data block
using the same algorithm, and naturally this re-calculated digest
and the digest transferred with the data block should match exactly.

if they don't, you see this message:
> Jan 22 16:03:43 mysql-02 kernel: [435735.425149] drbd0: Digest integrity check FAILED. Broken NICs?

what happens is:
 1)  [orig.DATA]
 2)  [orig.sha1][orig.DATA]
 3)  tcp checksum calculation, transmission
 4)  packet is received, tcp checksum is fine
 5)   [received sha1][received DATA]
 6)   [re-calculated sha1 of (received DATA)] does not match [received sha1]
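
the steps above can be sketched roughly as follows, in Python.
this is only an illustration under stated assumptions: the names
send_block, receive_block and flip_bit are made up for this example,
the tcp layer (steps 3 and 4) is left out, and the layout shown is
not drbd's actual wire format or code.

```python
import hashlib

DIGEST_LEN = hashlib.sha1().digest_size  # 20 bytes for sha1


def send_block(data: bytes) -> bytes:
    """Steps 1-2: prepend the digest of the data block to the block."""
    return hashlib.sha1(data).digest() + data


def receive_block(packet: bytes) -> bytes:
    """Steps 5-6: split off the received digest, re-calculate, compare."""
    received_digest = packet[:DIGEST_LEN]
    received_data = packet[DIGEST_LEN:]
    if hashlib.sha1(received_data).digest() != received_digest:
        raise ValueError("Digest integrity check FAILED")
    return received_data


def flip_bit(buf: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Hypothetical helper: return a copy of buf with one bit flipped."""
    corrupted = bytearray(buf)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)
```

a flipped bit anywhere in the packet, whether in the digest part or in
the data part, makes receive_block raise; an unmodified packet passes
and yields the original data.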

this can happen if either orig.DATA or orig.sha1 is modified after
orig.sha1 was calculated.
examples given below have all been observed in real life.

 - bit flip (in either sha1 or data) on the way from main memory to NIC
   (which would go undetected by tcp checksum when you have offloading
   enabled)
 - bit flip on the way from NIC to main memory
   (likewise undetected by tcp checksum when offloading is enabled)
 - any form of corruption due to a race condition or bug
   in NIC firmware or driver 
 - bit flip/random corruption by some reassembling network component
   along the way
   (not in your case, as I understand you use a direct passive link)
 - the application (when using direct-io),
   respectively the file system, re-using (modifying) the write buffer
   while it is in flight, without waiting for the write to complete first
   (unlikely, but we are starting to believe that we may have evidence
    this does indeed happen under certain circumstances)
 - bug in drbd miscalculating stuff
   (would show up more often)
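
to illustrate why the tcp checksum does not help in several of these
cases, here is a small Python sketch, again only under assumptions:
simulate and internet_checksum are hypothetical names, the checksum
is the plain 16-bit ones' complement sum from RFC 1071 without the tcp
pseudo-header, and "offloading" is modelled simply by computing the
checksum after the buffer may already have changed.

```python
import hashlib
from typing import Tuple


def internet_checksum(payload: bytes) -> int:
    """16-bit ones' complement checksum (RFC 1071 style, no pseudo-header)."""
    if len(payload) % 2:
        payload += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(payload), 2):
        total += (payload[i] << 8) | payload[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def simulate(modify_in_flight: bool) -> Tuple[bool, bool]:
    """Return (tcp_checksum_ok, digest_ok) as seen by the receiver."""
    buf = bytearray(b"A" * 4096)
    digest = hashlib.sha1(buf).digest()  # digest is calculated first
    if modify_in_flight:
        buf[10] ^= 0x01  # buffer changes after the digest was taken
    packet = digest + bytes(buf)
    # Assumption: checksum is computed late (as with offloading),
    # so it covers the already modified bytes.
    checksum = internet_checksum(packet)

    # Receiver side: both checks run on what actually arrived.
    tcp_ok = internet_checksum(packet) == checksum
    digest_ok = hashlib.sha1(packet[20:]).digest() == packet[:20]
    return tcp_ok, digest_ok
```

with modify_in_flight=True the checksum still verifies, because it was
computed over the modified bytes, while the digest comparison fails.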

in any case, drbd recognizes that the data received on the Secondary
is not the data originally handed to it on the Primary side,
complains, disconnects, reconnects, resyncs, done.

this is not a bug but a feature.
and it just ensured your data integrity a number of times.

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.


