[DRBD-user] Digest integrity check FAILED (was: ASSERT FAILED: drbd_al_read_log)

Walter Haidinger walter.haidinger at gmx.at
Tue Feb 15 17:41:30 CET 2011


Well, I've updated both nodes to kernel 2.6.35.11 and DRBD 8.3.10, running on CentOS 5.5.

The good news:
I did a create-md and a full resync on the secondary node.
No more ASSERTs logged. :-)

The bad news:
The integrity checks still fail. Here are just today's log entries:
Feb 15 02:49:35 drbd0: Digest integrity check FAILED: 713210072s +4096
Feb 15 06:16:33 drbd0: Digest integrity check FAILED: 713142808s +4096
Feb 15 07:10:39 drbd0: Digest integrity check FAILED: 713049088s +4096
Feb 15 08:47:22 drbd0: Digest integrity check FAILED: 713119656s +4096
Feb 15 09:15:24 drbd0: Digest integrity check FAILED: 713215448s +4096
Feb 15 10:11:01 drbd0: Digest integrity check FAILED: 713232072s +4096
Feb 15 11:12:44 drbd0: Digest integrity check FAILED: 713239944s +4096
Feb 15 11:30:40 drbd0: Digest integrity check FAILED: 713106328s +4096
Feb 15 11:36:22 drbd0: Digest integrity check FAILED: 713151800s +4096
Feb 15 11:40:22 drbd0: Digest integrity check FAILED: 713166384s +4096
Feb 15 13:55:41 drbd0: Digest integrity check FAILED: 713138680s +4096
Feb 15 15:11:14 drbd0: Digest integrity check FAILED: 713189472s +4096
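
To see whether the failing sectors cluster in one region of the device, the
log lines can be parsed with a few lines of Python. This is only a rough
sketch: the regular expression is derived from the messages above, and the
512-byte sector size behind the "s" suffix is an assumption.

```python
import re

# Log format copied from the messages above: "<sector>s +<size in bytes>".
LINE_RE = re.compile(r"Digest integrity check FAILED: (\d+)s \+(\d+)")

SECTOR_BYTES = 512  # assumption: the "s" suffix counts 512-byte sectors


def parse_failures(lines):
    """Return (start_sector, size_bytes) for each matching log line."""
    failures = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            failures.append((int(match.group(1)), int(match.group(2))))
    return failures


def sector_span(failures):
    """Return (lowest sector, highest sector, span in MiB)."""
    sectors = [sector for sector, _ in failures]
    low, high = min(sectors), max(sectors)
    return low, high, (high - low) * SECTOR_BYTES / 2**20


# Three of today's entries, including the lowest and highest sector logged.
sample = [
    "Feb 15 02:49:35 drbd0: Digest integrity check FAILED: 713210072s +4096",
    "Feb 15 07:10:39 drbd0: Digest integrity check FAILED: 713049088s +4096",
    "Feb 15 11:12:44 drbd0: Digest integrity check FAILED: 713239944s +4096",
]
low, high, span_mib = sector_span(parse_failures(sample))
print(f"sectors {low}..{high}, span {span_mib:.1f} MiB")
```

Feeding the complete syslog through parse_failures() instead of the sample
list would show whether older failures fall into the same sector range.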

Searching the list shows the possible reasons:
http://lists.linbit.com/pipermail/drbd-user/2008-January/008343.html

 - bit flip (in either sha1 or data) on the way from main memory to NIC
   (which would go undetected by tcp checksum when you have offloading
   enabled)
 - bit flip on the way from NIC to main memory (the same)
 - any form of corruption due to a race condition or bug
   in NIC firmware or driver 
 - bit flip/random corruption by some reassembling network component
   along the way
   (not in your case, as I understand you use a direct passive link)
 - the application (when using direct-io),
   respectively the file system, re-using (modifying) the write buffer
   while it is in flight, without waiting for the write to complete first
   (unlikely, but we start to believe that we may have evidence
    this does indeed happen under certain circumstances)
 - bug in drbd miscalculating stuff
   (would show up more often)
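
The "buffer modified while in flight" case from the list above can be
illustrated with a toy model. This is not DRBD code, and the function names
are made up for the illustration. It only shows why a digest taken before
the buffer changes no longer matches the data that is actually sent:

```python
import hashlib


def send_with_digest(buffer, modify_in_flight):
    """Toy model: the digest is taken first, the payload is copied out later."""
    digest = hashlib.sha1(buffer).hexdigest()  # sender computes the digest
    if modify_in_flight:
        buffer[0] ^= 0x01  # the buffer owner flips one bit before the copy
    payload = bytes(buffer)  # what actually goes onto the wire
    return digest, payload


def receiver_check(digest, payload):
    """Receiver recomputes the digest over the data it received."""
    return hashlib.sha1(payload).hexdigest() == digest


block = bytearray(4096)  # same size as the failing requests in the log
print(receiver_check(*send_with_digest(bytearray(block), False)))
print(receiver_check(*send_with_digest(bytearray(block), True)))
```

The first check passes and the second fails, although nothing was corrupted
by the network in either case.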

Now, I can probably rule out NIC problems: I transferred a couple of TiB
using nc (over plain TCP, just like DRBD) in both directions, and
every sha1 checksum of the 100 GiB test file matched.
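
For reference, the checksum comparison in such a transfer test can be done
with a chunked sha1 like the sketch below. The helper name and the chunk
size are arbitrary choices for illustration. The result should be the same
hex digest that sha1sum prints.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a 100 GiB file never sits in RAM."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run it on the source file and on the received copy, then compare the two
hex strings.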

The NICs are directly connected by a crossover-cable, so no switch involved.

This leaves just the last two possibilities, right?
How can I test or debug them further?

If you need any information regarding my setup, please let me know.

Regards,
Walter


