[DRBD-user] Digest integrity check FAILED - common pattern?

Sun Aug 31 20:30:59 CEST 2008

On Sat, Aug 30, 2008 at 09:03:15PM +0800, Trent Lloyd wrote:
> Hi All,
>
> I have a production DRBD, kernel 2.6.15-52-amd64-server (Ubuntu Dapper
> Drake 6.06.2) and DRBD 8.2.6
>
> root@zim:~# cat /proc/drbd
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root@zim, 2008-07-13 00:28:31
> 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>    ns:0 nr:22777372 dw:22777372 dr:0 al:0 bm:1572 lo:0 pe:0 ua:0 ap:0 oos:0
> 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>    ns:0 nr:1117304644 dw:1117304644 dr:0 al:0 bm:194202 lo:0 pe:0 ua:0 ap:0 oos:0
>
> Like a few other users I've found list posts from, I am getting  
> integrity check errors since upgrading.
>
> [1280867.991114] drbd1: Digest integrity check FAILED.
> [1280867.996373] drbd1: error receiving Data, l: 4140!
> [1280868.001604] drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
>
> And the rest is well, standard.. on the other side you see
> [1283426.854266] drbd1: sock was reset by peer
> [1283426.854276] drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
> [1283426.854285] drbd1: short read expecting header on sock: r=-104
> [1283426.861937] drbd1: meta connection shut down by peer.
>
> What I find curious is that this happens almost continuously.
>
> root@zim:/var/log# grep "Digest integrity check FAILED" syslog|wc -l
> 249
>
> That spans 06:26:48 to 20:54:54 - that's a lot of failures! The gap
> between them is always somewhere between about 30 seconds and 10 minutes,
> and it has been doing this every day for a couple of weeks.  On average
> it's about once every 3 minutes over the last 12 hours.
>
> Now in all the other posts this has been met with the advice that bit
> flips happen due to PCI bus noise, memory, etc., and that turning off
> offloading can help. But I do find it really curious. I just wouldn't
> expect this kind of hard failure so often from the gear. Before I turned
> on this feature it had been in production (albeit quiet production) for
> 2 years with no issue, and I have run quite a few other DRBD systems for
> 2+ years now and never really had much in the way of corruption issues.
> So I find it a little weird that I'm literally getting corruption on a
> very slow system every 5-10 minutes, if not every 30-60 seconds,
> extremely consistently.  I guess I can see how this kind of stuff can go
> unnoticed, especially on a quiet system, but going unnoticed for such a
> long time at apparently such a high rate of corruption seems a bit odd.
>
> Has anyone else found this weird? Have there been any reported bugs? I
> am using SHA1. I might try changing the algorithm, but with quite a few
> other people having the same -consistent- issue at very regular
> intervals, I can't help but think this really must be a bug. Otherwise
> things like SSH would surely just break all the time? Am I missing
> something here? Or have people really found this to be just plain dodgy
> hardware? I can't help but feel there's some packet padding or something
> going on somewhere that is causing the upset.
>
> Thanks guys, would appreciate the input.  Also interested in stories of 
> how people -solved- this problem, if they have.. as people don't tend to 
> follow up to the list with that kinda thing when they fix it!

you probably read the other threads where I explain the various sources
of online verification errors or digest integrity failures.

for digest integrity failures: since the tcp stream has to pass the tcp
checksum as well, the bit flip or other corruption has to happen either
after the digest has been calculated but before the tcp checksum is
calculated (so in, or on the way to, the tcp-checksum offload engine on
the NIC, or in any store-and-forward, and possibly fragmenting, network
component), or after the tcp checksum has been verified but before the
digest is verified.
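
A minimal sketch of that ordering argument in Python. It is not DRBD or
kernel code: the names internet_checksum, flip_bit and payload are made up
for illustration, and the checksum here covers only the payload, whereas a
real tcp checksum also covers the tcp header and a pseudo-header. It only
shows that data corrupted between the two calculations still passes the
checksum while failing the digest.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Sender side: the digest is computed first, over the good data.
payload = bytes(range(256)) * 16  # hypothetical 4096-byte data block
digest = hashlib.sha1(payload).digest()

# Corruption strikes after the digest but before the checksum is calculated.
corrupted = flip_bit(payload, 12345)
tcp_sum = internet_checksum(corrupted)

# Receiver side: the checksum matches the (already corrupt) data,
# but the digest, which was taken earlier, does not.
tcp_ok = internet_checksum(corrupted) == tcp_sum
digest_ok = hashlib.sha1(corrupted).digest() == digest
print("tcp checksum ok:", tcp_ok, "| digest ok:", digest_ok)
```

The same holds in the other direction: a flip after the checksum has been
verified on the receiver, but before the digest is compared, is invisible
to tcp and visible only to the digest.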

it is also quite possible that there is something fishy in the generic
write-out path, occasionally re-using buffers of in-flight data.

and, sure, DRBD may have some problem somewhere.
but all the "assumed to be false positives" I encountered lately
turned out to be real data corruption, and indeed flaky hardware, so
I think we can trust the code until proven otherwise.

just some suggestions to pin down the root cause.
you can try
 * with different nics
 * with a different kernel
 * with a different motherboard

on the node that logs "Digest integrity check FAILED", run

# drbdsetup /dev/drbd0 events -a -u | tee some-log-file

and wait for a few such digest integrity check failed events to happen.
among other things, there now should be hexdumps of the failed requests
in this log.  send me that "some-log-file" in private mail.
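
While that log collects, it may help to see how the failures are spaced.
The following is a rough Python sketch, not part of DRBD: failure_gaps is a
made-up helper, and it assumes the bracketed number in each kernel line is
the seconds-since-boot timestamp, as in the messages quoted above. The
sample lines use invented timestamps.

```python
import re

# Matches kernel lines such as
# "[1280867.991114] drbd1: Digest integrity check FAILED."
FAILED = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(drbd\d+): Digest integrity check FAILED"
)


def failure_gaps(lines):
    """Return {device: [seconds between consecutive failures]}."""
    last_seen = {}
    gaps = {}
    for line in lines:
        match = FAILED.search(line)
        if not match:
            continue
        stamp, device = float(match.group(1)), match.group(2)
        if device in last_seen:
            gaps.setdefault(device, []).append(stamp - last_seen[device])
        last_seen[device] = stamp
    return gaps


# Invented sample data in the format shown in this thread.
sample = [
    "[1000.500000] drbd1: Digest integrity check FAILED.",
    "[1000.600000] drbd1: error receiving Data, l: 4140!",
    "[1180.500000] drbd1: Digest integrity check FAILED.",
    "[1200.000000] drbd0: Digest integrity check FAILED.",
    "[1210.750000] drbd1: Digest integrity check FAILED.",
]
gaps = failure_gaps(sample)
print(gaps)  # {'drbd1': [180.0, 30.25]}
```

To run it against a real log, pass an open file instead of the sample
list, for example the syslog file that was grepped above.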

-- 
: Lars Ellenberg                
: LINBIT HA-Solutions GmbH
: DRBD®/HA support and consulting    http://www.linbit.com

DRBD® and LINBIT® are registered trademarks
of LINBIT Information Technologies GmbH
__
please don't Cc me, but send to list   --   I'm subscribed