[DRBD-user] Digest integrity check FAILED - common pattern?

Sat Aug 30 15:03:15 CEST 2008

Hi All,

I have a production DRBD setup running kernel 2.6.15-52-amd64-server
(Ubuntu Dapper Drake 6.06.2) and DRBD 8.2.6.

root@zim:~# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root@zim, 2008-07-13 00:28:31
0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:22777372 dw:22777372 dr:0 al:0 bm:1572 lo:0 pe:0 ua:0 ap:0 oos:0
1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:1117304644 dw:1117304644 dr:0 al:0 bm:194202 lo:0 pe:0 ua:0 ap:0 oos:0

Like a few other users whose list posts I've found, I have been
getting digest integrity check errors since upgrading.

[1280867.991114] drbd1: Digest integrity check FAILED.
[1280867.996373] drbd1: error receiving Data, l: 4140!
[1280868.001604] drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )

The rest is, well, standard. On the other side you see:
[1283426.854266] drbd1: sock was reset by peer
[1283426.854276] drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
[1283426.854285] drbd1: short read expecting header on sock: r=-104
[1283426.861937] drbd1: meta connection shut down by peer.

What I find curious is that this happens continually.

root@zim:/var/log# grep "Digest integrity check FAILED" syslog | wc -l
249

That's spanning from 06:26:48 to 20:54:54 - that's a lot of failures!
The gap between failures is always somewhere between about 30 seconds
and 10 minutes, and it's been doing this every day for a couple of
weeks. On average it's about once every 3 minutes over the last 12
hours.
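
For anyone wanting to measure the spacing between failures rather than
eyeball it, here is a rough sketch. It relies on the kernel timestamp
in square brackets (seconds since boot) seen in the log lines above.
The function name digest_gaps is made up for illustration, and the
sketch assumes the machine was not rebooted within the log, since the
kernel timestamp restarts at boot.

```shell
# Summarise the gaps between "Digest integrity check FAILED" lines.
# Reads the named log files, or stdin when no file is given.
digest_gaps() {
    grep "Digest integrity check FAILED" "$@" |
    # Keep only the kernel [seconds.micros] stamp from each line.
    sed -n 's/.*\[ *\([0-9][0-9]*\.[0-9]*\)\].*/\1/p' |
    awk '
        NR > 1 {
            gap = $1 - prev
            if (n == 0 || gap < min) min = gap
            if (n == 0 || gap > max) max = gap
            sum += gap
            n++
        }
        { prev = $1 }
        END {
            if (n > 0)
                printf "failures=%d min=%.0fs max=%.0fs mean=%.0fs\n", NR, min, max, sum / n
            else
                printf "failures=%d (need at least two to measure a gap)\n", NR
        }'
}
```

Running `digest_gaps /var/log/syslog` prints one summary line with the
failure count and the shortest, longest and mean gap in seconds.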

Now, in all the other posts this has been met with the advice that bit
flips happen due to PCI bus noise, bad memory, etc., and that turning
off offloading can help. But I really wouldn't expect this kind of
hard failure so often from this gear. Before I turned on the integrity
check it had been in production (albeit quiet production) for 2 years
with no issue, and I have run quite a few other DRBD systems for 2+
years now and never really had much in the way of corruption issues.
So I find it a little weird that I'm literally getting corruption on a
very slow system every 5-10 minutes, if not every 30-60 seconds,
extremely consistently. I guess I can see how this kind of thing can
go unnoticed, especially on a quiet system, but going unnoticed for
such a long time with apparently such a high rate of corruption seems
a bit odd.
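
For reference, the "turn off offloading" advice usually comes down to
ethtool. The sketch below is a guess at what that looks like, not a
confirmed fix for this problem. The interface name eth0 and the
IFACE/RUN variables are placeholders; which offload features a given
driver supports varies. By default it only prints the commands.

```shell
# IFACE is a placeholder: use the interface that carries DRBD traffic.
IFACE="${IFACE:-eth0}"
# Dry run by default (commands are echoed). Run with RUN= (empty) as
# root to actually apply them.
RUN="${RUN-echo}"

disable_offload() {
    # Show the current offload settings for the interface.
    $RUN ethtool -k "$1"
    # Turn off rx/tx checksumming, scatter-gather and TCP segmentation.
    $RUN ethtool -K "$1" rx off tx off sg off tso off
}

disable_offload "$IFACE"
```

The change does not survive a reboot, so it is easy to try for a day
and compare the failure rate before making it permanent.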

Has anyone else found this weird? Have there been any reported bugs? I
am using SHA1. I might try changing the algorithm, but yeah... with
quite a few other people having the same -consistent- issue at very
regular intervals, I can't help but think this really must be a bug.
Otherwise things like SSH would surely just break all the time? Am I
missing something here? Or have people really found this to be just
plain dodgy hardware? I can't help but feel there's some packet
padding or something similar going on somewhere that is causing the
upset.

Thanks guys, I would appreciate the input. I'm also interested in
stories of how people -solved- this problem, if they have, as people
don't tend to follow up to the list with that kind of thing once they
fix it!

Regards,
-- 
Trent Lloyd