Hi All,
I have a production DRBD setup running kernel 2.6.15-52-amd64-server
(Ubuntu Dapper Drake 6.06.2) and DRBD 8.2.6:
root@zim:~# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root@zim, 2008-07-13 00:28:31
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:22777372 dw:22777372 dr:0 al:0 bm:1572 lo:0 pe:0 ua:0 ap:0 oos:0
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:1117304644 dw:1117304644 dr:0 al:0 bm:194202 lo:0 pe:0 ua:0 ap:0 oos:0
Like a few other users whose list posts I've found, I have been getting
integrity check errors since upgrading.
[1280867.991114] drbd1: Digest integrity check FAILED.
[1280867.996373] drbd1: error receiving Data, l: 4140!
[1280868.001604] drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
The rest is, well, standard. On the other side you see:
[1283426.854266] drbd1: sock was reset by peer
[1283426.854276] drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
[1283426.854285] drbd1: short read expecting header on sock: r=-104
[1283426.861937] drbd1: meta connection shut down by peer.
What I find curious is how continually this happens:
root@zim:/var/log# grep "Digest integrity check FAILED" syslog | wc -l
249
That spans from 06:26:48 to 20:54:54 - that's a lot of failures! The
gap between failures is always somewhere between about 30 seconds and
10 minutes, and it has been doing this every day for a couple of
weeks. On average it's about once every 3 minutes over the last 12
hours.
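
For anyone wanting to put numbers on the gaps rather than eyeball
them, here is a rough sketch. It assumes the syslog lines carry the
kernel's bracketed seconds-since-boot timestamp, as in the excerpts
above, and that there was no reboot inside the file. The function
name digest_gaps is made up for illustration.

```shell
# Summarise the gaps between successive digest failures in a log file,
# using the kernel's [seconds.micros] stamp rather than the syslog date.
digest_gaps() {
    grep 'Digest integrity check FAILED' "$1" |
    # keep only the number inside the square brackets
    sed -n 's/.*\[ *\([0-9][0-9]*\.[0-9]*\)].*/\1/p' |
    awk 'NR > 1 { gap = $1 - prev; sum += gap; n++
                  if (n == 1 || gap < min) min = gap
                  if (gap > max) max = gap }
         { prev = $1 }
         END { if (n) printf "failures=%d min=%.0fs max=%.0fs mean=%.0fs\n",
                             NR, min, max, sum / n }'
}

# Usage: digest_gaps /var/log/syslog
```

With fewer than two matching lines it prints nothing, since there is
no gap to measure.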
Now, in all the other posts this has been met with the advice that bit
flips happen due to PCI bus noise, memory and so on, and that turning
off offloading can help. But I do find it really curious. I just
wouldn't expect this kind of hard failure so often from the gear.
Before I turned on this feature it had been in production (albeit quiet
production) for 2 years with no issue, and I have run quite a few other
DRBD systems for 2+ years now and never really had much in the way of
corruption issues. So I find it a little weird that I'm literally
getting corruption on a veerryy slow system every 5-10 minutes, if not
every 30-60 seconds, extremely consistently. I guess I can see how
this kind of thing can go unnoticed, especially on a quiet system, but
going unnoticed for such a long time with apparently such a high rate
of corruption seems a bit odd.
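
For reference, the "turn off offloading" advice usually comes down to
ethtool -K on the replication interface. The sketch below only prints
the commands so they can be reviewed first. The function name and the
interface name eth1 are placeholders, and which features a given
driver and kernel actually support will vary, so treat the list as an
assumption rather than a recipe.

```shell
# Sketch only: print (not run) the ethtool commands that would switch
# off common offload features on a NIC. "eth1" below is a placeholder.
offload_off_cmds() {
    iface="$1"
    for feature in rx tx sg tso; do
        echo "ethtool -K $iface $feature off"
    done
}

offload_off_cmds eth1          # review the output first
# offload_off_cmds eth1 | sh   # then apply as root
```

Running ethtool -k eth1 (lower-case k) before and after shows which
settings the driver actually accepted.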
Has anyone else found this weird? Have there been any reported bugs? I
am using SHA1. I might try changing the algorithm, but yeah.. with
quite a few other people having the same -consistent- issue at very
regular intervals, I can't help but think this really must be a bug.
Otherwise things like SSH would surely just break all the time? Am I
missing something here? Or have people really found this to be just
plain dodgy hardware? I can't help but feel there's some packet padding
or something going on somewhere that's causing the upset.
Thanks guys, would appreciate the input. Also interested in stories
of how people -solved- this problem, if they have.. as people don't
tend to follow up to the list with that kinda thing when they fix it!
Regards,
--
Trent Lloyd