Hi All,

I have a production DRBD setup on kernel 2.6.15-52-amd64-server (Ubuntu Dapper Drake 6.06.2) with DRBD 8.2.6:

root@zim:~# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root@zim, 2008-07-13 00:28:31
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:22777372 dw:22777372 dr:0 al:0 bm:1572 lo:0 pe:0 ua:0 ap:0 oos:0
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
    ns:0 nr:1117304644 dw:1117304644 dr:0 al:0 bm:194202 lo:0 pe:0 ua:0 ap:0 oos:0

Like a few other users whose list posts I've found, I have been getting integrity check errors since upgrading.

[1280867.991114] drbd1: Digest integrity check FAILED.
[1280867.996373] drbd1: error receiving Data, l: 4140!
[1280868.001604] drbd1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )

And the rest is, well, standard. On the other side you see:

[1283426.854266] drbd1: sock was reset by peer
[1283426.854276] drbd1: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
[1283426.854285] drbd1: short read expecting header on sock: r=-104
[1283426.861937] drbd1: meta connection shut down by peer.

What I find curious is that this happens continually.

root@zim:/var/log# grep "Digest integrity check FAILED" syslog|wc -l
249

That spans from 06:26:48 to 20:54:54, which is a lot of failures! The gap between them is always somewhere between about 30 seconds and 10 minutes, and it has been doing this every day for a couple of weeks. On average it's about once every 3 minutes over the last 12 hours.

Now, in all the other posts this has been met with the advice that bit flips happen due to PCI bus noise, memory, etc., and that turning off offloading can help. But I do find it really curious. I just wouldn't expect this kind of hard failure so often from the gear, and before turning on this feature it had been in production (albeit quiet production) for 2 years with no issue. I also run quite a few other DRBD systems, for 2+ years now, and have never really had much in the way of corruption issues. So I find it a little weird that I'm literally getting corruption on a very slow system every 5-10 minutes, if not every 30-60 seconds, extremely consistently.

I guess I can see how this kind of stuff can go unnoticed, especially on a quiet system, but going unnoticed for such a long time at apparently such a high rate of corruption seems a bit odd.

Has anyone else found this weird? Have there been any reported bugs? I am using SHA1. I might try changing the algorithm, but with quite a few other people having the same -consistent- issue at very regular intervals, I can't help but think this really must be a bug. Otherwise things like SSH would surely just break all the time? Am I missing something here? Or have people really found this to be just plain dodgy hardware? I can't help but feel there's some packet padding or something going on somewhere that is causing the upset.

Thanks guys, would appreciate the input. I'm also interested in stories of how people -solved- this problem, if they have, as people don't tend to follow up to the list with that kind of thing once they fix it!

Regards,
--
Trent Lloyd
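For anyone wanting to check the failure rate quoted above against their own logs, here is a rough Python sketch. It only uses the figures given in the post (249 failures between 06:26:48 and 20:54:54) and the kernel log line format shown there. The function names and the regular expression are illustrative assumptions, not part of DRBD or any of its tools.

```python
import re
from datetime import datetime

# Figures quoted in the post: 249 failures between 06:26:48 and 20:54:54.
FAILURES = 249
FIRST = "06:26:48"
LAST = "20:54:54"


def mean_interval_seconds(first: str, last: str, count: int) -> float:
    """Average gap in seconds between `count` events on the same day."""
    fmt = "%H:%M:%S"
    span = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    # `count` events are separated by `count - 1` gaps.
    return span.total_seconds() / (count - 1)


# Assumed pattern for lines such as
# "[1280867.991114] drbd1: Digest integrity check FAILED."
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+drbd\d+: Digest integrity check FAILED")


def failure_gaps(lines):
    """Seconds between consecutive failures, taken from the kernel timestamps."""
    stamps = [float(m.group(1)) for m in map(STAMP.search, lines) if m]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


print(f"{mean_interval_seconds(FIRST, LAST, FAILURES) / 60:.1f} min")  # 3.5 min
```

Feeding the dmesg or syslog lines into failure_gaps() gives the individual gaps, which makes it easier to see whether the failures cluster around busy periods or are spread evenly, as the 30 second to 10 minute range suggests.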