On Fri, Feb 25, 2011 at 12:12:15PM +0100, Walter Haidinger wrote:
> I've now replaced the onboard NICs used for the drbd link with PCIe
> models. The integrity checks still fail every couple of hours.
> This is hardly surprising, though, because I was unable to reproduce
> any transmission errors other than with drbd.
>
> Is it therefore safe to rule out the network hardware?
>
> > kernel logs, config, and meta data dump are more interesting.
>
> All right. Please tell me if anything else is interesting too.
> Any hints regarding how to diagnose this problem are highly appreciated!
>
> Please note that the system is otherwise stable, no problems except the
> failed integrity checks of drbd.

So you no longer have any problems/ASSERTs regarding drbd_al_read_log?

> /proc/drbd:
> version: 8.3.10 (api:88/proto:86-96)
> GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by @build.k9, 2011-02-25 09:08:11
>  0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>     ns:0 nr:9220020 dw:9220016 dr:3065364 al:0 bm:881 lo:1 pe:0 ua:1 ap:0 ep:1 wo:b oos:0
>         resync: used:0/61 hits:51 misses:11 starving:0 dirty:0 changed:11
>         act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>
> drbdadm dump:
> # /etc/drbd.conf
> global { minor-count 16; }
>
> common {
>     net {
>         data-integrity-alg md5;
>         sndbuf-size      1M;
>         rcvbuf-size      1M;
>     }
>     syncer {
>         rate             100M;
>         c-plan-ahead     30;
>         c-fill-target    4k;
>         c-max-rate       120M;
>         c-min-rate       1024;
>         verify-alg       sha1;
>         csums-alg        sha1;
>     }
> }
>
> # resource md3 on prod1b.k9: not ignored, not stacked
> resource md3 {
>     protocol C;
>     on prod1a.k9 {
>         device           /dev/drbd0 minor 0;
>         disk             /dev/md3;
>         address          ipv4 192.168.10.1:7788;
>         flexible-meta-disk /dev/sys/drbd_meta0;
>     }
>     on prod1b.k9 {
>         device           /dev/drbd0 minor 0;
>         disk             /dev/md3;
>         address          ipv4 192.168.10.2:7788;
>         flexible-meta-disk /dev/sys/drbd_meta0;
>     }
>     net {
>         timeout          100;
>         connect-int      10;
>         ping-int         10;
>         ping-timeout     5;
>         max-buffers      4096;
>         unplug-watermark 2048;
>         max-epoch-size   4096;
>         ko-count         5;
>         cram-hmac-alg    sha256;
>         shared-secret    secret;
>         after-sb-0pri    discard-younger-primary;
>         after-sb-1pri    consensus;
>         after-sb-2pri    disconnect;
>         rr-conflict      disconnect;
>     }
>     disk {
>         on-io-error      detach;
>         fencing          dont-care;
>     }
>     syncer {
>         al-extents       3389;
>     }
>     startup {
>         wfc-timeout      120;
>         degr-wfc-timeout 30;
>     }
>     handlers {
>         pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
>         pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
>         local-io-error   "echo o > /proc/sysrq-trigger ; halt -f";
>         fence-peer       /usr/sbin/drbd-peer-outdater;
>     }
>
> kernel dmesg output of an error:
> drbd0: Digest integrity check FAILED: 182846680s +4096
> drbd0: error receiving Data, l: 4136!
> drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )

Well, what does the other (Primary) side say?

I'd expect it to say
"Digest mismatch, buffer modified by upper layers during write: ..."

If it does not, your link corrupts data.
If it does, well, then that's what happens.

(note: this double check on the sending side has only been introduced
with 8.3.10)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed
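
The reasoning in the reply above (check whether the Primary also logs a digest mismatch, and blame the link only if it does not) can be sketched roughly as follows. This is a minimal illustration in Python, not DRBD's kernel code: the function name `classify_digest_failure` and its three buffer arguments are hypothetical, and md5 is used only because the quoted config sets `data-integrity-alg md5`.

```python
import hashlib


def classify_digest_failure(before_send: bytes,
                            after_send: bytes,
                            received: bytes,
                            alg: str = "md5") -> str:
    """Illustrative triage of a failed integrity check (hypothetical helper).

    before_send: block contents when the sender computed the digest
    after_send:  the same buffer re-read after the send completed
    received:    the payload as seen by the receiving node
    """
    sent_digest = hashlib.new(alg, before_send).digest()
    recheck_digest = hashlib.new(alg, after_send).digest()
    received_digest = hashlib.new(alg, received).digest()

    # Receiver-side check: the payload matches the digest that came with it.
    if received_digest == sent_digest:
        return "ok"

    # Sender-side double check: the buffer no longer matches its own digest,
    # so something above the replication layer changed it while in flight.
    if recheck_digest != sent_digest:
        return "buffer modified during write"

    # The sender's buffer is unchanged, yet the receiver saw different data.
    return "corrupted in transit"


if __name__ == "__main__":
    block = b"A" * 4096
    changed = b"B" + block[1:]
    print(classify_digest_failure(block, block, block))
    print(classify_digest_failure(block, changed, changed))
    print(classify_digest_failure(block, block, changed))
```

The three calls in the usage section cover the three outcomes: a clean transfer, a buffer changed by upper layers during the write, and a payload altered somewhere between the two nodes.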