On Fri, Feb 25, 2011 at 12:12:15PM +0100, Walter Haidinger wrote:
> I've now replaced the onboard NICs used for the drbd link with PCIe models. The integrity checks still fail every couple of hours.
> This is hardly surprising, though, because I was unable to reproduce any transmission errors other than with drbd.
>
> Is it therefore safe to rule out the network hardware?
>
> > kernel logs, config, and meta data dump are more interesting.
>
> All right. Please tell me if anything else is interesting too.
> Any hints on how to diagnose this problem are highly appreciated!
>
> Please note that the system is otherwise stable, no problems except the
> failed integrity checks of drbd.
So you no longer have any problems/ASSERTs regarding drbd_al_read_log?
> /proc/drbd:
> version: 8.3.10 (api:88/proto:86-96)
> GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by @build.k9, 2011-02-25 09:08:11
> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
> ns:0 nr:9220020 dw:9220016 dr:3065364 al:0 bm:881 lo:1 pe:0 ua:1 ap:0 ep:1 wo:b oos:0
> resync: used:0/61 hits:51 misses:11 starving:0 dirty:0 changed:11
> act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0
>
> drbdadm dump:
> # /etc/drbd.conf
> global { minor-count 16; }
>
> common {
>     net {
>         data-integrity-alg md5;
>         sndbuf-size 1M;
>         rcvbuf-size 1M;
>     }
>     syncer {
>         rate 100M;
>         c-plan-ahead 30;
>         c-fill-target 4k;
>         c-max-rate 120M;
>         c-min-rate 1024;
>         verify-alg sha1;
>         csums-alg sha1;
>     }
> }
>
> # resource md3 on prod1b.k9: not ignored, not stacked
> resource md3 {
>     protocol C;
>     on prod1a.k9 {
>         device /dev/drbd0 minor 0;
>         disk /dev/md3;
>         address ipv4 192.168.10.1:7788;
>         flexible-meta-disk /dev/sys/drbd_meta0;
>     }
>     on prod1b.k9 {
>         device /dev/drbd0 minor 0;
>         disk /dev/md3;
>         address ipv4 192.168.10.2:7788;
>         flexible-meta-disk /dev/sys/drbd_meta0;
>     }
>     net {
>         timeout 100;
>         connect-int 10;
>         ping-int 10;
>         ping-timeout 5;
>         max-buffers 4096;
>         unplug-watermark 2048;
>         max-epoch-size 4096;
>         ko-count 5;
>         cram-hmac-alg sha256;
>         shared-secret secret;
>         after-sb-0pri discard-younger-primary;
>         after-sb-1pri consensus;
>         after-sb-2pri disconnect;
>         rr-conflict disconnect;
>     }
>     disk {
>         on-io-error detach;
>         fencing dont-care;
>     }
>     syncer {
>         al-extents 3389;
>     }
>     startup {
>         wfc-timeout 120;
>         degr-wfc-timeout 30;
>     }
>     handlers {
>         pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
>         pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
>         local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>         fence-peer /usr/sbin/drbd-peer-outdater;
>     }
> }
>
> kernel dmesg output of an error:
> drbd0: Digest integrity check FAILED: 182846680s +4096
> drbd0: error receiving Data, l: 4136!
> drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
Well, what does the other (Primary) side say?
I'd expect it to say
"Digest mismatch, buffer modified by upper layers during write: ..."
If it does not, your link corrupts data.
If it does, well, then that's what happens: the buffer was changed
while the write was still in flight, and the network is not to blame.
(note: this double check on the sending side
was only introduced with 8.3.10)
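
To illustrate the reasoning above, here is a rough sketch of how such a
double check separates the two cases. This is not DRBD code; the names
digest() and classify() are made up for the illustration, and md5 is used
only because the config above sets data-integrity-alg md5.

```python
import hashlib


def digest(buf):
    """Checksum of a data block (md5, matching data-integrity-alg above)."""
    return hashlib.md5(bytes(buf)).digest()


def classify(sent_digest, sender_buf_after_send, received_payload):
    """Hypothetical helper: decide where a digest mismatch came from.

    sent_digest            digest the sender computed before sending
    sender_buf_after_send  sender's buffer, re-read after the send
    received_payload       the bytes the receiver actually got
    """
    if digest(received_payload) == sent_digest:
        return "ok"
    if digest(sender_buf_after_send) != sent_digest:
        # Sender-side recheck also fails: the block changed under us.
        return "buffer modified during write"
    # Sender's buffer still matches, so the bytes changed on the way.
    return "corrupted in transit"


# Case 1: an upper layer modifies the buffer after the digest was taken
# but before the payload goes out on the wire.
buf = bytearray(4096)
d = digest(buf)
buf[0] ^= 0xFF
wire = bytes(buf)
print(classify(d, buf, wire))

# Case 2: the buffer is stable, but one bit flips on the link.
buf = bytearray(4096)
d = digest(buf)
wire = bytearray(buf)
wire[100] ^= 0x01
print(classify(d, buf, wire))
```

In the first case both sides see a mismatch, in the second only the
receiver does, which is why the log on the Primary is the interesting one.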
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed