I've now replaced the onboard NICs used for the drbd link with PCIe models.
The integrity checks still fail every couple of hours. This is hardly
surprising, though, because I was unable to reproduce any transmission
errors other than with drbd. Is it therefore safe to rule out the network
hardware?

> kernel logs, config, and meta data dump are more interesting.

All right. Please tell me if anything else is interesting too. Any hints on
how to diagnose this problem are highly appreciated! Please note that the
system is otherwise stable; there are no problems except the failed
integrity checks of drbd.

/proc/drbd:

version: 8.3.10 (api:88/proto:86-96)
GIT-hash: 5c0b0469666682443d4785d90a2c603378f9017b build by @build.k9, 2011-02-25 09:08:11
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:9220020 dw:9220016 dr:3065364 al:0 bm:881 lo:1 pe:0 ua:1 ap:0 ep:1 wo:b oos:0
        resync: used:0/61 hits:51 misses:11 starving:0 dirty:0 changed:11
        act_log: used:0/3389 hits:0 misses:0 starving:0 dirty:0 changed:0

drbdadm dump:

# /etc/drbd.conf
global {
    minor-count 16;
}

common {
    net {
        data-integrity-alg md5;
        sndbuf-size      1M;
        rcvbuf-size      1M;
    }
    syncer {
        rate             100M;
        c-plan-ahead     30;
        c-fill-target    4k;
        c-max-rate       120M;
        c-min-rate       1024;
        verify-alg       sha1;
        csums-alg        sha1;
    }
}

# resource md3 on prod1b.k9: not ignored, not stacked
resource md3 {
    protocol C;
    on prod1a.k9 {
        device           /dev/drbd0 minor 0;
        disk             /dev/md3;
        address          ipv4 192.168.10.1:7788;
        flexible-meta-disk /dev/sys/drbd_meta0;
    }
    on prod1b.k9 {
        device           /dev/drbd0 minor 0;
        disk             /dev/md3;
        address          ipv4 192.168.10.2:7788;
        flexible-meta-disk /dev/sys/drbd_meta0;
    }
    net {
        timeout          100;
        connect-int      10;
        ping-int         10;
        ping-timeout     5;
        max-buffers      4096;
        unplug-watermark 2048;
        max-epoch-size   4096;
        ko-count         5;
        cram-hmac-alg    sha256;
        shared-secret    secret;
        after-sb-0pri    discard-younger-primary;
        after-sb-1pri    consensus;
        after-sb-2pri    disconnect;
        rr-conflict      disconnect;
    }
    disk {
        on-io-error      detach;
        fencing          dont-care;
    }
    syncer {
        al-extents       3389;
    }
    startup {
        wfc-timeout      120;
        degr-wfc-timeout 30;
    }
    handlers {
        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
        local-io-error   "echo o > /proc/sysrq-trigger ; halt -f";
        fence-peer       /usr/sbin/drbd-peer-outdater;
    }
}

kernel dmesg output of an error:

drbd0: Digest integrity check FAILED: 182846680s +4096
drbd0: error receiving Data, l: 4136!
drbd0: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
drbd0: asender terminated
drbd0: Terminating asender thread
drbd0: Connection closed
drbd0: conn( ProtocolError -> Unconnected )
drbd0: receiver terminated
drbd0: Restarting receiver thread
drbd0: receiver (re)started
drbd0: conn( Unconnected -> WFConnection )
drbd0: Handshake successful: Agreed network protocol version 96
drbd0: Peer authenticated using 32 bytes of 'sha256' HMAC
drbd0: conn( WFConnection -> WFReportParams )
drbd0: Starting asender thread (from drbd0_receiver [5679])
drbd0: data-integrity-alg: md5
drbd0: max BIO size = 130560
drbd0: drbd_sync_handshake:
drbd0: self 232D95BBCD88356C:0000000000000000:9FD19CF528E7A53A:9FD09CF528E7A53B bits:0 flags:0
drbd0: peer 5C46D84FC9C15C7D:232D95BBCD88356D:9FD19CF528E7A53B:9FD09CF528E7A53B bits:1 flags:0
drbd0: uuid_compare()=-1 by rule 50
drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
drbd0: conn( WFBitMapT -> WFSyncUUID )
drbd0: updated sync uuid 232E95BBCD88356C:0000000000000000:9FD19CF528E7A53A:9FD09CF528E7A53B
drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
drbd0: Began resync as SyncTarget (will sync 736 KB [184 bits set]).
drbd0: Resync done (total 1 sec; paused 0 sec; 736 K/sec)
drbd0: 0 % had equal check sums, eliminated: 0K; transferred 736K total 736K
drbd0: updated UUIDs 5C46D84FC9C15C7C:0000000000000000:232E95BBCD88356C:232D95BBCD88356D
drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
drbd0: helper command: /sbin/drbdadm after-resync-target minor-0
drbd0: helper command: /sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
drbd0: bitmap WRITE of 5924 pages took 13 jiffies
drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.

#drbdadm dump-md md3
# DRBD meta data dump
# 2011-02-25 11:58:33 +0100 [1298631513]
# prod1b.k9> drbdmeta 0 v08 /dev/sys/drbd_meta0 flex-external dump-md
#
version "v08";

# md_size_sect 139264
# md_offset 0
# al_offset 4096
# bm_offset 36864

uuid {
    0x5C46D84FC9C15C7C; 0x0000000000000000; 0x232E95BBCD88356C; 0x232D95BBCD88356D;
    flags 0x00000011;
}
# al-extents 3389;
la-size-sect 1555043584;
bm-byte-per-bit 4096;
device-uuid 0x95962A5B877A5C33;
# bm-bytes 24297560;
bm {
    # at 0kB
    3037248 times 0x0000000000000000;
}
# bits-set 0;

Last but not least, the system configuration:

Two nodes, identical hardware, running as a simple active/passive
heartbeat v1 cluster (no CRM).

OS: CentOS 5.5 x86_64 with vanilla 2.6.35.11 kernel and drbd 8.3.10.
HW: Asus M3A-H mainboard, Phenom X4 965, 8G DDR2-800 ECC (EDAC enabled).
NICs (all Gigabit): Onboard Atheros L1, PCIe Intel 82572EI, PCIe Intel
  82574L (used as dedicated drbd link, directly connected, no switch)
Storage: drbd on top of 3-way raid-1 (Linux md software-raid of SATA
  drives), LVM on top of drbd, all filesystems ext3.

Again, if anything else is interesting (lsmod, lspci?), just tell me.

Regards,
Walter
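
For anyone who wants to tally failures like the one in the dmesg output above, here is a minimal Python sketch. It is not part of drbd or of the setup described here. It pulls the sector and length out of each "Digest integrity check FAILED" line so that recurring offsets can be spotted. The function names are made up for illustration, and the 512-byte sector size behind the "s" suffix is an assumption.

```python
import re
from collections import Counter

# Matches the failure line in the form it has in the dmesg excerpt, e.g.
#   drbd0: Digest integrity check FAILED: 182846680s +4096
FAILED_RE = re.compile(
    r"(?P<dev>drbd\d+): Digest integrity check FAILED: "
    r"(?P<sector>\d+)s \+(?P<size>\d+)"
)

# Assumption: the trailing "s" denotes 512-byte sectors.
SECTOR_BYTES = 512


def parse_failures(lines):
    """Return one dict per failed-digest line found in `lines`."""
    failures = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match is None:
            continue  # unrelated kernel message
        sector = int(match.group("sector"))
        failures.append({
            "dev": match.group("dev"),
            "sector": sector,
            "size": int(match.group("size")),
            "byte_offset": sector * SECTOR_BYTES,
        })
    return failures


def count_by_sector(failures):
    """Count how often each (device, sector) pair failed."""
    return Counter((f["dev"], f["sector"]) for f in failures)
```

Feeding it the lines of a saved kernel log, for example `count_by_sector(parse_failures(open(path)))` with `path` pointing at a saved dmesg file, shows whether the failures cluster on a few sectors or are spread across the device.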