Hello all again. Following up on the issue described below: with the integrity check enabled, I got a crash at least once per 24 hours. With the integrity check disabled, the cluster has now been running without crashes for the last 9 days. Could someone kindly provide some hints as to the possible reasons for this behavior? Offloading is disabled on both dedicated gigabit NICs. Also, is the integrity check really needed (I have read the documentation :) ) if it keeps breaking the cluster?

Thank you all for your time.

Theophanis Kontogiannis

On Tue, 2009-10-20 at 20:31 +0300, Theophanis Kontogiannis wrote:
> Hello all,
>
> Eventually I managed to get a log during a DRBD crash.
>
> I have a two-node RHEL 5.3 cluster with 2.6.18-164.el5xen and
> drbd-8.3.1-3, self-compiled.
>
> Both nodes have a dedicated 1G Ethernet back-to-back connection over
> RTL8169sb/8110sb cards.
>
> When I run applications that constantly read or write to the disks
> (active/active config), DRBD kept crashing.
>
> Now I have the logs showing the reason:
>
> ______________________
>
> ON TWEETY1
>
> Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
> Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.
> Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
> Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!
> Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
> Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
> Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
> Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
> Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
> Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
> Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
> Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
> Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
> Oct 20 15:46:52 localhost clurgmgrd: [4161]: <info> Executing /etc/init.d/drbd status
> Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
> Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
>
> ___________________________
>
> ON TWEETY2
>
> Oct 20 15:46:52 localhost kernel: drbd2: sock was reset by peer
> Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
> Oct 20 15:46:52 localhost kernel: drbd2: short read expecting header on sock: r=-104
> Oct 20 15:46:52 localhost kernel: drbd2: meta connection shut down by peer.
> Oct 20 15:46:52 localhost kernel: drbd2: asender terminated
> Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread
> Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID
> Oct 20 15:46:52 localhost kernel: drbd2: Connection closed
> Oct 20 15:46:52 localhost kernel: drbd2: helper command: /sbin/drbdadm fence-peer minor-2
>
> ____________________
>
> DRBD.CONF
>
> #
> # drbd.conf
> #
>
> global {
>     usage-count yes;
> }
>
> common {
>     protocol C;
>
>     syncer {
>         rate 100M;
>         al-extents 257;
>     }
>
>     handlers {
>         pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
>         pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
>         local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>         outdate-peer "/sbin/obliterate";
>         pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
>         split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
>     }
>
>     startup {
>         wfc-timeout 60;
>         degr-wfc-timeout 60; # 1 minute
>         become-primary-on both;
>     }
>
>     disk {
>         fencing resource-and-stonith;
>     }
>
>     net {
>         sndbuf-size 512k;
>
>         timeout 60;      # 6 seconds (unit = 0.1 seconds)
>         connect-int 10;  # 10 seconds (unit = 1 second)
>         ping-int 10;     # 10 seconds (unit = 1 second)
>         ping-timeout 50; # 5 seconds (unit = 0.1 seconds)
>
>         max-buffers 2048;
>         max-epoch-size 2048;
>         ko-count 10;
>
>         allow-two-primaries;
>
>         cram-hmac-alg "sha1";
>         shared-secret "*****";
>
>         after-sb-0pri discard-least-changes;
>         after-sb-1pri violently-as0p;
>         after-sb-2pri violently-as0p;
>         rr-conflict call-pri-lost;
>
>         data-integrity-alg "crc32c";
>     }
> }
>
> resource r0 {
>     device    /dev/drbd0;
>     disk      /dev/hda4;
>     meta-disk internal;
>     on tweety-1 { address 10.254.254.253:7788; }
>     on tweety-2 { address 10.254.254.254:7788; }
> }
>
> resource r1 {
>     device    /dev/drbd1;
>     disk      /dev/hdb4;
>     meta-disk internal;
>     on tweety-1 { address 10.254.254.253:7789; }
>     on tweety-2 { address 10.254.254.254:7789; }
> }
>
> resource r2 {
>     device    /dev/drbd2;
>     disk      /dev/sda1;
>     meta-disk internal;
>     on tweety-1 { address 10.254.254.253:7790; }
>     on tweety-2 { address 10.254.254.254:7790; }
> }
>
> _________
>
> Also available at http://pastebin.ca/1633173
>
> How can I solve this?
>
> Thank you all for your time.
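
[Editorial note] For anyone landing here with the same symptom: "Digest integrity check FAILED" messages are commonly reported when data buffers are modified while in flight (e.g. by swap or certain filesystems re-dirtying pages) or when NIC checksum/segmentation offloading interferes, rather than from real corruption on the wire. If you choose to run without the digest, a minimal sketch of the change, assuming the rest of the net section stays exactly as posted above:

    net {
        # ...all other net options unchanged...

        # data-integrity-alg "crc32c";   # commented out: no per-packet digest
    }

After editing drbd.conf identically on both nodes, `drbdadm adjust all` should apply the changed net options without a full restart; replication correctness then rests on TCP checksums plus protocol C acknowledgements, which is the default behavior when no data-integrity-alg is set.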