<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
<META NAME="GENERATOR" CONTENT="GtkHTML/3.26.3">
</HEAD>
<BODY>
Hello all,<BR>
<BR>
I finally managed to capture a log during a DRBD crash.<BR>
<BR>
I have a two-node RHEL 5.3 cluster running kernel 2.6.18-164.el5xen and a self-compiled drbd-8.3.1-3.<BR>
<BR>
Both nodes have a dedicated back-to-back 1 Gbit Ethernet connection over RTL8169sb/8110sb cards.<BR>
<BR>
Whenever I run applications that constantly read from or write to the disks (active/active configuration), DRBD keeps crashing.<BR>
<BR>
I now have logs that show why:<BR>
<BR>
<BR>
______________________<BR>
ON TWEETY1<BR>
<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Digest integrity check FAILED.<BR>
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!<BR>
Oct 20 15:46:52 localhost kernel: drbd2: error receiving Data, l: 540!<BR>
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 ) <BR>
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 ) <BR>
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated<BR>
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID<BR>
Oct 20 15:46:52 localhost clurgmgrd: [4161]: &lt;info&gt; Executing /etc/init.d/drbd status <BR>
Oct 20 15:46:52 localhost clurgmgrd: [4161]: &lt;info&gt; Executing /etc/init.d/drbd status <BR>
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed<BR>
<BR>
___________________________<BR>
<BR>
ON TWEETY2<BR>
<BR>
<BR>
Oct 20 15:46:52 localhost kernel: drbd2: sock was reset by peer<BR>
Oct 20 15:46:52 localhost kernel: drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 ) <BR>
Oct 20 15:46:52 localhost kernel: drbd2: short read expecting header on sock: r=-104<BR>
Oct 20 15:46:52 localhost kernel: drbd2: meta connection shut down by peer.<BR>
Oct 20 15:46:52 localhost kernel: drbd2: asender terminated<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Terminating asender thread<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Creating new current UUID<BR>
Oct 20 15:46:52 localhost kernel: drbd2: Connection closed<BR>
Oct 20 15:46:52 localhost kernel: drbd2: helper command: /sbin/drbdadm fence-peer minor-2<BR>
<BR>
____________________<BR>
<BR>
<BR>
DRBD.CONF<BR>
<BR>
<BR>
#<BR>
# drbd.conf<BR>
#<BR>
<BR>
<BR>
global {<BR>
<BR>
usage-count yes;<BR>
}<BR>
<BR>
<BR>
common {<BR>
<BR>
protocol C;<BR>
<BR>
syncer {<BR>
<BR>
rate 100M;<BR>
<BR>
al-extents 257;<BR>
}<BR>
<BR>
<BR>
handlers {<BR>
<BR>
pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";<BR>
<BR>
pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";<BR>
<BR>
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";<BR>
<BR>
outdate-peer "/sbin/obliterate";<BR>
<BR>
<BR>
pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";<BR>
<BR>
split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";<BR>
<BR>
}<BR>
<BR>
startup {<BR>
<BR>
wfc-timeout 60;<BR>
<BR>
<BR>
degr-wfc-timeout 60; # 1 minute<BR>
<BR>
<BR>
become-primary-on both;<BR>
<BR>
}<BR>
<BR>
disk {<BR>
<BR>
fencing resource-and-stonith;<BR>
<BR>
<BR>
}<BR>
<BR>
net {<BR>
<BR>
sndbuf-size 512k;<BR>
<BR>
timeout 60; # 6 seconds (unit = 0.1 seconds)<BR>
connect-int 10; # 10 seconds (unit = 1 second)<BR>
ping-int 10; # 10 seconds (unit = 1 second)<BR>
ping-timeout 50; # 5 seconds (unit = 0.1 seconds)<BR>
<BR>
max-buffers 2048;<BR>
<BR>
max-epoch-size 2048;<BR>
<BR>
ko-count 10;<BR>
<BR>
<BR>
allow-two-primaries;<BR>
<BR>
<BR>
cram-hmac-alg "sha1";<BR>
shared-secret "*****";<BR>
<BR>
<BR>
after-sb-0pri discard-least-changes;<BR>
<BR>
after-sb-1pri violently-as0p;<BR>
<BR>
<BR>
after-sb-2pri violently-as0p;<BR>
<BR>
<BR>
rr-conflict call-pri-lost;<BR>
<BR>
<BR>
data-integrity-alg "crc32c";<BR>
<BR>
}<BR>
<BR>
<BR>
}<BR>
<BR>
<BR>
resource r0 {<BR>
<BR>
device /dev/drbd0;<BR>
disk /dev/hda4;<BR>
meta-disk internal;<BR>
<BR>
on tweety-1 { address 10.254.254.253:7788; }<BR>
<BR>
on tweety-2 { address 10.254.254.254:7788; }<BR>
<BR>
}<BR>
<BR>
resource r1 {<BR>
<BR>
device /dev/drbd1;<BR>
disk /dev/hdb4;<BR>
meta-disk internal;<BR>
<BR>
on tweety-1 { address 10.254.254.253:7789; }<BR>
<BR>
on tweety-2 { address 10.254.254.254:7789; }<BR>
}<BR>
<BR>
resource r2 {<BR>
<BR>
        device                /dev/drbd2;<BR>
        disk                /dev/sda1;<BR>
        meta-disk        internal;<BR>
<BR>
on tweety-1 { address 10.254.254.253:7790; }<BR>
<BR>
on tweety-2 { address 10.254.254.254:7790; }<BR>
}<BR>
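<BR>
As a sanity check on the net timeouts above, here is a minimal Python sketch that converts them to seconds. It assumes the units are as stated in the comments of my drbd.conf (timeout and ping-timeout in tenths of a second, connect-int and ping-int in whole seconds). The names UNITS_PER_SECOND, NET_VALUES and effective_timeouts are made up for this illustration and are not part of DRBD.<BR>
<BR>

```python
# Illustrative only: convert the net-section values from the drbd.conf above
# into seconds. Units are assumed from the comments in that file.

# How many configured units make up one second, per option (assumption).
UNITS_PER_SECOND = {
    "timeout": 10,       # tenths of a second
    "connect-int": 1,    # whole seconds
    "ping-int": 1,       # whole seconds
    "ping-timeout": 10,  # tenths of a second
}

# Values copied from the net section of the configuration above.
NET_VALUES = {
    "timeout": 60,
    "connect-int": 10,
    "ping-int": 10,
    "ping-timeout": 50,
}


def effective_timeouts(values):
    """Return a dict mapping each option name to its duration in seconds."""
    return {name: raw / UNITS_PER_SECOND[name] for name, raw in values.items()}


if __name__ == "__main__":
    for name, seconds in effective_timeouts(NET_VALUES).items():
        print(f"{name}: {seconds:g} s")
```

<BR>
Under those assumptions, ping-timeout 50 works out to 5 seconds and timeout 60 to 6 seconds.<BR>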
<BR>
_________<BR>
<BR>
Also available at <A HREF="http://pastebin.ca/1633173">http://pastebin.ca/1633173</A><BR>
<BR>
<BR>
How can I solve this?<BR>
<BR>
Thank you all for your time.<BR>
<BR>
<BR>
</BODY>
</HTML>