Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Dear All,
I have two cluster nodes that use DRBD 8.3 (compiled and installed by me as an rpm) as the shared block device.
The two identical systems run kernel 2.6.18-92.1.22.el5.centos.plus with drbd-8.3.0-3 and drbd-km-2.6.18_92.1.22.el5.centos.plus-8.3.0-3.
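(For completeness, the packages were built from the 8.3.0 tarball roughly as follows; this is from memory, so the exact make targets may be slightly off:)

  tar xzf drbd-8.3.0.tar.gz && cd drbd-8.3.0
  make rpm      # builds the userland drbd package
  make km-rpm   # builds the kernel module package against the running kernel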
On top of the DRBD resources I have LVM (clustered), and on top of that GFS2; a rough sketch of the layering is below.
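For context, the stack is layered roughly like this (a simplified sketch; the VG/LV/cluster names, mount point and journal count are just placeholders, not the real ones):

  # the DRBD device is the PV for a clustered VG
  pvcreate /dev/drbd1
  vgcreate --clustered y vg_data /dev/drbd1
  lvcreate -n lv_data -l 100%FREE vg_data

  # GFS2 with dlm locking on top, one journal per node
  mkfs.gfs2 -p lock_dlm -t mycluster:data -j 2 /dev/vg_data/lv_data
  mount -t gfs2 /dev/vg_data/lv_data /data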
The issue is the following.
Both nodes run as Primary/Primary. Various applications run on both nodes and write to the filesystem concurrently (though never to the same files, or even the same directories). Randomly but consistently (it happens at least once per day), I get the following errors:
On node 1:
........
drbd1: susp( 1 -> 0 )
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 9
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 8
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 7
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 6
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 5
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 4
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 3
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 2
drbd1: [drbd1_worker/4658] sock_sendmsg time expired, ko = 1
drbd1: peer( Primary -> Unknown ) conn( WFBitMapS -> Timeout ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
drbd1: short sent ReportBitMap size=4096 sent=276
drbd1: short read expecting header on sock: r=-512
drbd1: asender terminated
drbd1: Terminating asender thread
drbd1: Connection closed
drbd1: helper command: /sbin/drbdadm fence-peer minor-1
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 2 (0x200)
drbd1: fence-peer helper broken, returned 2
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:Timeout ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Timeout -> Unconnected )
drbd1: receiver terminated
drbd1: Restarting receiver thread
drbd1: receiver (re)started
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Unconnected -> WFConnection )
...............
[root@tweety-1 ~]# drbdadm status
<drbd-status version="8.3.0" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="Connected" ro1="Primary" ro2="Primary"
ds1="UpToDate" ds2="UpToDate" />
<resource minor="1" name="r1" cs="WFConnection" ro1="Primary" ro2="Unknown"
ds1="UpToDate" ds2="DUnknown" suspended />
</resources>
</drbd-status>
On node 2:
..........
drbd1: sock was reset by peer
drbd1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
drbd1: short read expecting header on sock: r=-104
drbd1: meta connection shut down by peer.
drbd1: asender terminated
drbd1: Terminating asender thread
drbd1: Creating new current UUID
drbd1: Connection closed
drbd1: helper command: /sbin/drbdadm fence-peer minor-1
drbd1: helper command: /sbin/drbdadm fence-peer minor-1 exit code 2 (0x200)
drbd1: fence-peer helper broken, returned 2
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:BrokenPipe ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( BrokenPipe -> Unconnected )
drbd1: receiver terminated
drbd1: Restarting receiver thread
drbd1: receiver (re)started
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( Unconnected -> WFConnection )
drbd1: Handshake successful: Agreed network protocol version 89
drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
drbd1: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
drbd1: old = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: new = { cs:WFReportParams ro:Primary/Unknown ds:UpToDate/DUnknown s--- }
drbd1: conn( WFConnection -> WFReportParams )
drbd1: Starting asender thread (from drbd1_receiver [4858])
drbd1: data-integrity-alg: crc32c
drbd1: meta connection shut down by peer.
drbd1: conn( WFReportParams -> NetworkFailure )
drbd1: asender terminated
drbd1: Terminating asender thread
..........
[root@tweety-2 ~]# drbdadm status
<drbd-status version="8.3.0" api="88">
<resources config_file="/etc/drbd.conf">
<resource minor="0" name="r0" cs="Connected" ro1="Primary" ro2="Primary"
ds1="UpToDate" ds2="UpToDate" />
<resource minor="1" name="r1" cs="NetworkFailure" ro1="Primary"
ro2="Unknown" ds1="UpToDate" ds2="DUnknown" suspended />
</resources>
</drbd-status>
My drbd.conf is the following:
global {
    # minor-count 64;
    # dialog-refresh 5;  # 5 seconds
    # disable-ip-verification;
    usage-count yes;
}

common {
    protocol C;

    syncer {
        rate       100M;
        al-extents 257;
    }

    handlers {
        pri-on-incon-degr "echo b > /proc/sysrq-trigger ; reboot -f";
        pri-lost-after-sb "echo b > /proc/sysrq-trigger ; reboot -f";
        local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
        outdate-peer      "/sbin/obliterate";
        pri-lost          "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root; echo b > /proc/sysrq-trigger ; reboot -f";
        split-brain       "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root";
    }

    startup {
        wfc-timeout       100;
        degr-wfc-timeout  60;   # 1 minute
        #wait-after-sb;
        become-primary-on both;
    }

    disk {
        #on-io-error pass-on;
        fencing resource-and-stonith;
    }

    net {
        sndbuf-size    512k;
        timeout        60;     # (unit = 0.1 seconds)
        connect-int    10;     # (unit = 1 second)
        ping-int       10;     # (unit = 1 second)
        ping-timeout   50;     # (unit = 0.1 seconds)
        max-buffers    2048;
        max-epoch-size 2048;
        ko-count       10;
        allow-two-primaries;
        cram-hmac-alg  "sha1";
        shared-secret  "tweety";
        after-sb-0pri  discard-least-changes;
        after-sb-1pri  violently-as0p;
        after-sb-2pri  violently-as0p;
        rr-conflict    call-pri-lost;
        data-integrity-alg "crc32c";
    }
}
resource r0 {
    device    /dev/drbd0;
    disk      /dev/hda4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7788; }
    on tweety-2 { address 10.254.254.254:7788; }
}

resource r1 {
    device    /dev/drbd1;
    disk      /dev/hdb4;
    meta-disk internal;
    on tweety-1 { address 10.254.254.253:7789; }
    on tweety-2 { address 10.254.254.254:7789; }
}
I have no idea what causes this, and googling did not help.
Obviously this error renders the cluster useless.
The processes get demonized, and since no fencing is performed (is that related to the errors above?), manual intervention is needed.
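One thing I notice in both logs is "fence-peer helper broken, returned 2". If I read the drbd.conf man page correctly, DRBD only accepts a handful of exit codes from the fence-peer handler (for example 4 = peer outdated, 5 = peer unreachable, 7 = peer fenced/STONITHed), and anything else is reported as a broken helper, which may be why it looks as if no fencing happens and IO stays suspended. A minimal wrapper sketch along those lines (purely illustrative; fence_node is the cman fencing command, the peer name is hard-coded here, and this is not what /sbin/obliterate actually does):

  #!/bin/bash
  # Illustrative fence-peer wrapper: fence the peer via cman and report the
  # result back to DRBD with an exit code it recognises.
  PEER=tweety-2        # would have to be determined per node in a real setup

  if fence_node "$PEER"; then
      exit 7           # peer successfully fenced (resource-and-stonith case)
  fi
  exit 1               # fencing failed; DRBD should keep IO suspended

If that is the case, would it be enough to have /sbin/obliterate (or a wrapper around it) translate its result into one of the recognised codes, and to use the fence-peer handler name (which, if I understand the 8.3 release notes correctly, replaces outdate-peer)?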
Could someone be kind enough to share their knowledge with me on what the problem is, what might cause it, and how to solve it?
Thank you All for your time.
Theophanis Kontogiannis