Hello,
I am testing DRBD 8.3.10 with Heartbeat 2.1.4 and have run into a
problem related to the DRBD and Heartbeat timeouts. It looks like
something is wrong with the fencing mechanism or with the way DRBD
closes the connection. Please have a look at my configuration:
drbdsetup 0 show
disk {
    size 0s _is_default; # bytes
    on-io-error call-local-io-error;
    fencing resource-only;
    max-bio-bvecs 0 _is_default;
}
net {
    timeout 60 _is_default; # 1/10 seconds
    max-epoch-size 2048 _is_default;
    max-buffers 2048 _is_default;
    unplug-watermark 128 _is_default;
    connect-int 8; # seconds
    ping-int 7; # seconds
    sndbuf-size 0 _is_default; # bytes
    rcvbuf-size 0 _is_default; # bytes
    ko-count 0 _is_default;
    after-sb-0pri disconnect _is_default;
    after-sb-1pri disconnect _is_default;
    after-sb-2pri disconnect _is_default;
    rr-conflict disconnect _is_default;
    ping-timeout 5 _is_default; # 1/10 seconds
    on-congestion block _is_default;
    congestion-fill 0s _is_default; # byte
    congestion-extents 127 _is_default;
}
syncer {
    rate 40960k; # bytes/second
    after -1 _is_default;
    al-extents 127 _is_default;
    on-no-data-accessible io-error _is_default;
    c-plan-ahead 0 _is_default; # 1/10 seconds
    c-delay-target 10 _is_default; # 1/10 seconds
    c-fill-target 0s _is_default; # bytes
    c-max-rate 102400k _is_default; # bytes/second
    c-min-rate 40960k; # bytes/second
}
protocol C;
_this_host {
    device minor 0;
    device "/dev/RA3nr2_drbd";
    disk "/dev/vg+vg00/lv0000";
    meta-disk "/dev/vg+vg00/RA3nr2" [ 0 ];
    address ipv4 77.77.77.2:12000;
}
_remote_host {
    address ipv4 77.77.77.5:12000;
}
Heartbeat configuration:
cat /etc/ha.d/ha.cf
debug 0
use_logd off
debugfile /mnt/hda1/log/ha-debug
logfile /mnt/hda1/log/ha-log
logfacility local0
uuidfrom nodename
udpport 694
auto_failback off
node 23751186 57017938
warntime 5000ms
deadtime 10000ms
initdead 15000ms
keepalive 1000ms
ping_group default_group 77.77.77.89
respawn hacluster /usr/lib/heartbeat/ipfail
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
ucast eth1 192.168.248.176
ucast eth2 77.77.77.5
So the DRBD timeouts are lower than the Heartbeat deadtime.
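As a rough check of that comparison, the arithmetic can be sketched in
Python with the values from the two configurations above. The sketch
assumes that a dead, idle link is noticed after about ping-int plus
ping-timeout; that is a simplification, not a statement about DRBD
internals, and the variable names are only illustrative.

```python
# Values copied from the drbdsetup and ha.cf output above.
ping_int_s = 7            # ping-int, in seconds
ping_timeout_ds = 5       # ping-timeout, in 1/10 seconds
deadtime_ms = 10000       # Heartbeat deadtime, in milliseconds

# Assumption: an idle, broken link is detected after roughly
# ping-int + ping-timeout (a simplified model).
drbd_detect_s = ping_int_s + ping_timeout_ds / 10
deadtime_s = deadtime_ms / 1000
margin_s = deadtime_s - drbd_detect_s

print(f"DRBD detection ~{drbd_detect_s:.1f} s, "
      f"Heartbeat deadtime {deadtime_s:.1f} s, "
      f"margin {margin_s:.1f} s")
```

Under that assumption the margin between DRBD noticing the failure and
Heartbeat declaring the node dead is only about 2.5 seconds.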
About 7 seconds after I unplug the replication link, dmesg reports that
the PingAck did not arrive in time:
[15904.320048] block drbd0: PingAck did not arrive in time.
[15904.320062] block drbd0: peer( Secondary -> Unknown ) conn( Connected
-> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[15904.328676] block drbd0: asender terminated
[15904.328684] block drbd0: Terminating asender thread
Then there is a delay of about 11 seconds, so the peer is marked as
outdated too late:
[15915.810087] block drbd0: new current UUID
D9FCA453B58FC9FD:17A9D53C5618F549:636F1103C1016C39:636E1103C1016C39
[15915.823820] block drbd0: Connection closed
[15915.861581] block drbd0: conn( NetworkFailure -> Unconnected )
[15915.861590] block drbd0: receiver terminated
[15915.861593] block drbd0: Restarting receiver thread
[15915.861596] block drbd0: receiver (re)started
[15915.861600] block drbd0: conn( Unconnected -> WFConnection )
[15915.861609] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[15916.235256] block drbd0: helper command: /sbin/drbdadm fence-peer
minor-0 exit code 5 (0x500)
[15916.235263] block drbd0: fence-peer helper returned 5 (peer is
unreachable, assumed to be dead)
[15916.235274] block drbd0: pdsk( DUnknown -> Outdated )
As a consequence, the secondary node promotes the DRBD resource to
primary, because it has not been marked as outdated.
This behavior does not occur every time. Sometimes the delay between
the missed PingAck and the closing of the connection is negligible:
[13572.460018] block drbd0: PingAck did not arrive in time.
[13572.460030] block drbd0: peer( Secondary -> Unknown ) conn( Connected
-> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[13572.460083] block drbd0: new current UUID
BABA4F731623F3BB:7660656B84345C57:4B1F7442F84E5321:4B1E7442F84E5321
[13572.478105] block drbd0: asender terminated
[13572.478109] block drbd0: Terminating asender thread
[13572.486439] block drbd0: Connection closed
[13572.486475] block drbd0: conn( NetworkFailure -> Unconnected )
[13572.486480] block drbd0: receiver terminated
[13572.486482] block drbd0: Restarting receiver thread
[13572.486484] block drbd0: receiver (re)started
[13572.486488] block drbd0: conn( Unconnected -> WFConnection )
[13572.486597] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[13574.917411] block drbd0: helper command: /sbin/drbdadm fence-peer
minor-0 exit code 4 (0x400)
[13574.917417] block drbd0: fence-peer helper returned 4 (peer was fenced)
[13574.917426] block drbd0: pdsk( DUnknown -> Outdated )
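To put numbers on the two cases, the delays can be computed from the
dmesg timestamps. The following is only a sketch: the log lines are
copied from the excerpts above and trimmed to the relevant events, and
the helper names (first_stamp, delays) are made up for illustration.

```python
import re

# Lines copied from the two dmesg excerpts above (relevant events only).
slow_case = """\
[15904.320048] block drbd0: PingAck did not arrive in time.
[15915.823820] block drbd0: Connection closed
[15915.861609] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[15916.235274] block drbd0: pdsk( DUnknown -> Outdated )
"""
fast_case = """\
[13572.460018] block drbd0: PingAck did not arrive in time.
[13572.486439] block drbd0: Connection closed
[13572.486597] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[13574.917426] block drbd0: pdsk( DUnknown -> Outdated )
"""

# Matches "[seconds.micros] message" as printed by dmesg.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def first_stamp(log, needle):
    """Return the timestamp (seconds) of the first line containing needle."""
    for line in log.splitlines():
        m = STAMP.match(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    raise ValueError(f"{needle!r} not found")

def delays(log):
    """Seconds from the missed PingAck to the fence-peer call and to Outdated."""
    t0 = first_stamp(log, "PingAck did not arrive")
    to_fence = first_stamp(log, "helper command") - t0
    to_outdated = first_stamp(log, "-> Outdated") - t0
    return round(to_fence, 3), round(to_outdated, 3)

print("slow case:", delays(slow_case))
print("fast case:", delays(fast_case))
```

For these excerpts the fence-peer helper is called roughly 11.5 seconds
after the missed PingAck in the first case, and less than 0.03 seconds
after it in the second.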
I also tried DRBD 8.3.12, but it did not solve the problem.
What is wrong with my configuration? Should I increase the difference
between the DRBD and Heartbeat timeouts?
--
Best Regards
Artur Piechocki
Open-E Software Development Department