[DRBD-user] DRBD + Heartbeat + timeouts

Thu Apr 12 16:07:03 CEST 2012

Hello,

I am testing DRBD 8.3.10 together with Heartbeat 2.1.4 and have run into a
problem related to the DRBD and Heartbeat timeouts. It looks like something
is wrong with the fencing mechanism or with the way DRBD closes the
connection.

Please have a look at my configuration:

drbdsetup 0 show
disk {
        size                    0s _is_default; # bytes
        on-io-error             call-local-io-error;
        fencing                 resource-only;
        max-bio-bvecs           0 _is_default;
}
net {
        timeout                 60 _is_default; # 1/10 seconds
        max-epoch-size          2048 _is_default;
        max-buffers             2048 _is_default;
        unplug-watermark        128 _is_default;
        connect-int             8; # seconds
        ping-int                7; # seconds
        sndbuf-size             0 _is_default; # bytes
        rcvbuf-size             0 _is_default; # bytes
        ko-count                0 _is_default;
        after-sb-0pri           disconnect _is_default;
        after-sb-1pri           disconnect _is_default;
        after-sb-2pri           disconnect _is_default;
        rr-conflict             disconnect _is_default;
        ping-timeout            5 _is_default; # 1/10 seconds
        on-congestion           block _is_default;
        congestion-fill         0s _is_default; # byte
        congestion-extents      127 _is_default;
}
syncer {
        rate                    40960k; # bytes/second
        after                   -1 _is_default;
        al-extents              127 _is_default;
        on-no-data-accessible   io-error _is_default;
        c-plan-ahead            0 _is_default; # 1/10 seconds
        c-delay-target          10 _is_default; # 1/10 seconds
        c-fill-target           0s _is_default; # bytes
        c-max-rate              102400k _is_default; # bytes/second
        c-min-rate              40960k; # bytes/second
}
protocol C;
_this_host {
        device                  minor 0;
        device                  "/dev/RA3nr2_drbd";
        disk                    "/dev/vg+vg00/lv0000";
        meta-disk               "/dev/vg+vg00/RA3nr2" [ 0 ];
        address                 ipv4 77.77.77.2:12000;
}
_remote_host {
        address                 ipv4 77.77.77.5:12000;
}

Heartbeat configuration:

cat /etc/ha.d/ha.cf
debug 0
use_logd off
debugfile /mnt/hda1/log/ha-debug
logfile /mnt/hda1/log/ha-log
logfacility local0
uuidfrom nodename
udpport 694
auto_failback off
node 23751186 57017938
warntime 5000ms
deadtime 10000ms
initdead 15000ms
keepalive 1000ms
ping_group default_group 77.77.77.89
respawn hacluster /usr/lib/heartbeat/ipfail
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
ucast eth1 192.168.248.176
ucast eth2 77.77.77.5

So the DRBD timeouts are lower than the Heartbeat deadtime.
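
To make that comparison concrete, here is a rough sketch of the arithmetic
in Python. It assumes that on an idle link DRBD notices the failure after
at most ping-int plus ping-timeout, which is a simplification, and the
function names are only for illustration. The values are taken from the
configuration above.

```python
# Values copied from the configuration above.
PING_INT_S = 7        # net: ping-int 7 (seconds)
PING_TIMEOUT_DS = 5   # net: ping-timeout 5 (1/10 seconds)
DEADTIME_MS = 10000   # ha.cf: deadtime 10000ms


def drbd_detection_seconds(ping_int_s, ping_timeout_ds):
    """Assumed worst case on an idle link: one ping interval plus the ping timeout."""
    return ping_int_s + ping_timeout_ds / 10.0


def margin_seconds(deadtime_ms, detection_s):
    """Time left between DRBD noticing the failure and the Heartbeat deadtime."""
    return deadtime_ms / 1000.0 - detection_s


detection = drbd_detection_seconds(PING_INT_S, PING_TIMEOUT_DS)
margin = margin_seconds(DEADTIME_MS, detection)
print(f"DRBD notices the dead link after about {detection:.1f} s")
print(f"Margin left before the Heartbeat deadtime: {margin:.1f} s")
```

Under that assumption, anything that delays the fence-peer call by more
than the remaining margin would push it past the Heartbeat deadtime.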

After unplugging the replication link, dmesg reports after about 7 seconds
that the PingAck did not arrive in time:

[15904.320048] block drbd0: PingAck did not arrive in time.
[15904.320062] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[15904.328676] block drbd0: asender terminated
[15904.328684] block drbd0: Terminating asender thread

Here there is a delay of about 11 seconds, so the peer's resource is marked
as outdated too late:

[15915.810087] block drbd0: new current UUID D9FCA453B58FC9FD:17A9D53C5618F549:636F1103C1016C39:636E1103C1016C39
[15915.823820] block drbd0: Connection closed
[15915.861581] block drbd0: conn( NetworkFailure -> Unconnected )
[15915.861590] block drbd0: receiver terminated
[15915.861593] block drbd0: Restarting receiver thread
[15915.861596] block drbd0: receiver (re)started
[15915.861600] block drbd0: conn( Unconnected -> WFConnection )
[15915.861609] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[15916.235256] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 5 (0x500)
[15916.235263] block drbd0: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
[15916.235274] block drbd0: pdsk( DUnknown -> Outdated )

As a consequence, the secondary node promotes its DRBD resource to primary,
because it has not been marked as outdated in time.

This behavior is not reproducible every time. Sometimes the delay between
the PingAck timeout and the closing of the connection is not noticeable:

[13572.460018] block drbd0: PingAck did not arrive in time.
[13572.460030] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[13572.460083] block drbd0: new current UUID BABA4F731623F3BB:7660656B84345C57:4B1F7442F84E5321:4B1E7442F84E5321
[13572.478105] block drbd0: asender terminated
[13572.478109] block drbd0: Terminating asender thread
[13572.486439] block drbd0: Connection closed
[13572.486475] block drbd0: conn( NetworkFailure -> Unconnected )
[13572.486480] block drbd0: receiver terminated
[13572.486482] block drbd0: Restarting receiver thread
[13572.486484] block drbd0: receiver (re)started
[13572.486488] block drbd0: conn( Unconnected -> WFConnection )
[13572.486597] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[13574.917411] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
[13574.917417] block drbd0: fence-peer helper returned 4 (peer was fenced)
[13574.917426] block drbd0: pdsk( DUnknown -> Outdated )
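
To compare the two runs, the delays can be read off the dmesg timestamps.
The following Python sketch does that for a few key lines copied from the
logs above. The helper names are hypothetical and it only measures time
relative to the "PingAck did not arrive in time" message, not relative to
the moment the cable was unplugged.

```python
import re

# Key dmesg lines from the two runs quoted above.
RUN_SLOW = """\
[15904.320048] block drbd0: PingAck did not arrive in time.
[15915.823820] block drbd0: Connection closed
[15915.861609] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[15916.235274] block drbd0: pdsk( DUnknown -> Outdated )
"""
RUN_FAST = """\
[13572.460018] block drbd0: PingAck did not arrive in time.
[13572.486439] block drbd0: Connection closed
[13572.486597] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[13574.917426] block drbd0: pdsk( DUnknown -> Outdated )
"""

# Kernel timestamp at the start of a dmesg line, e.g. "[15904.320048]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamps(log):
    """Return the kernel timestamps (seconds) of all lines that carry one."""
    found = (TIMESTAMP.match(line) for line in log.splitlines())
    return [float(m.group(1)) for m in found if m]


def gaps(log):
    """Seconds elapsed from the first line to each later line, rounded to ms."""
    ts = timestamps(log)
    return [round(t - ts[0], 3) for t in ts[1:]]


labels = ("Connection closed", "fence-peer called", "peer Outdated")
for name, log in (("slow run", RUN_SLOW), ("fast run", RUN_FAST)):
    for label, gap in zip(labels, gaps(log)):
        print(f"{name}: {label} after {gap:.3f} s")
```

In the first run the peer is marked Outdated almost 12 seconds after the
PingAck timeout, in the second run after about 2.5 seconds.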

I also tried DRBD 8.3.12, but that did not solve the problem.

What is wrong with my configuration? Should I increase the difference
between the DRBD and Heartbeat timeouts?

-- 
Best Regards
Artur Piechocki
Open-E Software Development Department