Hello,

I am testing DRBD 8.3.10 with Heartbeat 2.1.4 and have run into a problem
related to the DRBD and Heartbeat timeouts. It looks like something is wrong
with the fencing mechanism or with the way DRBD closes the connection.
Please have a look at my configuration:

drbdsetup 0 show

disk {
        size                    0s _is_default; # bytes
        on-io-error             call-local-io-error;
        fencing                 resource-only;
        max-bio-bvecs           0 _is_default;
}
net {
        timeout                 60 _is_default; # 1/10 seconds
        max-epoch-size          2048 _is_default;
        max-buffers             2048 _is_default;
        unplug-watermark        128 _is_default;
        connect-int             8; # seconds
        ping-int                7; # seconds
        sndbuf-size             0 _is_default; # bytes
        rcvbuf-size             0 _is_default; # bytes
        ko-count                0 _is_default;
        after-sb-0pri           disconnect _is_default;
        after-sb-1pri           disconnect _is_default;
        after-sb-2pri           disconnect _is_default;
        rr-conflict             disconnect _is_default;
        ping-timeout            5 _is_default; # 1/10 seconds
        on-congestion           block _is_default;
        congestion-fill         0s _is_default; # byte
        congestion-extents      127 _is_default;
}
syncer {
        rate                    40960k; # bytes/second
        after                   -1 _is_default;
        al-extents              127 _is_default;
        on-no-data-accessible   io-error _is_default;
        c-plan-ahead            0 _is_default; # 1/10 seconds
        c-delay-target          10 _is_default; # 1/10 seconds
        c-fill-target           0s _is_default; # bytes
        c-max-rate              102400k _is_default; # bytes/second
        c-min-rate              40960k; # bytes/second
}
protocol C;
_this_host {
        device                  minor 0;
        device                  "/dev/RA3nr2_drbd";
        disk                    "/dev/vg+vg00/lv0000";
        meta-disk               "/dev/vg+vg00/RA3nr2" [ 0 ];
        address                 ipv4 77.77.77.2:12000;
}
_remote_host {
        address                 ipv4 77.77.77.5:12000;

Heartbeat configuration:

cat /etc/ha.d/ha.cf

debug 0
use_logd off
debugfile /mnt/hda1/log/ha-debug
logfile /mnt/hda1/log/ha-log
logfacility local0
uuidfrom nodename
udpport 694
auto_failback off
node 23751186 57017938
warntime 5000ms
deadtime 10000ms
initdead 15000ms
keepalive 1000ms
ping_group default_group 77.77.77.89
respawn hacluster /usr/lib/heartbeat/ipfail
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster
ucast eth1 192.168.248.176
ucast eth2 77.77.77.5

So the DRBD timeouts are lower than the Heartbeat deadtime.

About 7 seconds after I unplug the replication link, dmesg reports that
the PingAck did not arrive in time:

[15904.320048] block drbd0: PingAck did not arrive in time.
[15904.320062] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[15904.328676] block drbd0: asender terminated
[15904.328684] block drbd0: Terminating asender thread

Here there is a delay of about 11 seconds, so the peer is marked as
outdated too late:

[15915.810087] block drbd0: new current UUID D9FCA453B58FC9FD:17A9D53C5618F549:636F1103C1016C39:636E1103C1016C39
[15915.823820] block drbd0: Connection closed
[15915.861581] block drbd0: conn( NetworkFailure -> Unconnected )
[15915.861590] block drbd0: receiver terminated
[15915.861593] block drbd0: Restarting receiver thread
[15915.861596] block drbd0: receiver (re)started
[15915.861600] block drbd0: conn( Unconnected -> WFConnection )
[15915.861609] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[15916.235256] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 5 (0x500)
[15916.235263] block drbd0: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
[15916.235274] block drbd0: pdsk( DUnknown -> Outdated )

As a consequence, the secondary node promotes the DRBD resource to primary,
because its copy has not been marked as outdated.

This behavior does not occur every time. Sometimes there is no noticeable
delay between the missing PingAck and the connection being closed:

[13572.460018] block drbd0: PingAck did not arrive in time.
[13572.460030] block drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[13572.460083] block drbd0: new current UUID BABA4F731623F3BB:7660656B84345C57:4B1F7442F84E5321:4B1E7442F84E5321
[13572.478105] block drbd0: asender terminated
[13572.478109] block drbd0: Terminating asender thread
[13572.486439] block drbd0: Connection closed
[13572.486475] block drbd0: conn( NetworkFailure -> Unconnected )
[13572.486480] block drbd0: receiver terminated
[13572.486482] block drbd0: Restarting receiver thread
[13572.486484] block drbd0: receiver (re)started
[13572.486488] block drbd0: conn( Unconnected -> WFConnection )
[13572.486597] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[13574.917411] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
[13574.917417] block drbd0: fence-peer helper returned 4 (peer was fenced)
[13574.917426] block drbd0: pdsk( DUnknown -> Outdated )

I also tried DRBD 8.3.12, but it did not solve the problem.

What is wrong with my configuration? Should I increase the difference
between the DRBD and Heartbeat timeouts?

-- 
Best Regards
Artur Piechocki
Open-E Software Development Department
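
For reference, the delays described above can be read off the dmesg timestamps directly. The following is only a minimal sketch: the function names and the trimmed log excerpts are made up for illustration, the regular expression assumes the "[seconds] block drbdN: message" layout seen in the logs above, and comparing against the Heartbeat deadtime simply follows the framing of this post rather than any documented rule.

```python
import re

# Matches kernel log lines such as "[15904.320048] block drbd0: PingAck ..."
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+block drbd\d+:\s+(.*)")

HEARTBEAT_DEADTIME = 10.0  # seconds, from "deadtime 10000ms" in ha.cf


def first_timestamp(log, needle):
    """Return the timestamp (in seconds) of the first line containing needle."""
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match and needle in match.group(2):
            return float(match.group(1))
    return None


def fencing_delays(log):
    """Return (seconds until fence-peer is called, seconds until pdsk is Outdated),
    both measured from the 'PingAck did not arrive in time' message."""
    lost = first_timestamp(log, "PingAck did not arrive in time")
    helper = first_timestamp(log, "helper command: /sbin/drbdadm fence-peer")
    outdated = first_timestamp(log, "pdsk( DUnknown -> Outdated )")
    return helper - lost, outdated - lost


# Excerpts copied from the two dmesg runs quoted above (only the relevant lines).
SLOW_RUN = """\
[15904.320048] block drbd0: PingAck did not arrive in time.
[15915.861609] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[15916.235274] block drbd0: pdsk( DUnknown -> Outdated )
"""

FAST_RUN = """\
[13572.460018] block drbd0: PingAck did not arrive in time.
[13572.486597] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
[13574.917426] block drbd0: pdsk( DUnknown -> Outdated )
"""

for name, log in (("slow run", SLOW_RUN), ("fast run", FAST_RUN)):
    to_helper, to_outdated = fencing_delays(log)
    late = to_outdated > HEARTBEAT_DEADTIME
    print(f"{name}: fence-peer after {to_helper:.3f}s, "
          f"Outdated after {to_outdated:.3f}s, "
          f"{'later' if late else 'earlier'} than deadtime")
```

For the two excerpts this gives roughly 11.5 seconds until the fence-peer helper is called in the slow run and well under a second in the fast run, which matches the delays described in the text.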