Hello,
We have set up our first HA two-node cluster and are currently running
soak tests (private DRBD replication link, private heartbeat link plus a
second heartbeat over the DRBD replication link, no dopd, active/passive).
Several times after a fail-over test, e.g. unplugging the DRBD replication
link, we have seen DRBD end up in a Primary/Unknown, Secondary/Unknown
state from which it will not resynchronize. We then have to issue several
drbdadm commands such as
ha1# drbdadm secondary drbd1
ha2# drbdadm detach drbd1
ha2# drbdadm -- --discard-my-data connect drbd1
which sometimes got it back. Our questions: is it normal for DRBD to end
up in such a state in the first place, what is the recommended recovery
procedure, and what causes this behaviour?
The worst case we actually got was Secondary/Unknown versus Unconfigured
after cutting the power to both nodes.
Is there any timeout we can tune to improve this?
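For reference, the full recovery sequence we have been trying is based on
our reading of the DRBD 8 documentation on manual split-brain recovery; we
pick ha2 as the "victim" whose changes get discarded (resource name drbd1
as in our config below):

    # On ha2, the node whose local changes we are willing to discard:
    ha2# drbdadm secondary drbd1
    ha2# drbdadm -- --discard-my-data connect drbd1

    # On ha1, the surviving node (only needed if it is in StandAlone
    # state; if it is still in WFConnection it reconnects on its own):
    ha1# drbdadm connect drbd1

Please correct us if this is not the intended procedure.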
Here are some details of our installation:
SuSE 10.3
2.6.22.5-31-default
heartbeat-2.1.3
drbd-8.2.6
ha1:/etc # more drbd.conf
global {
    usage-count yes;
}
common {
    syncer { rate 10M; }
}
resource drbd1 {
    protocol C;
    disk {
        on-io-error detach;
    }
    syncer {
        rate 10M;
        al-extents 257;
    }
    on ha1 {
        device    /dev/drbd1;
        disk      /dev/sdb1;
        address   192.168.50.151:7789;
        meta-disk internal;
    }
    on ha2 {
        device    /dev/drbd1;
        disk      /dev/sda7;
        address   192.168.50.152:7789;
        meta-disk internal;
    }
}
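We have also been wondering whether adding automatic split-brain recovery
policies to the net section of the resource would help; something like the
following, which is so far untested on our side (option names and values
taken from the drbd.conf man page for 8.x):

    net {
        # if neither node was Primary at split-brain time,
        # keep the data of the node that made changes:
        after-sb-0pri discard-zero-changes;
        # if one node was Primary, discard the Secondary's changes:
        after-sb-1pri discard-secondary;
        # if both were Primary, do not auto-resolve:
        after-sb-2pri disconnect;
    }

Would this be appropriate for an active/passive setup like ours?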
ha1:/etc # more ha.d/ha.cf
use_logd yes
# eth0 192.168.50 is private heartbeat lan
bcast eth0
# eth1 is private DRBD replication lan (192.168.51)
bcast eth1
keepalive 1
deadtime 10
initdead 30
node ha1 ha2
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
ping 172.16.1.1 172.16.1.254 172.16.1.245
Thank you for your time
Doro