Hello,

We have set up our first HA 2-node cluster and are currently running soak
tests (private DRBD replication link, private heartbeat link plus a second
heartbeat path over the DRBD replication link, no dopd, active/passive).

Several times after a fail-over test, e.g. unplugging the DRBD replication
link, DRBD has ended up in a Primary/Unknown, Secondary/Unknown state and
will not resynchronize. We then have to issue several drbdadm commands,
such as:

  ha1# drbdadm secondary drbd1
  ha2# drbdadm detach drbd1
  ha2# drbdadm -- --discard-my-data connect drbd1

which sometimes get it back into sync.

Our questions: is it normal behaviour for DRBD to end up in such a state
in the first place, what is the recommended recovery, and what is causing
this behaviour? The worst case we have actually seen was Secondary/Unknown
versus Unconfigured after cutting the power to both nodes. Is there any
timeout to tune to improve this?

Here are some details of our installation:

SuSE 10.3, kernel 2.6.22.5-31-default
heartbeat-2.1.3
drbd-8.2.6

ha1:/etc # more drbd.conf
global {
    usage-count yes;
}
common {
    syncer { rate 10M; }
}
resource drbd1 {
    protocol C;
    disk { on-io-error detach; }
    syncer {
        rate 10M;
        al-extents 257;
    }
    on ha1 {
        device    /dev/drbd1;
        disk      /dev/sdb1;
        address   192.168.50.151:7789;
        meta-disk internal;
    }
    on ha2 {
        device    /dev/drbd1;
        disk      /dev/sda7;
        address   192.168.50.152:7789;
        meta-disk internal;
    }
}

ha1:/etc # more ha.d/ha.cf
use_logd yes
# eth0 192.168.50 is the private heartbeat lan
bcast eth0
# eth1 is the private DRBD replication lan (192.168.51)
bcast eth1
keepalive 1
deadtime 10
initdead 30
node ha1 ha2
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
ping 172.16.1.1 172.16.1.254 172.16.1.245

Thank you for your time,
Doro
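P.S. Since unplugging the replication link on an active/passive pair can
produce a split brain, we are considering adding automatic split-brain
recovery policies to the net section of drbd.conf, roughly along these
lines (after-sb-* option names as described in the drbd 8 documentation;
not yet tested on our cluster):

  resource drbd1 {
    ...
    net {
      # automatic resolution when, after split brain, zero / one / both
      # nodes had been Primary
      after-sb-0pri discard-zero-changes;
      after-sb-1pri discard-secondary;
      after-sb-2pri disconnect;
    }
  }

Would such policies be the recommended way to avoid the manual drbdadm
recovery steps above, or is dopd the better answer here?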