Hello Florian,

thank you for responding.

2008/6/16 Florian Haas <florian.haas at linbit.com>:
> Doro,
>
>> Hello,
>>
>> We have set up our first HA 2-node cluster and are currently running
>> soak tests (private DRBD replication link, private heartbeat link and
>> 2nd heartbeat via DRBD replication link, no dopd, active/passive). We
>> have experienced
>> several times that after a fail-over test, e.g. unplugging the DRBD
>> replication link,
>
> Unplugging the DRBD replication link is not a failover test. What failover
> tests _did_ you run?

Yes, we realise that it is not a failover test, but we want to test
recovery modes before we go into production with the configuration.

Well, we went through a whole set: powering down the primary, unplugging
the heartbeat links (which in our original configuration included the DRBD
link), unplugging the public network to test ipfail, and killing the
heartbeat daemons. As part of that we also unplugged just the DRBD
replication link, because we wanted to study the effect it has.

Going over our notes, I think we got it into that state when we pulled
both heartbeat links (when heartbeat was going over the DRBD link and its
own link). Ah, this of course means that DRBD probably first detected the
loss of connection and then HA failed over, which dopd would normally
prevent, but we haven't included dopd yet, see below. When we just
disconnect the DRBD link, both sides synchronise fine.

>
>> DRBD is in a state of Primary/Unknown, Secondary/Unknown and
>> will not synchronize. We have to issue several drbdadm commands
>> like
>> ha1# drbdadm secondary drbd1
>> ha2# drbdadm detach drbd1
>> ha2# drbdadm -- --discard-my-data connect drbd1
>>
>> which sometimes got it back. Our question is whether this is normal
>> behaviour of DRBD
>> to end up in such a state in the first place
>
> No, it is a consequence of misconfigured cluster communication link and/or
> missing outdate-peer functionality. See
> http://www.drbd.org/users-guide/s-outdate.html for some background on
> this.

We'll check that link out. Misconfigured cluster communication link? Do
you mean heartbeat or DRBD? At first we ran heartbeat over its own private
point-to-point link _and_ the DRBD point-to-point link. Now we run
heartbeat over its own link and the public network link. Is that the
recommended configuration?

>
>> and what is the recommended recovery?
>
> See http://www.drbd.org/users-guide/s-resolve-split-brain.html
> So
>> Is there any timeout to tune to improve this?
>
> At this point you don't need to look into timeouts. It may be wise to take
> a peek at http://www.drbd.org/users-guide/s-heartbeat-dopd.html, however.

We looked into that right at the beginning, but when we tried it, it made
matters worse: the log files reported "refusing to be primary while peer
is not outdated".

>
> Cheers,
> Florian
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>

Doro
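
For reference, here is a rough sketch of the dopd setup that the s-heartbeat-dopd page linked above describes. It assumes DRBD 8.0/8.2-era syntax and a Heartbeat install under /usr/lib/heartbeat (the path differs on some distributions, e.g. /usr/lib64/heartbeat). The resource name drbd1 comes from the commands quoted above; the rest is an assumption that should be checked against the installed versions, not our actual configuration.

```
# /etc/ha.d/ha.cf (both nodes): let Heartbeat start and authorise dopd
# (assumed install path; adjust to the local Heartbeat layout)
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd gid=haclient uid=hacluster

# /etc/drbd.conf (both nodes): ask DRBD to outdate the peer via dopd
resource drbd1 {
  handlers {
    # assumed path of the helper shipped with Heartbeat
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
  disk {
    # refuse promotion unless the peer could be marked outdated
    fencing resource-only;
  }
  # ... existing resource settings unchanged ...
}
```

With a fencing policy like this, a node whose peer cannot be reached over any remaining heartbeat path cannot confirm that the peer is outdated. That may be what lies behind the "refusing to be primary while peer is not outdated" message, so the setup depends on at least one heartbeat link that is independent of the DRBD replication link.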