[DRBD-user] Primary/Unknown, Secondary/Unknown state normal behaviour?

Tue Jun 17 09:23:28 CEST 2008

Hello Florian,

thank you for responding.

2008/6/16 Florian Haas <florian.haas at linbit.com>:
> Doro,
>
>> Hello,
>>
>> We have set up our first HA 2-node cluster and are currently running
>> soak tests (private DRBD replication link, private heartbeat link and
>> 2nd heartbeat via DRBD replication link, no dopd, active/passive). We
>> have experienced several times that after a fail-over test, e.g.
>> unplugging the DRBD replication link,
>
> Unplugging the DRBD replication link is not a failover test. What failover
> tests _did_ you run?

Yes, we realise that it is not a failover test, but we want to test recovery
modes before we go into production with the configuration.

We went through a whole set: powering down the primary, unplugging
the heartbeat links (which in our original configuration included the DRBD
link), unplugging the public network to test ipfail, and killing the
heartbeat daemons. As part of that we also unplugged just the DRBD
replication link, because we wanted to study the effect it has.

Going over our notes, I think we got it into that state when we pulled both
heartbeat links (at that time heartbeat ran over the DRBD link and over its
own link). Ah, this of course means that DRBD probably detected the loss of
connection first and then HA failed over, which dopd would normally prevent,
but we haven't included dopd yet, see below.

When we disconnect just the DRBD link, both sides synchronise fine.
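
For anyone reproducing this, here is a minimal sketch of how the
Primary/Unknown condition could be spotted from /proc/drbd. It assumes
the cs:/st: field layout of DRBD 8.0/8.2 (later releases print ro:
instead of st:). The function name unknown_peers and the sample input
are made up for illustration and were not captured from our cluster.

```shell
#!/bin/sh
# Sketch: list DRBD devices whose peer role is Unknown.
# Reads /proc/drbd-style text on stdin. Field names are an assumption
# about the DRBD version in use (st: in 8.0/8.2, ro: in later releases).

unknown_peers() {
    # Print "<minor> <connection state> <roles>" for each affected device.
    awk '
        /^ *[0-9]+: *cs:/ {
            minor = $1; sub(/:$/, "", minor)
            cs = ""; st = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^(st|ro):/) st = substr($i, 4)
            }
            if (st ~ /\/Unknown$/) print minor, cs, st
        }
    '
}

# Illustrative input only (hypothetical, two devices).
sample=' 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
 1: cs:StandAlone st:Primary/Unknown ds:UpToDate/DUnknown   r---'

printf '%s\n' "$sample" | unknown_peers
```

On a live node the same function could be fed with
"unknown_peers < /proc/drbd".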

>
>> DRBD is in a state of Primary/Unknown, Secondary/Unknown and
>> will not synchronize. We have to issue several drbdadm commands
>> like
>> ha1# drbdadm secondary drbd1
>> ha2# drbdadm detach drbd1
>> ha2# drbdadm -- --discard-my-data connect drbd1
>>
>> which sometimes got it back. Our question is whether it is normal
>> behaviour for DRBD to end up in such a state in the first place
>
> No, it is a consequence of a misconfigured cluster communication link
> and/or missing outdate-peer functionality. See
> http://www.drbd.org/users-guide/s-outdate.html for some background on
> this.

We'll check that link out. Misconfigured cluster communication link? Do you
mean heartbeat or DRBD? At first we ran heartbeat over its own private
point-to-point link _and_ the DRBD point-to-point link. Now we run heartbeat
over its own link and the public network link. Is that the recommended
configuration?

>
>> and what is the recommended recovery?
>
> See http://www.drbd.org/users-guide/s-resolve-split-brain.html
>
So
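
A minimal sketch of the manual split-brain recovery as the DRBD 8
documentation describes it, for the resource drbd1 from this thread.
Which node plays the "victim" (its changes are discarded) and which the
"survivor" is a decision for the administrator, and the function names
are hypothetical. The script only prints the commands, it does not run
them.

```shell
#!/bin/sh
# Sketch: print the manual split-brain recovery steps for one resource.
# Nothing is executed against DRBD; the commands are only echoed so they
# can be reviewed before being typed on the right node.

RES="${RES:-drbd1}"   # resource name taken from the thread

# Steps for the node whose changes will be thrown away (the "victim").
victim_steps() {
    echo "drbdadm secondary $RES"
    echo "drbdadm -- --discard-my-data connect $RES"
}

# Step for the node whose data is kept (the "survivor"); only needed if
# that node has also dropped to StandAlone.
survivor_steps() {
    echo "drbdadm connect $RES"
}

echo "# on the victim:"
victim_steps
echo "# on the survivor, if it is StandAlone:"
survivor_steps
```

Keeping the two roles in separate functions makes it harder to run the
discarding command on the wrong node by accident.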

>> Is there any timeout to tune to improve this?
>
> At this point you don't need to look into timeouts. It may be wise to take
> a peek at http://www.drbd.org/users-guide/s-heartbeat-dopd.html, however.

We looked into that right at the beginning, but when we tried it, it made
matters worse: the log files reported "refusing to be primary while peer is
not outdated".
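
A small sketch for checking whether a drbd.conf carries the two
settings the dopd setup relies on: a fencing policy in the disk section
and an outdate-peer handler. The directive names follow the DRBD 8.0/8.2
documentation, the handler path in the sample is the usual location and
may differ per distribution, and check_dopd_settings is a made-up helper
name. It does not diagnose the log message above; it only rules out a
missing directive.

```shell
#!/bin/sh
# Sketch: check drbd.conf-style text on stdin for the settings used with
# dopd. Directive names are an assumption about the DRBD version in use.

check_dopd_settings() {
    conf=$(cat)
    missing=""
    # A fencing policy such as "fencing resource-only;" in the disk section.
    printf '%s\n' "$conf" |
        grep -Eq '^[[:space:]]*fencing[[:space:]]+resource-(only|and-stonith)[[:space:]]*;' ||
        missing="$missing fencing"
    # An outdate-peer handler, normally pointing at drbd-peer-outdater.
    printf '%s\n' "$conf" |
        grep -Eq '^[[:space:]]*outdate-peer[[:space:]]+"[^"]+"[[:space:]]*;' ||
        missing="$missing outdate-peer"
    if [ -z "$missing" ]; then
        echo "ok"
    else
        echo "missing:$missing"
    fi
}

# Illustrative configuration fragment (hypothetical, not our real file).
check_dopd_settings <<'EOF'
resource drbd1 {
  handlers {
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
  }
  disk {
    fencing resource-only;
  }
}
EOF
```

Against a real file this would be run as
"check_dopd_settings < /etc/drbd.conf".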

>
> Cheers,
> Florian

Doro