Hello,
We have set up our first HA two-node cluster and are currently running
soak tests (private DRBD replication link, private heartbeat link plus a
second heartbeat over the DRBD replication link, no dopd, active/passive).
Several times after a fail-over test, e.g. unplugging the DRBD replication
link, we have seen DRBD end up in a Primary/Unknown, Secondary/Unknown
state from which it will not resynchronize. We then have to issue several
drbdadm commands such as
ha1# drbdadm secondary drbd1
ha2# drbdadm detach drbd1
ha2# drbdadm -- --discard-my-data connect drbd1
which sometimes got it back. Our questions: is it normal for DRBD to end
up in such a state in the first place, what is the recommended recovery
procedure, and what causes this behaviour?
The worst case we actually got was Secondary/Unknown versus Unconfigured
after cutting the power to both nodes.
Is there any timeout we can tune to improve this?
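For reference, the full recovery sequence we have been trying is based on
our reading of the DRBD 8 documentation on manual split-brain recovery; we
pick ha2 as the "victim" whose changes get discarded (resource name drbd1
as in our config below):

    # On ha2, the node whose local changes we are willing to discard:
    ha2# drbdadm secondary drbd1
    ha2# drbdadm -- --discard-my-data connect drbd1

    # On ha1, the surviving node (only needed if it is in StandAlone
    # state; if it is still in WFConnection it reconnects on its own):
    ha1# drbdadm connect drbd1

Please correct us if this is not the intended procedure.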
Here are some details of our installation:
SuSE 10.3
2.6.22.5-31-default
heartbeat-2.1.3
drbd-8.2.6
ha1:/etc # more drbd.conf
global {
    usage-count yes;
}
common {
    syncer { rate 10M; }
}
resource drbd1 {
    protocol C;
    disk {
        on-io-error detach;
    }
    syncer {
        rate 10M;
        al-extents 257;
    }
    on ha1 {
        device    /dev/drbd1;
        disk      /dev/sdb1;
        address   192.168.50.151:7789;
        meta-disk internal;
    }
    on ha2 {
        device    /dev/drbd1;
        disk      /dev/sda7;
        address   192.168.50.152:7789;
        meta-disk internal;
    }
}
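We have also been wondering whether adding automatic split-brain recovery
policies to the net section of the resource would help; something like the
following, which is so far untested on our side (option names and values
taken from the drbd.conf man page for 8.x):

    net {
        # if neither node was Primary at split-brain time,
        # keep the data of the node that made changes:
        after-sb-0pri discard-zero-changes;
        # if one node was Primary, discard the Secondary's changes:
        after-sb-1pri discard-secondary;
        # if both were Primary, do not auto-resolve:
        after-sb-2pri disconnect;
    }

Would this be appropriate for an active/passive setup like ours?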
ha1:/etc # more ha.d/ha.cf
use_logd yes
# eth0 192.168.50 is private heartbeat lan
bcast eth0
# eth1 is private DRBD replication lan (192.168.51)
bcast eth1
keepalive 1
deadtime 10
initdead 30
node ha1 ha2
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
ping 172.16.1.1 172.16.1.254 172.16.1.245
Thank you for your time
Doro