[DRBD-user] If eth0 is down, the takeover procedure doesn't work

Tue Jun 19 09:07:04 CEST 2007

Hi everyone,
I'm running a two-node cluster with DRBD and Heartbeat.
I've done a lot of tests and they all worked properly, but I can't
understand why the simplest of them doesn't work.

The problem is that when I unplug the cable from interface eth0
(public) on the primary, the secondary for some reason can't start the
takeover procedure. First of all, here are my ha.cf and drbd.conf
files:
ha.cf

logfile /var/log/ha-log
keepalive 2
deadtime 60
initdead 120
bcast eth1
bcast eth0
serial /dev/ttyS0
auto_failback on
node host1 host2
respawn root /etc/init.d/apache2
respawn root /etc/init.d/postgresql
respawn root /usr/lib/heartbeat/ipfail

ping (#ip addr gateway)
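
To make the redundancy in this ha.cf explicit, here is a minimal Python
sketch (not part of heartbeat; the function names are made up for
illustration) that lists the heartbeat communication paths declared above
and shows which of them should remain once eth0 is unplugged. It only
reads the bcast and serial directives and says nothing about what ipfail
decides.

```python
# Relevant directives copied from the ha.cf above.
HA_CF = """\
keepalive 2
deadtime 60
initdead 120
bcast eth1
bcast eth0
serial /dev/ttyS0
"""

def heartbeat_media(ha_cf_text):
    """Return (directive, device) pairs for every heartbeat path in ha.cf."""
    media = []
    for line in ha_cf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) >= 2 and fields[0] in ("bcast", "serial"):
            for device in fields[1:]:  # a bcast line may name several interfaces
                media.append((fields[0], device))
    return media

def surviving_media(media, failed_device):
    """Paths that are still usable after one device fails."""
    return [m for m in media if m[1] != failed_device]

print(surviving_media(heartbeat_media(HA_CF), "eth0"))
# [('bcast', 'eth1'), ('serial', '/dev/ttyS0')]
```

With eth0 gone, two paths (eth1 and the serial line) are still listed, so
going by the configuration alone the nodes should still be able to hear
each other.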

drbd.conf

resource r0 {
  protocol C;
  net {
    timeout 15;
  }
  syncer {
    group 0;
    rate 5M;
  }
  on host1 {
    device /dev/drbd0;
    disk /dev/hda5;
    address 192.168.0.1:7788;
    meta-disk internal;
  }
  on host2 {
    device /dev/drbd0;
    disk /dev/hda5;
    address 192.168.0.2:7788;
    meta-disk internal;
  }
}

resource r1 {
  protocol C;
  net {
    timeout 15;
  }
  syncer {
    group 1;
    rate 5M;
  }
  on host1 {
    device /dev/drbd1;
    disk /dev/hdc1;
    address 192.168.0.1:7789;
    meta-disk internal;
  }
  on host2 {
    device /dev/drbd1;
    disk /dev/hda6;
    address 192.168.0.2:7789;
    meta-disk internal;
  }
}

The facts are:
- If I unplug both the serial cable and eth1, and then stop heartbeat
on the primary, the takeover works correctly.

- If I unplug eth0, both nodes go into a "standby" state: neither of
them is working. Analyzing the log files (I'm not including them, they
are too long), I see that:
1) On the primary, the takeover procedure has started for the
secondary, but I get a warning that the second node is down (while it
is up!!) and the primary keeps holding all the resources (for example
the IP address).
2) On the secondary I get the message:
"Both nodes own our resources"

In the end, if I plug eth0 back in, the strange thing is that the
takeover takes effect (all resources go to the secondary) and then,
because of "auto_failback on", they all come back to the primary...

I also thought it might be a timeout problem, but then it shouldn't
work when I stop heartbeat manually either (I think), and yet it does
work in that case.
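
For reference, the timing implied by the ha.cf values above can be worked
out with a small Python sketch. It assumes the usual meaning of the
directives (keepalive is the interval between heartbeats, deadtime is the
silence after which a peer is declared dead, initdead replaces deadtime
just after startup); the helper names are made up for illustration.

```python
# Timing values copied from the ha.cf above.
keepalive = 2    # seconds between heartbeat packets
deadtime = 60    # seconds of silence before the peer is declared dead
initdead = 120   # like deadtime, but applied right after heartbeat starts

def missed_heartbeats(deadtime_s, keepalive_s):
    """Consecutive heartbeats that must be lost before a node is declared dead."""
    return deadtime_s // keepalive_s

def earliest_takeover(failure_at_s, deadtime_s):
    """Earliest time (s) at which a takeover after an abrupt failure can begin."""
    return failure_at_s + deadtime_s

print(missed_heartbeats(deadtime, keepalive))   # 30
print(earliest_takeover(0, deadtime))           # 60
print(initdead >= 2 * deadtime)                 # True
```

So a peer would only be declared dead after a full minute of silence on
every path, which is worth keeping in mind when reading timestamps in the
logs.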

I think I've configured everything properly (ipfail in particular).
What's wrong, then? Should I avoid sending broadcasts on eth0?
Thanks in advance to anyone who can help me understand...