[DRBD-user] failing slave node - WFConnection PRIMARY/unkown

Fri Jan 14 11:08:27 CET 2011

Hi guys,

I've got the following questions but couldn't find any answers so far.
Our DRBD cluster works fine so far, failover to the slave node works
properly.
The test scenarios we have tried so far were are all positive. But! (you
probably knew it already ;)
There is one open test case I can't solve. 

=> "What should happen if your slave node fails?" <=

I mean the cluster is kind of "degraded" (like in a RAID array) because
there is no more node which you could failover to, right? If the primary
node fails now, the cluster is dead.
>From my point of view, the administator should be notified that the cluster
is "degraded" and needs to be rectified.
Question 1: is my last statement true?
Question 2: how can this be achieved? can I use the notify script for a
specific handler?

The next problem is that the failing slave node behaves quite strange. The
test case is the following: I just pull the cable from the NIC of the slave
node. After ~ 2min I plug it back in.
The result is the following: (I use the following syntax: Primary Node A /
Slave Node B)

* A primary/connected uptodate |  B secondary/connected uptodate
* pull cable of NIC of B
* A primary/ WFConnection unkown | B primary/ WFConnection unkown
B has no more network connection, logs says "we are dead' but the node gets
primary???
* ~2 minute break
* cable back in B
*  A primary/ connected uptodate | B primary/connected uptodate
   now there are 2 primary nodes, the unresolved splitbrain on node A
reboots A
* splitbrain on A, reboot A
* B secondary/WFConnection unkown
  the slave node gets secondary state!!!!
* after 2 minutes B becomes primary/WFConnection unkown
* after A is back up, it gets secondary
 A secondary/connected uptodate |  B primary/connected uptodate

=> so the cluster is down for about 2 minutes just because "the cleaning
personal" accidentally plugged out the network cable and then it plugged it
back in (who wouldn't)

Question3: why B gets primary if its network is down?
Question4: why the splitbrain situtation is unresolved on A and A gets
rebooted?
Question5: why B gets secondary after A is in reboot stage??? If B would
stay primary the cluster would not be down
Question6: why it takes 2 minutes for B to takeover as primary?

I doubt my configuration is right here!
Sorry I know I have losts of questions hopefully some of you guys are
patient enough to help me out here!
Thanks a lot in advance!

zagi

configuration:
drbd
[
resource drbd0 {
protocol C;

handlers {
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
pri-lost-after-sb "/etc/init.d/network stop ; logger SplitBrain problem
detected ; init 6";
local-io-error "echo o > /proc/sysrq-trigger ; halt -f";

disk {
on-io-error detach;
}

net {
after-sb-0pri discard-older-primary;
after-sb-1pri discard-secondary;
after-sb-2pri call-pri-lost-after-sb;
rr-conflict disconnect;
max-buffers 2048;
ko-count 4;
}

syncer {
rate 10M;
al-extents 257;
}

startup {
wfc-timeout 0;
degr-wfc-timeout 120;
}
]

ha.cf
[

logfacility daemon
keepalive 2
deadtime 30
warntime 5
#initdead 120
initdead 60 
udpport 694
ping 10.0.1.11
bcast eth0
auto_failback off
node ha2
node ha1
debug 3
respawn hacluster /usr/lib/heartbeat/ipfail
use_logd yes
logfile /var/log/hb.log
debugfile /var/log/heartbeat-debug.log
]

-- 
View this message in context: http://old.nabble.com/failing-slave-node---WFConnection-PRIMARY-unkown-tp30670274p30670274.html
Sent from the DRBD - User mailing list archive at Nabble.com.