[DRBD-user] failing slave node - WFConnection PRIMARY/unknown

TrustRanger rs.mehlbox at gmail.com
Mon Jan 17 15:44:25 CET 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi zagiatakrapovic,

Do you use heartbeat? You haven't said a word about it in your
description, but you've posted the heartbeat configuration files.

If so, which version do you use: the old heartbeat v1, or heartbeat v2
(pacemaker)?

I'm not entirely sure, but I've heard that the case you describe (the
secondary/slave node dying) can be handled fairly easily with
pacemaker/heartbeat v2, as sketched below.
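
Something along the following lines would be a starting point with
pacemaker. This is only a minimal sketch from memory, not a tested
configuration: the primitive/ms names and the monitor intervals are
placeholders, and it assumes the ocf:linbit:drbd agent that ships with
current DRBD packages:

  primitive res_drbd0 ocf:linbit:drbd \
          params drbd_resource="drbd0" \
          op monitor interval="15s" role="Master" \
          op monitor interval="30s" role="Slave"
  ms ms_drbd0 res_drbd0 \
          meta master-max="1" master-node-max="1" \
          clone-max="2" clone-node-max="1" notify="true"

With monitor operations on both roles, the cluster itself notices a dead
slave, and you could hang something like the ocf:heartbeat:MailTo agent
off it to notify the admin, which would also cover your questions 1 and 2.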

Kind regards,
TrustRanger


zagiatakrapovic wrote:
> 
> Hi guys,
> 
> I've got the following questions but couldn't find any answers so far.
> Our DRBD cluster works fine: failover to the slave node works properly,
> and the test scenarios we have tried so far have all been positive.
> But! (you probably knew it already ;)
> There is one open test case I can't solve.
> 
> => "What should happen if your slave node fails?" <=
> 
> I mean the cluster is kind of "degraded" (like in a RAID array) because
> there is no node left to fail over to, right? If the primary node fails
> now, the cluster is dead.
> From my point of view, the administrator should be notified that the
> cluster is "degraded" and needs to be rectified.
> Question 1: is my last statement true?
> Question 2: how can this be achieved? Can I use the notify script for a
> specific handler?
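
Regarding question 2: as far as I know, DRBD only fires handlers on
events such as split-brain or I/O errors, not when the peer simply
disappears, so the usual approach is to watch /proc/drbd from the
outside. A rough sketch to run from cron, where the mail recipient and
subject are placeholders:

  #!/bin/sh
  # Alert when any DRBD resource has left the Connected state.
  # /proc/drbd contains one "cs:<state>" status line per resource.
  if grep 'cs:' /proc/drbd | grep -vq 'cs:Connected'; then
      mail -s "DRBD degraded on $(hostname)" root < /proc/drbd
  fi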
> 
> The next problem is that the failing slave node behaves quite strangely.
> The test case is this: I pull the cable from the NIC of the slave node,
> and after ~2 min I plug it back in.
> The result is the following (syntax: Primary Node A | Slave Node B):
> 
> * A primary/Connected UpToDate | B secondary/Connected UpToDate
> * pull the cable from B's NIC
> * A primary/WFConnection Unknown | B primary/WFConnection Unknown
>   B has no network connection left, the log says "we are dead", but the
>   node becomes primary???
> * ~2 minute break
> * plug the cable back into B
> * A primary/Connected UpToDate | B primary/Connected UpToDate
>   now there are 2 primary nodes; the unresolved split-brain on node A
>   reboots A
> * split-brain detected on A, A reboots
> * B secondary/WFConnection Unknown
>   the slave node goes to secondary state!!!!
> * after 2 minutes B becomes primary/WFConnection Unknown
> * after A is back up, it becomes secondary:
>   A secondary/Connected UpToDate | B primary/Connected UpToDate
> 
> => so the cluster is down for about 2 minutes just because the cleaning
> personnel accidentally unplugged the network cable and then plugged it
> back in (who wouldn't)
> 
> Question 3: why does B become primary when its network is down?
> Question 4: why is the split-brain situation unresolved on A, and why
> does A get rebooted?
> Question 5: why does B go secondary while A is rebooting??? If B stayed
> primary, the cluster would not be down.
> Question 6: why does it take 2 minutes for B to take over as primary?
> 
> I suspect my configuration is wrong here!
> Sorry, I know I have lots of questions; hopefully some of you are
> patient enough to help me out here!
> Thanks a lot in advance!
> 
> zagi
> 
> configuration:
> drbd
> [
> resource drbd0 {
>   protocol C;
> 
>   handlers {
>     split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>     pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
>     pri-lost-after-sb "/etc/init.d/network stop ; logger SplitBrain problem detected ; init 6";
>     local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>   }
> 
>   disk {
>     on-io-error detach;
>   }
> 
>   net {
>     after-sb-0pri discard-older-primary;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri call-pri-lost-after-sb;
>     rr-conflict disconnect;
>     max-buffers 2048;
>     ko-count 4;
>   }
> 
>   syncer {
>     rate 10M;
>     al-extents 257;
>   }
> 
>   startup {
>     wfc-timeout 0;
>     degr-wfc-timeout 120;
>   }
> }
> ]
> 
> ha.cf
> [
> 
> logfacility daemon
> keepalive 2
> deadtime 30
> warntime 5
> #initdead 120
> initdead 60 
> udpport 694
> ping 10.0.1.11
> bcast eth0
> auto_failback off
> node ha2
> node ha1
> debug 3
> respawn hacluster /usr/lib/heartbeat/ipfail
> use_logd yes
> logfile /var/log/hb.log
> debugfile /var/log/heartbeat-debug.log
> ]
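
One more thing about questions 3 and 5: I don't see any fencing in your
drbd config. With heartbeat, the usual companion is dopd, the DRBD
outdate-peer daemon: when the replication link drops, the reachable node
outdates the peer's disk, and a node whose fence-peer handler cannot
succeed refuses to promote itself on possibly stale data. Roughly, from
memory of the DRBD user guide (double-check the paths against your
version):

  # drbd.conf additions
  resource drbd0 {
    disk {
      fencing resource-only;
    }
    handlers {
      fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    }
  }

  # ha.cf additions
  apiauth dopd gid=haclient uid=hacluster
  respawn hacluster /usr/lib/heartbeat/dopd

That alone won't explain all of the timing you saw, but it should keep a
disconnected node from grabbing the primary role in the first place.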
> 
> 
