Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi guys,

I've got the following questions but couldn't find any answers so far.

Our DRBD cluster works fine so far: failover to the slave node works
properly, and all the test scenarios we have tried so far were positive.
But! (you probably knew it already ;) There is one open test case I can't
solve:

=> "What should happen if your slave node fails?" <=

I mean the cluster is kind of "degraded" (like in a RAID array) because
there is no node left to fail over to, right? If the primary node fails
now, the cluster is dead. From my point of view, the administrator should
be notified that the cluster is "degraded" and needs to be rectified.

Question 1: is my last statement true?
Question 2: how can this be achieved? Can I use the notify script for a
specific handler?

The next problem is that the failing slave node behaves quite strangely.
The test case is the following: I just pull the cable from the NIC of the
slave node, and after ~2 min I plug it back in. The result is the
following (I use the syntax: primary node A / slave node B):

* A primary/Connected UpToDate | B secondary/Connected UpToDate
* pull the cable from the NIC of B
* A primary/WFConnection Unknown | B primary/WFConnection Unknown
  B has no more network connection, the log says "we are dead", but the
  node becomes primary???
* ~2 minute break
* cable back in on B
* A primary/Connected UpToDate | B primary/Connected UpToDate
  now there are 2 primary nodes; the unresolved split brain on node A
  reboots A
* split brain on A, reboot of A
* B secondary/WFConnection Unknown
  the slave node goes into secondary state!!!!
* after 2 minutes B becomes primary/WFConnection Unknown
* after A is back up, it becomes secondary:
  A secondary/Connected UpToDate | B primary/Connected UpToDate

=> So the cluster is down for about 2 minutes just because the cleaning
personnel accidentally pulled out the network cable and then plugged it
back in (who wouldn't).

Question 3: why does B become primary if its network is down?
Question 4: why is the split-brain situation unresolved on A, and why does
A get rebooted?
Question 5: why does B become secondary while A is rebooting? If B stayed
primary, the cluster would not be down.
Question 6: why does it take 2 minutes for B to take over as primary?

I doubt my configuration is right here! Sorry, I know I have lots of
questions; hopefully some of you guys are patient enough to help me out
here! Thanks a lot in advance!

zagi

configuration:

drbd
[
resource drbd0 {
  protocol C;

  handlers {
    split-brain       "/usr/lib/drbd/notify-split-brain.sh root";
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "/etc/init.d/network stop ; logger SplitBrain problem detected ; init 6";
    local-io-error    "echo o > /proc/sysrq-trigger ; halt -f";
  }

  disk {
    on-io-error detach;
  }

  net {
    after-sb-0pri discard-older-primary;
    after-sb-1pri discard-secondary;
    after-sb-2pri call-pri-lost-after-sb;
    rr-conflict   disconnect;
    max-buffers   2048;
    ko-count      4;
  }

  syncer {
    rate 10M;
    al-extents 257;
  }

  startup {
    wfc-timeout 0;
    degr-wfc-timeout 120;
  }
}
]

ha.cf
[
logfacility daemon
keepalive 2
deadtime 30
warntime 5
#initdead 120
initdead 60
udpport 694
ping 10.0.1.11
bcast eth0
auto_failback off
node ha2
node ha1
debug 3
respawn hacluster /usr/lib/heartbeat/ipfail
use_logd yes
logfile /var/log/hb.log
debugfile /var/log/heartbeat-debug.log
]

--
View this message in context: http://old.nabble.com/failing-slave-node---WFConnection-PRIMARY-unkown-tp30670274p30670274.html
Sent from the DRBD - User mailing list archive at Nabble.com.
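
A note on Question 2 above: as far as I know, DRBD 8.x has no handler that
fires simply because the peer went away (the handlers configured here react
to split brain, I/O errors and the like), so a common workaround is to poll
/proc/drbd periodically and mail the administrator when any resource is no
longer Connected. Below is a minimal sketch, assuming a cron job and a
working local mail(1) setup; the script name, path and MAILTO value are
placeholders, not part of DRBD.

#!/bin/sh
# check-drbd-degraded.sh -- hypothetical watchdog, run from cron, e.g.:
#   */5 * * * * root /usr/local/sbin/check-drbd-degraded.sh
# Mails the admin when any DRBD resource is not Connected, i.e. the peer
# is unreachable and the cluster is running degraded.

MAILTO=root                        # placeholder recipient

[ -r /proc/drbd ] || exit 0        # DRBD module not loaded, nothing to check

STATUS=$(grep ' cs:' /proc/drbd)   # one line per resource, e.g. "0: cs:WFConnection ..."
[ -n "$STATUS" ] || exit 0         # no resources configured

# If every resource reports cs:Connected we are done; otherwise send a report.
echo "$STATUS" | grep -qv 'cs:Connected' || exit 0

{
  echo "DRBD on $(hostname) is degraded (peer not connected):"
  echo
  cat /proc/drbd
} | mail -s "DRBD degraded on $(hostname)" "$MAILTO"

Run from cron on both nodes, this kind of check would also flag the scenario
described above, where B sits in WFConnection while A keeps running as
primary.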