<div>Hi Adam,</div>
<div> </div>
<div>I have not heard of anything like this so far. I would say it is a network-related issue. Since heartbeat triggers failover only when you shut down your primary, the primary can evidently still communicate, at least via ICMP. Can you see warnings that eth0 is inaccessible? Can you connect to the blocked primary using ssh via eth1 from the standby? Are your processes that access DRBD resources blocked? Is there a prompt for shutdown but nothing related in the log files? Maybe you will identify something.</div>
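<div>For the ssh check from the standby, here is a rough sketch, assuming Python is available there. The function name probe_ssh and the address 10.1.1.1 are placeholders I made up, not values from your setup. The idea is that the kernel answers ping and completes the TCP handshake by itself, while the SSH banner has to come from the sshd process, so a connection that opens but stays silent would hint at hung userland rather than a dead network.</div>
<pre>
```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Probe an SSH port and report how far the connection gets.

    Returns a (status, banner) tuple where status is one of:
      "unreachable" - the TCP connection could not be opened
      "no_banner"   - TCP connected, but no identification string arrived
      "ok"          - the server sent its identification string
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, or no route to the host.
        return ("unreachable", None)

    with sock:
        try:
            # An SSH server sends its identification string first.
            data = sock.recv(256)
        except OSError:
            # The connection opened but nothing arrived in time.
            return ("no_banner", None)

    if not data:
        # The peer closed the connection without sending anything.
        return ("no_banner", None)
    return ("ok", data.decode("ascii", errors="replace").strip())
```
</pre>
<div>Run it from the standby as probe_ssh("10.1.1.1"), replacing the address with the primary's real eth1 address. If it reports "no_banner" while ping still answers, I would look at the primary itself (blocked I/O, load) before blaming the switch.</div>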
<div> </div>
<div>Try searching the linux-ha list for issues with running two heartbeat instances on the same system. In case it helps, I would suggest updating to at least:</div>
<div>DRBD 8.2.7 (although the latest is always advised)</div>
<div>heartbeat 2.1.4<br></div>
<div>With regards,</div>
<div> </div>
<div>Tino</div>
<div> </div>
<div class="gmail_quote">2009/6/25 Adam Taylor <span dir="ltr">&lt;<a href="mailto:adam.taylor@wml.co.nz">adam.taylor@wml.co.nz</a>&gt;</span><br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div>
<div><font face="Arial" size="2"><span>Hi There,</span></font></div>
<div><font face="Arial" size="2"><span></span></font> </div>
<div><font face="Arial" size="2"><span>I am currently running a HA environment that consists of the following:</span></font></div>
<div><font face="Arial" size="2"><span></span></font> </div>
<div><font face="Arial" size="2"><span>-2x Red Hat Enterprise Linux 5.1 ES servers</span></font></div>
<div><font face="Arial" size="2"><span>-Both running drbd-8.2.5-3<br>-Both running heartbeat-2.1.3 </span></font></div>
<div><font face="Arial" size="2"><span></span></font> </div>
<div><font face="Arial" size="2"><span>-DRBD's replication link is over its own private network, eth1 10.1.1.X, connected using a 1 Gbps switch.</span></font></div>
<div><font face="Arial" size="2"><span>-Heartbeat is running over the LAN on eth0 192.168.0.XXX, unicast</span></font></div>
<div><font face="Arial" size="2"><span>- There are two separate HA clusters sharing the same replication switch. As you can see in the logs, they are set up to unicast on different ports, so I would assume this should be fine (maybe they should be on separate switches or even VLAN'd?) </span></font></div>
<div><font face="Arial" size="2"></font> </div>
<div><span><font face="Arial" size="2">These are both production servers that serve MySQL, ColdFusion and httpd. I am running into a strange problem where, at around 2am most mornings, the primary server becomes somewhat unresponsive. By "somewhat" I mean the following:</font></span></div>
<div><span><font face="Arial" size="2"></font></span> </div>
<div><span><font face="Arial" size="2">- Can still ping the primary node</font></span></div>
<div><span><font face="Arial" size="2">- Cluster IP address is still up</font></span></div>
<div><span><font face="Arial" size="2">- A cat of /proc/drbd shows the primary and secondary in their respective roles (not failed over).</font></span></div>
<div><span><font face="Arial" size="2"></font></span> </div>
<div><span><font face="Arial" size="2">The problem we are facing is that, for some strange reason, the primary can no longer be accessed remotely via ssh (or even VNC). At the physical server the console is completely unresponsive, with both keyboard and mouse unresponsive, which forces a physical shutdown of the server. When the server is shut down, the secondary assumes the primary role correctly, and once the original primary is brought back online it rejoins the cluster and assumes its respective role correctly. </font></span></div>
<div><span><font face="Arial" size="2"></font></span> </div></div></blockquote></div>