[DRBD-user] Unresponsive Primary node

Tue Jun 30 14:27:40 CEST 2009

Hi Adam,

I have not heard of anything like this so far. I would say it is a
network-related issue. Since heartbeat triggers failover only when you shut
down your primary, the node can evidently still communicate using ICMP. Can
you see warnings that eth0 is inaccessible? Can you connect to the blocked
primary using ssh via eth1 from the standby? Are your processes accessing
DRBD resources blocked? Is there a prompt for shutdown but nothing related
in the log files? Maybe the answers will help you identify something.
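
The ssh question above can be checked with a small script run from the
standby. This is only a minimal sketch, assuming Python is available there;
the function names and the addresses in the trailing comment are
placeholders, not values from the actual cluster. It tests nothing more than
whether sshd accepts a TCP connection on each interface, which may help
separate a dead eth0 path from a host that is hung as a whole.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_ssh(addresses, port=22):
    """Map each label to whether a TCP connect to the ssh port succeeds."""
    return {label: tcp_port_open(addr, port) for label, addr in addresses.items()}


# Example use (placeholder addresses, substitute the primary's real ones):
#   check_ssh({"eth0": "192.168.0.10", "eth1": "10.1.1.10"})
```

If the port answers on eth1 but not on eth0, that would point at the LAN
side; if neither answers while ping still works, the host itself is the more
likely suspect.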

Try looking at the linux-ha list for issues with running two heartbeat
instances on the same system. If anything is likely to help, I would suggest
updating to at least:
DRBD 8.2.7 (although the latest is always advised)
heartbeat 2.1.4
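
As a quick way to compare installed packages against these minimums, here is
a small sketch. The helper names are made up for illustration, and it
assumes version strings of the form "8.2.5-3", where the part after the dash
is the package release and is ignored. The installed versions are the ones
given in the original report.

```python
def parse_version(text):
    """Turn '8.2.5-3' or '2.1.3' into a tuple of ints, dropping the release suffix."""
    upstream = text.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def needs_update(installed, minimum):
    """True if the installed upstream version is older than the minimum."""
    return parse_version(installed) < parse_version(minimum)


# Minimums suggested in this mail; installed versions from the original report.
MINIMUMS = {"drbd": "8.2.7", "heartbeat": "2.1.4"}
INSTALLED = {"drbd": "8.2.5-3", "heartbeat": "2.1.3"}

for name, minimum in MINIMUMS.items():
    status = "update suggested" if needs_update(INSTALLED[name], minimum) else "ok"
    print(f"{name}: installed {INSTALLED[name]}, minimum {minimum}: {status}")
```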

With regards,

Tino

2009/6/25 Adam Taylor <adam.taylor at wml.co.nz>

>  Hi There,
>
> I am currently running a HA environment that consists of the following:
>
> -2x Red Hat Enterprise Linux  5.1 ES servers
> -Both running drbd-8.2.5-3
> -Both running heartbeat-2.1.3
>
> -DRBD's replication link is over its own private network eth1 10.1.1.X
> connected using a 1 Gbps switch.
> -Heartbeat is running over the LAN on eth0 192.168.0.XXX unicast
> - There are two separate HA clusters sharing the same replication switch;
> as you can see in the logs they are set up to unicast on different ports,
> so I would assume this should be fine (maybe they should be on
> separate switches or even VLAN'd?).
>
> These are both production servers that serve mysql, coldfusion and httpd.
> I am running into a strange problem where at around 2am most mornings the
> primary server becomes somewhat unresponsive.  By "somewhat" I mean the
> following:
>
> - Can still ping the primary node
> - Cluster IP address is still up
> - a cat of /proc/drbd shows the primary and secondary as being in their
> respective roles (Not failed over).
>
> The problem we are facing is that for some strange reason the primary can
> no longer be accessed remotely via ssh (or even VNC).  At the physical
> server the console is completely unresponsive, as are both keyboard and
> mouse, which forces a physical shutdown of the server.  When the server is
> shut down the secondary assumes the primary role correctly, and once the
> primary is brought back online it rejoins the cluster and assumes its
> respective role correctly.
>
>
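
The /proc/drbd check described in the report above could be automated along
these lines. This is a rough sketch only: the sample text is illustrative
rather than captured from the affected servers, the function names are
invented here, and the role field is matched as either "st:" or "ro:" on the
assumption that the label differs between DRBD releases. Compare it against
the real output before relying on it.

```python
import re

# Illustrative sample, not taken from the affected cluster.
SAMPLE = """version: 8.2.5 (api:88/proto:86-88)
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
"""

# One match per device status line: minor number, connection state, roles.
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+(?:st|ro):(?P<local>\w+)/(?P<peer>\w+)",
    re.MULTILINE,
)


def parse_proc_drbd(text):
    """Return one dict per DRBD device found in /proc/drbd-style text."""
    return [
        {
            "minor": int(m.group("minor")),
            "cs": m.group("cs"),
            "local": m.group("local"),
            "peer": m.group("peer"),
        }
        for m in STATUS_RE.finditer(text)
    ]


def role_problems(text, expected_local="Primary"):
    """Return warnings; an empty list means connection and role look as expected."""
    warnings = []
    for dev in parse_proc_drbd(text):
        if dev["cs"] != "Connected":
            warnings.append(f"minor {dev['minor']}: connection state is {dev['cs']}")
        if dev["local"] != expected_local:
            warnings.append(f"minor {dev['minor']}: local role is {dev['local']}")
    return warnings
```

On a node, the text would come from reading /proc/drbd, for example
role_problems(open("/proc/drbd").read()). Logging the result every few
minutes might show whether anything changes shortly before the 2am hang.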