[DRBD-user] Unresponsive Primary node

Fri Jun 26 00:46:01 CEST 2009

Hi There,

I am currently running a HA environment that consists of the following:

-2x Red Hat Enterprise Linux  5.1 ES servers
-Both running drbd-8.2.5-3
-Both running heartbeat-2.1.3 

-DRBD's replication link is over its own private network, eth1 10.1.1.X,
connected using a 1 Gbps switch.
-Heartbeat is running over the LAN on eth0 192.168.0.XXX, unicast
- There are two separate HA clusters sharing the same replication switch. As
you can see in the logs, they are set up to unicast on different ports,
so I would assume this should be fine. (Maybe they should be on
separate switches, or even on separate VLANs?)

These are both production servers that serve mysql, coldfusion and httpd.  I
am running into a strange problem where at around 2am most mornings the
primary server becomes somewhat unresponsive.  By "somewhat" I mean the
following:

- Can still ping the primary node
- Cluster IP address is still up
- a cat of /proc/drbd shows the primary and secondary as being in their
respective roles (Not failed over).
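
For anyone who wants to record the same check automatically rather than by eye, here is a minimal Python sketch that pulls the connection state, roles and disk states out of /proc/drbd text. The sample text is illustrative only (it was not captured from these servers), and the function names are made up for this example. DRBD 8.2 labels the role field "st:", while later releases use "ro:", so the pattern accepts both.

```python
import re

# Illustrative /proc/drbd text in the DRBD 8.2 layout; NOT taken from the servers above.
SAMPLE = """\
version: 8.2.5 (api:88/proto:86-88)
 1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
"""

# Matches a device line: minor number, connection state, roles, optional disk states.
DEVICE_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+(?:st|ro):(?P<local>[^/\s]+)/(?P<peer>\S+)"
    r"(?:\s+ds:(?P<ds_local>[^/\s]+)/(?P<ds_peer>\S+))?"
)


def parse_drbd_status(text):
    """Return one dict per DRBD device line found in the given text."""
    devices = []
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            devices.append(match.groupdict())
    return devices


def is_healthy(dev):
    """True when the device is connected and both disks are UpToDate."""
    return (
        dev["cs"] == "Connected"
        and dev["ds_local"] == "UpToDate"
        and dev["ds_peer"] == "UpToDate"
    )


if __name__ == "__main__":
    for dev in parse_drbd_status(SAMPLE):
        print(dev["minor"], dev["local"], dev["peer"], is_healthy(dev))
```

On a live node the text would come from reading /proc/drbd instead of SAMPLE.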

The problem we are facing is that, for some reason, the Primary can no
longer be accessed remotely via ssh (or even VNC). At the physical
server the console is completely unresponsive: neither keyboard nor mouse
responds, which forces a physical shutdown of the server. When the
server is shut down, the secondary assumes the primary role correctly, and
once the primary is brought back online it rejoins the cluster and assumes
its respective role correctly.

At the end of this email I will post my config files in case anyone can shed
any light on the situation.

The log files (/var/log/messages, ha-log, ha-debug) all show no indication
of what may be happening.
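
Since the standard logs are silent, one way to gather more evidence might be to append a small snapshot of system state to disk every minute (from cron, for example), so that the last entries before a hang survive the power-off. The sketch below is only an illustration: the log path and function names are hypothetical, and it assumes a Python 3 interpreter, which a stock RHEL 5.1 install does not ship.

```python
import os
import time


def snapshot_line(loadavg_text, meminfo_text, now=None):
    """Build one log line from the contents of /proc/loadavg and /proc/meminfo."""
    load1 = loadavg_text.split()[0]  # first field is the 1-minute load average
    mem = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            mem[key.strip()] = int(fields[0])  # value in kB for the memory fields
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load1=%s MemFree=%skB SwapFree=%skB" % (
        stamp, load1, mem.get("MemFree", "?"), mem.get("SwapFree", "?"))


def append_snapshot(path="/var/log/node-snapshot.log"):  # hypothetical path
    """Read the live /proc files and append one snapshot line, synced to disk."""
    with open("/proc/loadavg") as f:
        loadavg = f.read()
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open(path, "a") as out:
        out.write(snapshot_line(loadavg, meminfo) + "\n")
        out.flush()
        os.fsync(out.fileno())  # push the line to disk before a possible hang


if __name__ == "__main__":
    append_snapshot()
```

The fsync call is the point of the design: a line that only reaches the page cache would be lost when the machine is powered off.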

My DRBD.conf file -

global { usage-count yes; }
common { syncer { rate 500M; } }
resource r0 {
            protocol C;
                handlers {
                        pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
                        pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
                        local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
                        #outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
                        outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
                        pri-lost "echo pri-lost. Check the Log Files. | mail -s 'DRBD Alert' root";
                        #split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail - 'DRBD Alert' root";
                }
         startup {
                wfc-timeout 30;
                }
         disk {
                fencing resource-only;
                }
         net {
                 cram-hmac-alg sha1;
                 shared-secret "FooFunFactory";
             }
 on (FQDN_OF_PRI_NODE) {
   device       /dev/drbd1;
   disk         /dev/sda3;
   address      10.1.1.2:7789;
   meta-disk    internal;
 }
 on (FQDN_OF_SEC_NODE)  {
   device       /dev/drbd1;
   disk         /dev/sda3;
   address      10.1.1.5:7789;
   meta-disk    internal;
 }
}
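
For context when reading the "rate 500M" syncer setting above: a rule of thumb often quoted for DRBD is to set the resync rate to roughly 30% of the available replication bandwidth. The arithmetic for a 1 Gbps link can be sketched as follows, assuming the network rather than the disks is the bottleneck and ignoring protocol overhead. The function name is made up for this example.

```python
def suggested_syncer_rate(link_mbit_per_s, fraction=0.3):
    """Return a resync rate in megabytes per second.

    link_mbit_per_s: raw link speed in megabits per second (assumed bottleneck).
    fraction: share of the link given to resync; 0.3 is the rule-of-thumb value.
    """
    link_mbyte_per_s = link_mbit_per_s / 8.0  # 8 bits per byte
    return link_mbyte_per_s * fraction


if __name__ == "__main__":
    ceiling = 1000 / 8.0                 # raw ceiling of a 1 Gbps link in MB/s
    rate = suggested_syncer_rate(1000)   # rule-of-thumb resync rate in MB/s
    print("link ceiling %.1f MB/s, rule-of-thumb syncer rate %.1f MB/s"
          % (ceiling, rate))
```

By this estimate a 1 Gbps link tops out well below 500 megabytes per second, so the configured value may be worth a second look, although this calculation alone does not show that it is related to the hang.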

My HA.CF config -

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0
keepalive 1
deadtime 10
warntime 5
initdead 120
udpport 6694
bcast   eth0            # Linux
ucast eth0 (IP_Address_Of_Secondary_Node)
auto_failback on
node    (FQDN_OF_PRI_NODE)     (FQDN_OF_SEC_NODE)
ping    (ROUTER_IP)
respawn hacluster /usr/lib/heartbeat/ipfail
use_logd yes
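
As a sanity check on the timing directives above, here is a small Python sketch that parses ha.cf-style text and reports a few relationships that are commonly recommended: warntime below deadtime, and initdead at least twice deadtime. It also notes when bcast and ucast are both configured on the same interface. It assumes all times are plain seconds (no "ms" suffix). The IP address in the sample is a documentation-range stand-in for the redacted one, and the function names are made up for this example.

```python
# Timing and media lines copied from the config above; the ucast address is a stand-in.
HA_CF = """\
keepalive 1
deadtime 10
warntime 5
initdead 120
udpport 6694
bcast   eth0            # Linux
ucast eth0 192.0.2.10
"""


def parse_ha_cf(text):
    """Collect directives into {name: [argument string, ...]}; comments are dropped."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        args = parts[1].strip() if len(parts) > 1 else ""
        directives.setdefault(parts[0], []).append(args)
    return directives


def check_ha_cf(directives):
    """Return a list of human-readable notes; an empty list means nothing was flagged."""
    def secs(name):
        values = directives.get(name)
        return float(values[0]) if values else None

    notes = []
    keepalive, warntime = secs("keepalive"), secs("warntime")
    deadtime, initdead = secs("deadtime"), secs("initdead")
    if keepalive is not None and deadtime is not None and keepalive >= deadtime:
        notes.append("keepalive should be well below deadtime")
    if warntime is not None and deadtime is not None and warntime >= deadtime:
        notes.append("warntime should be below deadtime")
    if initdead is not None and deadtime is not None and initdead < 2 * deadtime:
        notes.append("initdead should be at least twice deadtime")
    bcast_ifaces = set()
    for args in directives.get("bcast", []):
        bcast_ifaces.update(args.split())
    for args in directives.get("ucast", []):
        fields = args.split()
        if fields and fields[0] in bcast_ifaces:
            notes.append("both bcast and ucast are configured on %s" % fields[0])
    return notes


if __name__ == "__main__":
    for note in check_ha_cf(parse_ha_cf(HA_CF)):
        print(note)
```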

My HARESOURCES config - 

(FQDN_OF_PRI)  (CLUSTER_IP) drbddisk::r0 Filesystem::/dev/drbd1::/data::ext3 mysqld
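
To make the structure of that haresources line explicit (it is a single line in the file; the mail client wrapped it), here is a minimal Python sketch that splits it into the preferred node and an ordered list of resources, each with its "::"-separated arguments. Heartbeat starts the resources left to right and stops them right to left. The node name and IP below are hypothetical stand-ins for the redacted values, the function name is made up, and backslash-continued lines are not handled.

```python
# Stand-in for the line above; node name and cluster IP are hypothetical placeholders.
LINE = ("node1.example.com 192.0.2.50 drbddisk::r0 "
        "Filesystem::/dev/drbd1::/data::ext3 mysqld")


def parse_haresources_line(line):
    """Return (preferred_node, [(resource_name, [args...]), ...]) in start order."""
    fields = line.split()
    node = fields[0]
    resources = []
    for field in fields[1:]:
        parts = field.split("::")
        resources.append((parts[0], parts[1:]))
    return node, resources


if __name__ == "__main__":
    node, resources = parse_haresources_line(LINE)
    print("preferred node:", node)
    for name, args in resources:          # start order
        print("start", name, args)
    for name, args in reversed(resources):  # stop order
        print("stop ", name, args)
```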

If anyone has experienced this sort of behaviour before please let me know,
I cannot replicate the issue within my testing environment.

Any help would be much appreciated.

Kind Regards,

___________________________________________________		

Adam Taylor  |  Engineer  |  WML Software
Unit 3c  |  14-22 Triton Drive  |  Albany  |  Auckland

P.        +64 9 477 4555   |  F.    +64 9 478 6926
DDI.     +64 9 477 6375   |  MOB.    +64 21 621 519    
E.          adam.taylor at wml.co.nz
W.         www.wml.co.nz  |  www.compose.co.nz

WML Software
