Hi there,

I am currently running an HA environment that consists of the following:

- 2x Red Hat Enterprise Linux 5.1 ES servers
- Both running drbd-8.2.5-3
- Both running heartbeat-2.1.3
- DRBD's replication link is on its own private network (eth1, 10.1.1.X), connected through a 1 Gbps switch.
- Heartbeat runs over the LAN on eth0 (192.168.0.XXX), unicast.
- There are two separate HA clusters sharing the same replication switch. As you can see in the logs, they are set up to unicast on different ports, so I would assume this should be fine. (Maybe they should be on separate switches, or even VLAN'd?)

These are both production servers that serve MySQL, ColdFusion and httpd.

I am running into a strange problem: at around 2am most mornings the primary server becomes somewhat unresponsive. By "somewhat" I mean the following:

- I can still ping the primary node
- The cluster IP address is still up
- A cat of /proc/drbd shows the primary and secondary in their respective roles (not failed over)

The problem is that the primary can no longer be accessed remotely via ssh (or even VNC). At the physical server the console is completely unresponsive: neither keyboard nor mouse responds, which forces a physical shutdown of the server. When the server is shut down, the secondary correctly assumes the primary role, and once the primary is brought back online it rejoins the cluster and assumes its respective role correctly.

At the end of this email I will post my config files in case anyone can shed any light on the situation. The log files (/var/log/messages, ha-log, ha-debug) show no indication of what may be happening.
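Since the existing logs show nothing, one way to gather more evidence would be to record small periodic snapshots of system state and force each one to disk, so that the last record before the 2am hang survives the hard power-off. The following is only a minimal sketch under stated assumptions: it is not part of the running setup, the log path and the list of sampled files are hypothetical choices, and it assumes a Python 3 interpreter is available on the node (RHEL 5 ships an older Python by default).

```python
import os
import time

# Hypothetical choices for illustration: files to sample and where to log.
SOURCES = ["/proc/loadavg", "/proc/meminfo", "/proc/drbd"]
LOG_PATH = "/var/log/hang-snapshots.log"


def read_source(path):
    """Return the file's text, or a short marker if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().rstrip("\n")
    except OSError as err:
        return "<unreadable: %s>" % err.strerror


def format_snapshot(sources, now=None):
    """Build one timestamped record covering every source file."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    parts = ["=== %s ===" % stamp]
    for path in sources:
        parts.append("--- %s" % path)
        parts.append(read_source(path))
    return "\n".join(parts) + "\n"


def append_snapshot(log_path, sources):
    """Append one record and fsync it so it survives a hard power-off."""
    with open(log_path, "a") as log:
        log.write(format_snapshot(sources))
        log.flush()
        os.fsync(log.fileno())


def run_forever(interval_seconds=10):
    """Take a snapshot every interval_seconds until the process is killed."""
    while True:
        append_snapshot(LOG_PATH, SOURCES)
        time.sleep(interval_seconds)
```

Calling run_forever() from a script started before the usual failure time would leave a trail of load, memory and DRBD state leading up to the hang. If the hang is caused by stalled disk I/O, the logger itself will stall too, but the timestamp of the last complete record would still narrow down when things went wrong.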
My drbd.conf file:

global { usage-count yes; }
common { syncer { rate 500M; } }
resource r0 {
  protocol C;
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    #outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
    outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater";
    pri-lost "echo pri-lost. Check the Log Files. | mail -s 'DRBD Alert' root";
    #split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail - 'DRBD Alert' root";
  }
  startup { wfc-timeout 30; }
  disk { fencing resource-only; }
  net {
    cram-hmac-alg sha1;
    shared-secret "FooFunFactory";
  }
  on (FQDN_OF_PRI_NODE) {
    device /dev/drbd1;
    disk /dev/sda3;
    address 10.1.1.2:7789;
    meta-disk internal;
  }
  on (FQDN_OF_SEC_NODE) {
    device /dev/drbd1;
    disk /dev/sda3;
    address 10.1.1.5:7789;
    meta-disk internal;
  }
}

My ha.cf config:

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 1
deadtime 10
warntime 5
initdead 120
udpport 6694
bcast eth0 # Linux
ucast eth0 (IP_Address_Of_Secondary_Node)
auto_failback on
node (FQDN_OF_PRI_NODE) (FQDN_OF_SEC_NODE)
ping (ROUTER_IP)
respawn hacluster /usr/lib/heartbeat/ipfail
use_logd yes

My haresources config:

(FQDN_OF_PRI) (CLUSTER_IP) drbddisk::r0 Filesystem::/dev/drbd1::/data::ext3 mysqld

If anyone has experienced this sort of behaviour before, please let me know. I cannot replicate the issue in my testing environment.

Any help would be much appreciated.

Kind Regards,
___________________________________________________
Adam Taylor | Engineer | WML Software
Unit 3c | 14-22 Triton Drive | Albany | Auckland
P. +64 9 477 4555 | F. +64 9 478 6926
DDI. +64 9 477 6375 | MOB. +64 21 621 519
E. <mailto:adam.taylor at wml.co.nz> adam.taylor at wml.co.nz
W.
<http://www.wml.co.nz/> www.wml.co.nz | <http://www.compose.co.nz/> www.compose.co.nz
WML Software
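
A side note on the "rate 500M" syncer setting in the drbd.conf above: DRBD reads the syncer rate in bytes per second, not bits, so 500M asks for roughly 500 megabytes per second of resync traffic, while a gigabit link can carry about 125 megabytes per second at the very most. The DRBD User's Guide suggests about 30% of the available replication bandwidth as a rule of thumb. The sketch below just does that arithmetic. The function names are made up for illustration, protocol overhead and disk throughput are ignored, and nothing here establishes that the rate setting is related to the 2am hang, since the rate only applies during background resynchronisation.

```python
def link_bytes_per_second(link_bits_per_second):
    """Convert a link speed in bits/s to a theoretical ceiling in bytes/s."""
    return link_bits_per_second / 8.0


def suggested_syncer_rate_mb(link_bits_per_second, fraction=0.3):
    """Rule-of-thumb resync rate in MB/s: a fraction of the raw link capacity."""
    return link_bytes_per_second(link_bits_per_second) * fraction / 1e6


GIGABIT = 10**9       # replication link speed from the post, in bits/s
configured_mb = 500   # "rate 500M" from the posted drbd.conf

link_mb = link_bytes_per_second(GIGABIT) / 1e6
suggested_mb = suggested_syncer_rate_mb(GIGABIT)

print("link ceiling:    %.1f MB/s" % link_mb)
print("30%% of link:     %.1f MB/s" % suggested_mb)
print("configured rate: %d MB/s (%.1fx the link ceiling)"
      % (configured_mb, configured_mb / link_mb))
```

With real-world overhead the usable gigabit throughput is lower than the theoretical ceiling, so the rule-of-thumb figure would come out somewhat smaller still.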