Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Shane Swartz wrote:
> I'm running DRBD 0.7.4 and Heartbeat 1.2.3 on a cluster running Debian
> 2.4. I experience the problem when I simulate a failure on the primary
> node by executing a reboot while writing data to the file system
> controlled by DRBD. The failover node takes control of all resources as
> it should during the reboot, but once the primary node comes up
> heartbeat on the failover is killed causing the primary node to take
> control again. I have auto_failback set to off.
>
> When there are no current writes to the file system controlled by DRBD
> on the primary node the failover node maintains control of the resources
> once the primary comes back up. That is the behavior that I would
> expect when auto_failback is set to off.
>
> I have executed heartbeat standby on the primary while writing to the
> file system controlled by DRBD and the failover takes control of the
> resources and maintains control when heartbeat on the primary node is
> restarted.
>
> I would like to know if I can get the failover node to maintain control
> of all resources when the primary is rebooted no matter if data was
> being written to a DRBD controlled file system or not. Is DRBD and
> heartbeat functioning as designed or is there a way to get it to
> function like I want?
>
> Below is a portion of the syslog from the failover node. The messages
> start from when the rebooted primary node just starts coming back up.
>
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex
> nodeB ccm[25181]: info: ccm_joining_to_joined: cookie changed
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB heartbeat[25160]: WARN: node nodeA: is dead
> nodeB heartbeat[25160]: info: Dead node nodeA gave up resources.
> nodeB heartbeat[25160]: info: Link nodeA:/dev/ttyS0 dead.
> nodeB heartbeat[25160]: info: Link nodeA:eth1 dead.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA//dev/ttyS0 now has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFConnection --> WFReportParams
> nodeB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> nodeB kernel: drbd0: Connection established.
> nodeB kernel: drbd0: I am(P): 1:00000002:00000001:0000006e:00000002:10
> nodeB kernel: drbd0: Peer(S): 1:00000002:00000001:0000006d:00000002:01
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFReportParams --> WFBitMapS
> nodeB kernel: drbd0: Primary/Unknown --> Primary/Secondary
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFBitMapS --> SyncSource
> nodeB kernel: drbd0: Resync started as SyncSource (need to sync 20 KB [5 bits set]).
> nodeB kernel: drbd0: Resync done (total 2 sec; paused 0 sec; 8 K/sec)
> nodeB kernel: drbd0: drbd0_worker [25514]: cstate SyncSource --> Connected
> nodeB heartbeat[25160]: info: Heartbeat restart on node nodeA
> nodeB heartbeat[25160]: info: Link nodeA:eth1 up.
> nodeB heartbeat[25160]: info: Status update for node nodeA: status up
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status up
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status up
> nodeB heartbeat[25521]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25165]: ERROR: read_child send: RCs: 1 0

This isn't right. The word ERROR should never appear in any heartbeat
logs. This may be because of the read_child() process dying below. Too
bad you stripped off all the timestamps... :-(

[It is more than possible that heartbeat would get this message before
noticing that its child had died.]

> nodeB heartbeat[25160]: info: Status update for node nodeA: status active
> nodeB heartbeat[25525]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status active
> nodeB heartbeat[25160]: info: remote resource transition completed.
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB ipfail[25182]: debug: Other side is unstable.
> nodeB ipfail[25182]: debug: Other side is now stable.
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25160]: ERROR: Exiting HBREAD process 25165 killed by signal 11.

Hmmm... Version 1.2.3, and a read_child() process killed by SIGSEGV.
That's definitely not good... Did you perchance shut down the network
without shutting down heartbeat?

Getting a core dump from a read_child() is a little tricky, since it
runs as nobody with an unwritable curdir... Here's what you have to do:

  Start heartbeat with "ulimit -c unlimited".
  chmod 777 /etc/ha.d   (yes, it sucks. Sorry.)

Then reproduce the problem that causes the SIGSEGV. You should now have
a nice core file in /etc/ha.d/.

If you can't get this going, we can meet for lunch and work this out -
assuming you're at UCAR in Boulder.

-- 
    Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions."
    - William Wilberforce
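For anyone trying to reproduce this, the core-capture steps above boil
down to something like the shell sketch below. It is only a sketch: it
assumes heartbeat is started from an init script at
/etc/init.d/heartbeat and that read_child() will dump core into
/etc/ha.d/ once that directory is writable; adjust both for your own
installation.

    #!/bin/sh
    # Rough sketch of the core-capture procedure described above.
    # Assumption: heartbeat's init script lives at /etc/init.d/heartbeat.

    ulimit -c unlimited            # allow core files in this shell and its children
    chmod 777 /etc/ha.d            # read_child runs as nobody; give it a writable cwd
    /etc/init.d/heartbeat restart  # (re)start heartbeat so it inherits the ulimit

    # Now reproduce the failure (e.g. reboot the primary while writing to
    # the DRBD-backed file system), then check for a core file:
    ls -l /etc/ha.d/core*

A gdb backtrace taken against that core file and the heartbeat binary is
what the developers will want to see.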