Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Shane Swartz wrote:
> I'm running DRBD 0.7.4 and Heartbeat 1.2.3 on a cluster running Debian
> 2.4. I experience the problem when I simulate a failure on the primary
> node by executing a reboot while writing data to the file system
> controlled by DRBD. The failover node takes control of all resources as
> it should during the reboot, but once the primary node comes up,
> heartbeat on the failover is killed, causing the primary node to take
> control again. I have auto_failback set to off.
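(For reference: auto_failback lives in /etc/ha.d/ha.cf. A minimal sketch of
the relevant lines, with the node and link names guessed from the log below
rather than taken from an actual config:

    auto_failback off      # don't move resources back when the old primary returns
    node nodeA nodeB       # cluster members
    serial /dev/ttyS0      # serial heartbeat link
    bcast eth1             # network heartbeat link
)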
>
> When there are no current writes to the file system controlled by DRBD
> on the primary node, the failover node maintains control of the resources
> once the primary comes back up. That is the behavior that I would
> expect when auto_failback is set to off.
>
> I have executed heartbeat standby on the primary while writing to the
> file system controlled by DRBD, and the failover takes control of the
> resources and maintains control when heartbeat on the primary node is
> restarted.
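(The "heartbeat standby" step is normally done with the hb_standby helper
that ships with heartbeat 1.x, run on the node that should give up its
resources; the path varies by distro, but it is typically something like:

    /usr/lib/heartbeat/hb_standby    # hand all local resources over to the peer
)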
>
> I would like to know if I can get the failover node to maintain control
> of all resources when the primary is rebooted, regardless of whether data
> was being written to a DRBD-controlled file system. Are DRBD and
> heartbeat functioning as designed, or is there a way to get them to
> function the way I want?
>
> Below is a portion of the syslog from the failover node. The messages
> start from the point where the rebooted primary node is just coming back up.
>
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex
> nodeB ccm[25181]: info: ccm_joining_to_joined: cookie changed
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB heartbeat[25160]: WARN: node nodeA: is dead
> nodeB heartbeat[25160]: info: Dead node nodeA gave up resources.
> nodeB heartbeat[25160]: info: Link nodeA:/dev/ttyS0 dead.
> nodeB heartbeat[25160]: info: Link nodeA:eth1 dead.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA//dev/ttyS0 now
> has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has
> status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFConnection -->
> WFReportParams
> nodeB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> nodeB kernel: drbd0: Connection established.
> nodeB kernel: drbd0: I am(P): 1:00000002:00000001:0000006e:00000002:10
> nodeB kernel: drbd0: Peer(S): 1:00000002:00000001:0000006d:00000002:01
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFReportParams -->
> WFBitMapS
> nodeB kernel: drbd0: Primary/Unknown --> Primary/Secondary
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFBitMapS -->
> SyncSource
> nodeB kernel: drbd0: Resync started as SyncSource (need to sync 20 KB [5
> bits set]).
> nodeB kernel: drbd0: Resync done (total 2 sec; paused 0 sec; 8 K/sec)
> nodeB kernel: drbd0: drbd0_worker [25514]: cstate SyncSource --> Connected
> nodeB heartbeat[25160]: info: Heartbeat restart on node nodeA
> nodeB heartbeat[25160]: info: Link nodeA:eth1 up.
> nodeB heartbeat[25160]: info: Status update for node nodeA: status up
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has
> status up
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status up
> nodeB heartbeat[25521]: debug: notify_world: setting SIGCHLD Handler to
> SIG_DFL
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25165]: ERROR: read_child send: RCs: 1 0
This isn't right. The word ERROR should never appear in any heartbeat
logs. This may be because of the read_child() process dying below. Too
bad you stripped off all the timestamps... :-( [It is more than possible
that heartbeat would get this message before noticing that its child had
died.]
> nodeB heartbeat[25160]: info: Status update for node nodeA: status active
> nodeB heartbeat[25525]: debug: notify_world: setting SIGCHLD Handler to
> SIG_DFL
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status active
> nodeB heartbeat[25160]: info: remote resource transition completed.
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB ipfail[25182]: debug: Other side is unstable.
> nodeB ipfail[25182]: debug: Other side is now stable.
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25160]: ERROR: Exiting HBREAD process 25165 killed by
> signal 11.
Hmmm... Version 1.2.3, and a read_child() process killed by SIGSEGV.
That's definitely not good...
Did you perchance shut down the network without shutting down heartbeat?
Getting a core dump from a read_child() is a little tricky since it runs as
nobody with an unwritable curdir... Here's what you have to do:
  1. start heartbeat with "ulimit -c unlimited" in effect
  2. chmod 777 /etc/ha.d (yes, it sucks. Sorry.)
  3. then reproduce the problem that causes the SIGSEGV
You should now have a nice core file in /etc/ha.d/
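In shell terms the whole procedure is roughly this (the heartbeat binary
path is a guess on my part; adjust it for your install):

    ulimit -c unlimited            # allow core files in this shell
    /etc/init.d/heartbeat start    # start heartbeat from that same shell
    chmod 777 /etc/ha.d            # so the "nobody" read_child can write its core there
    # ... reproduce the crash ...
    ls -l /etc/ha.d/core*          # the core from the dead read_child
    gdb /usr/lib/heartbeat/heartbeat /etc/ha.d/core*   # then "bt" for a backtrace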
If you can't get this going, we can meet for lunch and work this out -
assuming you're at UCAR in Boulder.
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce