Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Shane Swartz wrote:
> I'm running DRBD 0.7.4 and Heartbeat 1.2.3 on a cluster running Debian
> 2.4. I experience the problem when I simulate a failure on the primary
> node by executing a reboot while writing data to the file system
> controlled by DRBD. The failover node takes control of all resources as
> it should during the reboot, but once the primary node comes up,
> heartbeat on the failover is killed, causing the primary node to take
> control again. I have auto_failback set to off.
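(For reference: auto_failback lives in /etc/ha.d/ha.cf. A minimal sketch of
the relevant lines, with the node and link names guessed from the log below
rather than taken from an actual config:

    auto_failback off      # don't move resources back when the old primary returns
    node nodeA nodeB       # cluster members
    serial /dev/ttyS0      # serial heartbeat link
    bcast eth1             # network heartbeat link
)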
>
> When there are no current writes to the file system controlled by DRBD
> on the primary node, the failover node maintains control of the resources
> once the primary comes back up. That is the behavior that I would
> expect when auto_failback is set to off.
>
> I have executed heartbeat standby on the primary while writing to the
> file system controlled by DRBD, and the failover takes control of the
> resources and maintains control when heartbeat on the primary node is
> restarted.
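(The "heartbeat standby" step is normally done with the hb_standby helper
that ships with heartbeat 1.x, run on the node that should give up its
resources; the path varies by distro, but it is typically something like:

    /usr/lib/heartbeat/hb_standby    # hand all local resources over to the peer
)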
>
> I would like to know if I can get the failover node to maintain control
> of all resources when the primary is rebooted, regardless of whether data
> was being written to a DRBD-controlled file system. Are DRBD and
> heartbeat functioning as designed, or is there a way to get them to
> function the way I want?
>
> Below is a portion of the syslog from the failover node. The messages
> start from the point where the rebooted primary node is just coming back up.
>
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex
> nodeB ccm[25181]: info: ccm_joining_to_joined: cookie changed
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB heartbeat[25160]: WARN: node nodeA: is dead
> nodeB heartbeat[25160]: info: Dead node nodeA gave up resources.
> nodeB heartbeat[25160]: info: Link nodeA:/dev/ttyS0 dead.
> nodeB heartbeat[25160]: info: Link nodeA:eth1 dead.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA//dev/ttyS0 now
> has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has
> status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFConnection -->
> WFReportParams
> nodeB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> nodeB kernel: drbd0: Connection established.
> nodeB kernel: drbd0: I am(P): 1:00000002:00000001:0000006e:00000002:10
> nodeB kernel: drbd0: Peer(S): 1:00000002:00000001:0000006d:00000002:01
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFReportParams -->
> WFBitMapS
> nodeB kernel: drbd0: Primary/Unknown --> Primary/Secondary
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFBitMapS -->
> SyncSource
> nodeB kernel: drbd0: Resync started as SyncSource (need to sync 20 KB [5
> bits set]).
> nodeB kernel: drbd0: Resync done (total 2 sec; paused 0 sec; 8 K/sec)
> nodeB kernel: drbd0: drbd0_worker [25514]: cstate SyncSource --> Connected
> nodeB heartbeat[25160]: info: Heartbeat restart on node nodeA
> nodeB heartbeat[25160]: info: Link nodeA:eth1 up.
> nodeB heartbeat[25160]: info: Status update for node nodeA: status up
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has
> status up
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status up
> nodeB heartbeat[25521]: debug: notify_world: setting SIGCHLD Handler to
> SIG_DFL
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25165]: ERROR: read_child send: RCs: 1 0
This isn't right. The word ERROR should never appear in any heartbeat
logs. This may be because of the read_child() process dying below. Too
bad you stripped off all the timestamps... :-( [It is more than possible
that heartbeat would get this message before noticing that its child had
died.]
> nodeB heartbeat[25160]: info: Status update for node nodeA: status active
> nodeB heartbeat[25525]: debug: notify_world: setting SIGCHLD Handler to
> SIG_DFL
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status active
> nodeB heartbeat[25160]: info: remote resource transition completed.
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB ipfail[25182]: debug: Other side is unstable.
> nodeB ipfail[25182]: debug: Other side is now stable.
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25160]: ERROR: Exiting HBREAD process 25165 killed by
> signal 11.
Hmmm... Version 1.2.3, and a read_child() process killed by SIGSEGV.
That's definitely not good...
Did you perchance shut down the network without shutting down heartbeat?
Getting a core dump from a read_child() is a little tricky since it runs as
nobody with an unwritable curdir... Here's what you have to do:
  1. start heartbeat with "ulimit -c unlimited" in effect
  2. chmod 777 /etc/ha.d (yes, it sucks. Sorry.)
  3. then reproduce the problem that causes the SIGSEGV
You should now have a nice core file in /etc/ha.d/
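In shell terms the whole procedure is roughly this (the heartbeat binary
path is a guess on my part; adjust it for your install):

    ulimit -c unlimited            # allow core files in this shell
    /etc/init.d/heartbeat start    # start heartbeat from that same shell
    chmod 777 /etc/ha.d            # so the "nobody" read_child can write its core there
    # ... reproduce the crash ...
    ls -l /etc/ha.d/core*          # the core from the dead read_child
    gdb /usr/lib/heartbeat/heartbeat /etc/ha.d/core*   # then "bt" for a backtrace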
If you can't get this going, we can meet for lunch and work this out -
assuming you're at UCAR in Boulder.
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me claim
from you at all times your undisguised opinions." - William Wilberforce