Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm running DRBD 0.7.4 and Heartbeat 1.2.3 on a cluster running Debian 2.4. I see the problem when I simulate a failure on the primary node by rebooting it while data is being written to the file system controlled by DRBD. The failover node takes over all resources as it should during the reboot, but once the primary node comes back up, heartbeat on the failover node is killed, causing the primary node to take control again. I have auto_failback set to off.

When there are no writes in progress to the DRBD-controlled file system on the primary node, the failover node keeps the resources once the primary comes back up, which is the behavior I would expect with auto_failback set to off. I have also put the primary into heartbeat standby while writing to the DRBD-controlled file system; the failover node takes over the resources and keeps them when heartbeat on the primary node is restarted.

I would like the failover node to keep control of all resources when the primary is rebooted, whether or not data was being written to a DRBD-controlled file system at the time. Are DRBD and Heartbeat functioning as designed, or is there a way to get the behavior I want?

Below is a portion of the syslog from the failover node. The messages start from when the rebooted primary node is just coming back up. A sketch of the configuration behind this setup follows the log.

nodeB kernel: e1000: eth1 NIC Link is Down
nodeB kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex
nodeB ccm[25181]: info: ccm_joining_to_joined: cookie changed
nodeB kernel: e1000: eth1 NIC Link is Down
nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
nodeB heartbeat[25160]: WARN: node nodeA: is dead
nodeB heartbeat[25160]: info: Dead node nodeA gave up resources.
nodeB heartbeat[25160]: info: Link nodeA:/dev/ttyS0 dead.
nodeB heartbeat[25160]: info: Link nodeA:eth1 dead.
nodeB ipfail[25182]: info: Link Status update: Link nodeA//dev/ttyS0 now has status dead
nodeB ipfail[25182]: debug: Found ping node group_one!
nodeB ipfail[25182]: debug: Found ping node group_two!
nodeB ipfail[25182]: info: Asking other side for ping node count.
nodeB ipfail[25182]: debug: Message [num_ping] sent.
nodeB ipfail[25182]: info: Checking remote count of ping nodes.
nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status dead
nodeB ipfail[25182]: debug: Found ping node group_one!
nodeB ipfail[25182]: debug: Found ping node group_two!
nodeB ipfail[25182]: info: Asking other side for ping node count.
nodeB ipfail[25182]: debug: Message [num_ping] sent.
nodeB ipfail[25182]: info: Checking remote count of ping nodes.
nodeB kernel: e1000: eth1 NIC Link is Down
nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFConnection --> WFReportParams
nodeB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
nodeB kernel: drbd0: Connection established.
nodeB kernel: drbd0: I am(P): 1:00000002:00000001:0000006e:00000002:10
nodeB kernel: drbd0: Peer(S): 1:00000002:00000001:0000006d:00000002:01
nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFReportParams --> WFBitMapS
nodeB kernel: drbd0: Primary/Unknown --> Primary/Secondary
nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFBitMapS --> SyncSource
nodeB kernel: drbd0: Resync started as SyncSource (need to sync 20 KB [5 bits set]).
nodeB kernel: drbd0: Resync done (total 2 sec; paused 0 sec; 8 K/sec)
nodeB kernel: drbd0: drbd0_worker [25514]: cstate SyncSource --> Connected
nodeB heartbeat[25160]: info: Heartbeat restart on node nodeA
nodeB heartbeat[25160]: info: Link nodeA:eth1 up.
nodeB heartbeat[25160]: info: Status update for node nodeA: status up
nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status up
nodeB ipfail[25182]: info: Status update: Node nodeA now has status up
nodeB heartbeat[25521]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
nodeB heartbeat[25165]: ERROR: read_child send: RCs: 1 0
nodeB heartbeat[25160]: info: Status update for node nodeA: status active
nodeB heartbeat[25525]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
nodeB ipfail[25182]: info: Status update: Node nodeA now has status active
nodeB heartbeat[25160]: info: remote resource transition completed.
nodeB ccm[25181]: debug: received message resource orig=nodeB
nodeB ipfail[25182]: debug: Other side is unstable.
nodeB ipfail[25182]: debug: Other side is now stable.
nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
nodeB heartbeat[25160]: ERROR: Exiting HBREAD process 25165 killed by signal 11.
nodeB heartbeat[25160]: ERROR: Core heartbeat process died! Restarting.
nodeB heartbeat[25160]: info: Heartbeat shutdown in progress. (25160)
nodeB ccm[25181]: debug: received message starting orig=nodeA
nodeB heartbeat[25529]: info: Giving up all HA resources.
nodeB ipfail[25182]: debug: Got join message from another ipfail client. (nodeA)
nodeB ipfail[25182]: debug: Found ping node group_one!
nodeB ipfail[25182]: debug: Found ping node group_two!
nodeB ipfail[25182]: info: Asking other side for ping node count.
nodeB ipfail[25182]: debug: Message [num_ping] sent.
nodeB ipfail[25182]: info: No giveup timer to abort.
nodeB ccm[25181]: debug: received message resource orig=nodeA
nodeB ccm[25181]: debug: received message resource orig=nodeB
nodeB ccm[25181]: debug: received message resource orig=nodeA
nodeB heartbeat: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
nodeB ccm[25181]: debug: received message resource orig=nodeB
nodeB heartbeat: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop done. RC=0
nodeB heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r0 stop
nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/drbddisk r0 stop
nodeB kernel: drbd0: Primary/Secondary --> Secondary/Secondary
nodeB heartbeat: debug: /etc/ha.d/resource.d/drbddisk r0 stop done. RC=0
nodeB heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
nodeB heartbeat: info: /sbin/route -n del -host 192.168.0.103
nodeB heartbeat: info: /sbin/ifconfig eth0:0 down
nodeB heartbeat: info: IP Address 192.168.0.103 released
nodeB heartbeat: debug: /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop done. RC=0
nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ccm process group 25181 with signal 15
nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
nodeB heartbeat[25529]: info: All HA resources relinquished.
nodeB heartbeat[25160]: info: EOF from client pid 25181
nodeB heartbeat[25160]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
nodeB heartbeat[25160]: info: EOF from client pid 25182
nodeB heartbeat[25160]: info: killing HBWRITE process 25168 with signal 15
nodeB heartbeat[25160]: info: killing HBREAD process 25169 with signal 15
nodeB heartbeat[25160]: info: killing HBWRITE process 25170 with signal 15
nodeB heartbeat[25160]: info: killing HBREAD process 25171 with signal 15
nodeB heartbeat[25160]: info: killing HBFIFO process 25163 with signal 15
nodeB heartbeat[25160]: info: killing HBWRITE process 25164 with signal 15
nodeB heartbeat[25160]: info: killing HBWRITE process 25166 with signal 15
nodeB heartbeat[25160]: info: killing HBREAD process 25167 with signal 15
nodeB heartbeat[25160]: info: Core process 25171 exited. 8 remaining
nodeB heartbeat[25160]: info: Core process 25163 exited. 7 remaining
nodeB heartbeat[25160]: info: Core process 25166 exited. 6 remaining
nodeB heartbeat[25160]: info: Core process 25164 exited. 5 remaining
nodeB heartbeat[25160]: info: Core process 25167 exited. 4 remaining
nodeB heartbeat[25160]: info: Core process 25168 exited. 3 remaining
nodeB heartbeat[25160]: info: Core process 25169 exited. 2 remaining
nodeB heartbeat[25160]: info: Core process 25170 exited. 1 remaining
nodeB heartbeat[25160]: info: Heartbeat shutdown complete.
nodeB heartbeat[25160]: info: Heartbeat restart triggered.
nodeB heartbeat[25160]: info: Restarting heartbeat.
nodeB heartbeat[25160]: info: Performing heartbeat restart exec.
nodeB kernel: drbd0: Secondary/Secondary --> Secondary/Primary
nodeB heartbeat[25160]: info: **************************
nodeB heartbeat[25160]: info: Configuration validated. Starting heartbeat 1.2.3
nodeB heartbeat[26586]: info: heartbeat: version 1.2.3
nodeB heartbeat[26586]: info: Heartbeat generation: 67
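
For reference, here is a minimal sketch of the ha.cf and haresources entries this setup implies. The timing values and the ping group addresses are placeholders, not my actual configuration; the node names, interfaces, ping group names, and resources are taken from the log above.

# /etc/ha.d/ha.cf (sketch; keepalive/deadtime and ping addresses are placeholders)
keepalive 2
deadtime 30
serial /dev/ttyS0
bcast eth1
auto_failback off
ping_group group_one 192.168.0.1
ping_group group_two 192.168.0.2
respawn hacluster /usr/lib/heartbeat/ipfail
node nodeA
node nodeB

# /etc/ha.d/haresources (sketch; matches the resources stopped in the log)
nodeA drbddisk::r0 Filesystem::/dev/drbd0::/d1::ext3 192.168.0.103/24/eth0/192.168.0.255

With auto_failback off I would expect nodeB to keep these resources after nodeA rejoins; the log suggests the failback happens only because an HBREAD child on nodeB dies with signal 11, the core heartbeat process restarts itself, and that restart gives up all HA resources, letting nodeA take them back.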