Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Shane,

I'm running some similar tests right now and I am wondering what your
ha.cf file looks like - do you mind sharing?

Thanks,
Dan
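Shane's ha.cf never made it into the thread. For readers finding this in
the archive: the log quoted below implies a two-node cluster (nodeA/nodeB)
with a serial heartbeat on /dev/ttyS0, a network heartbeat on eth1, ipfail
with two ping groups, and auto_failback off. A minimal ha.cf consistent
with that might look like the following sketch; the timings, ping
addresses, and log settings are illustrative guesses, not Shane's actual
values.

    # /etc/ha.d/ha.cf -- sketch consistent with the quoted log; timings
    # and ping addresses are guesses, not Shane's actual configuration
    logfacility   local0
    keepalive     2                 # seconds between heartbeats
    warntime      10
    deadtime      30                # declare the peer dead after 30s
    initdead      120               # allow extra time at boot
    udpport       694
    serial        /dev/ttyS0        # serial link seen in the log
    baud          19200
    bcast         eth1              # dedicated heartbeat NIC seen in the log
    auto_failback off               # as described in the post
    node          nodeA
    node          nodeB
    ping_group    group_one 192.168.0.1 192.168.0.2   # addresses are guesses
    ping_group    group_two 192.168.0.3 192.168.0.4   # addresses are guesses
    respawn       hacluster /usr/lib/heartbeat/ipfail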
> -----Original Message-----
> From: Shane Swartz [mailto:sswartz at ucar.edu]
> Sent: Friday, September 24, 2004 5:31 PM
> To: drbd-user at lists.linbit.com
> Subject: [DRBD-user] Failback problem with Heartbeat and DRBD
>
> I'm running DRBD 0.7.4 and Heartbeat 1.2.3 on a cluster running
> Debian 2.4. I see the problem when I simulate a failure on the
> primary node by rebooting it while data is being written to the file
> system controlled by DRBD. The failover node takes control of all
> resources during the reboot, as it should, but once the primary node
> comes back up, heartbeat on the failover node is killed, causing the
> primary node to take control again. I have auto_failback set to off.
>
> When nothing is being written to the DRBD-controlled file system on
> the primary node, the failover node keeps control of the resources
> once the primary comes back up. That is the behavior I would expect
> with auto_failback set to off.
>
> I have also run heartbeat standby on the primary while writing to the
> DRBD-controlled file system; the failover node takes control of the
> resources and keeps it when heartbeat on the primary node is
> restarted.
>
> I would like the failover node to keep control of all resources when
> the primary is rebooted, regardless of whether data was being written
> to a DRBD-controlled file system at the time. Are DRBD and heartbeat
> functioning as designed, or is there a way to make them behave the
> way I want?
>
> Below is a portion of the syslog from the failover node. The messages
> start at the point where the rebooted primary node is just coming
> back up.
>
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex
> nodeB ccm[25181]: info: ccm_joining_to_joined: cookie changed
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB heartbeat[25160]: WARN: node nodeA: is dead
> nodeB heartbeat[25160]: info: Dead node nodeA gave up resources.
> nodeB heartbeat[25160]: info: Link nodeA:/dev/ttyS0 dead.
> nodeB heartbeat[25160]: info: Link nodeA:eth1 dead.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA//dev/ttyS0 now has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFConnection --> WFReportParams
> nodeB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> nodeB kernel: drbd0: Connection established.
> nodeB kernel: drbd0: I am(P): 1:00000002:00000001:0000006e:00000002:10
> nodeB kernel: drbd0: Peer(S): 1:00000002:00000001:0000006d:00000002:01
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFReportParams --> WFBitMapS
> nodeB kernel: drbd0: Primary/Unknown --> Primary/Secondary
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFBitMapS --> SyncSource
> nodeB kernel: drbd0: Resync started as SyncSource (need to sync 20 KB [5 bits set]).
> nodeB kernel: drbd0: Resync done (total 2 sec; paused 0 sec; 8 K/sec)
> nodeB kernel: drbd0: drbd0_worker [25514]: cstate SyncSource --> Connected
> nodeB heartbeat[25160]: info: Heartbeat restart on node nodeA
> nodeB heartbeat[25160]: info: Link nodeA:eth1 up.
> nodeB heartbeat[25160]: info: Status update for node nodeA: status up
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status up
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status up
> nodeB heartbeat[25521]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25165]: ERROR: read_child send: RCs: 1 0
> nodeB heartbeat[25160]: info: Status update for node nodeA: status active
> nodeB heartbeat[25525]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status active
> nodeB heartbeat[25160]: info: remote resource transition completed.
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB ipfail[25182]: debug: Other side is unstable.
> nodeB ipfail[25182]: debug: Other side is now stable.
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25160]: ERROR: Exiting HBREAD process 25165 killed by signal 11.
> nodeB heartbeat[25160]: ERROR: Core heartbeat process died! Restarting.
> nodeB heartbeat[25160]: info: Heartbeat shutdown in progress. (25160)
> nodeB ccm[25181]: debug: received message starting orig=nodeA
> nodeB heartbeat[25529]: info: Giving up all HA resources.
> nodeB ipfail[25182]: debug: Got join message from another ipfail client. (nodeA)
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: No giveup timer to abort.
> nodeB ccm[25181]: debug: received message resource orig=nodeA
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB ccm[25181]: debug: received message resource orig=nodeA
> nodeB heartbeat: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
> nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB heartbeat: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop done. RC=0
> nodeB heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r0 stop
> nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/drbddisk r0 stop
> nodeB kernel: drbd0: Primary/Secondary --> Secondary/Secondary
> nodeB heartbeat: debug: /etc/ha.d/resource.d/drbddisk r0 stop done. RC=0
> nodeB heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
> nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
> nodeB heartbeat: info: /sbin/route -n del -host 192.168.0.103
> nodeB heartbeat: info: /sbin/ifconfig eth0:0 down
> nodeB heartbeat: info: IP Address 192.168.0.103 released
> nodeB heartbeat: debug: /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop done. RC=0
> nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ccm process group 25181 with signal 15
> nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
> nodeB heartbeat[25529]: info: All HA resources relinquished.
> nodeB heartbeat[25160]: info: EOF from client pid 25181
> nodeB heartbeat[25160]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
> nodeB heartbeat[25160]: info: EOF from client pid 25182
> nodeB heartbeat[25160]: info: killing HBWRITE process 25168 with signal 15
> nodeB heartbeat[25160]: info: killing HBREAD process 25169 with signal 15
> nodeB heartbeat[25160]: info: killing HBWRITE process 25170 with signal 15
> nodeB heartbeat[25160]: info: killing HBREAD process 25171 with signal 15
> nodeB heartbeat[25160]: info: killing HBFIFO process 25163 with signal 15
> nodeB heartbeat[25160]: info: killing HBWRITE process 25164 with signal 15
> nodeB heartbeat[25160]: info: killing HBWRITE process 25166 with signal 15
> nodeB heartbeat[25160]: info: killing HBREAD process 25167 with signal 15
> nodeB heartbeat[25160]: info: Core process 25171 exited. 8 remaining
> nodeB heartbeat[25160]: info: Core process 25163 exited. 7 remaining
> nodeB heartbeat[25160]: info: Core process 25166 exited. 6 remaining
> nodeB heartbeat[25160]: info: Core process 25164 exited. 5 remaining
> nodeB heartbeat[25160]: info: Core process 25167 exited. 4 remaining
> nodeB heartbeat[25160]: info: Core process 25168 exited. 3 remaining
> nodeB heartbeat[25160]: info: Core process 25169 exited. 2 remaining
> nodeB heartbeat[25160]: info: Core process 25170 exited. 1 remaining
> nodeB heartbeat[25160]: info: Heartbeat shutdown complete.
> nodeB heartbeat[25160]: info: Heartbeat restart triggered.
> nodeB heartbeat[25160]: info: Restarting heartbeat.
> nodeB heartbeat[25160]: info: Performing heartbeat restart exec.
> nodeB kernel: drbd0: Secondary/Secondary --> Secondary/Primary
> nodeB heartbeat[25160]: info: **************************
> nodeB heartbeat[25160]: info: Configuration validated. Starting heartbeat 1.2.3
> nodeB heartbeat[26586]: info: heartbeat: version 1.2.3
> nodeB heartbeat[26586]: info: Heartbeat generation: 67
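The resource stop sequence in the quoted log (Filesystem, then drbddisk,
then IPaddr - heartbeat stops resources in the reverse of their start
order) implies a haresources line roughly like the sketch below. The
device, mount point, and IP details are taken from the log; that nodeA is
the preferred node is an assumption.

    # /etc/ha.d/haresources -- reconstructed from the stop sequence in
    # the log; resources start left-to-right and stop right-to-left
    nodeA IPaddr::192.168.0.103/24/eth0/192.168.0.255 \
          drbddisk::r0 \
          Filesystem::/dev/drbd0::/d1::ext3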
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
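For anyone reconstructing this setup from the archive: the DRBD side the
log refers to (resource r0 on device /dev/drbd0, DRBD 0.7.x) might look
roughly like the sketch below. Only r0 and /dev/drbd0 come from the
thread; the backing disks, replication addresses, and syncer rate are
illustrative guesses.

    # /etc/drbd.conf -- DRBD 0.7.x-style sketch; only the resource name
    # and device come from the thread, everything else is a guess
    resource r0 {
      protocol C;
      startup { degr-wfc-timeout 120; }
      disk    { on-io-error detach; }
      syncer  { rate 10M; }            # rate is a guess
      on nodeA {
        device    /dev/drbd0;          # device seen in the log
        disk      /dev/sda3;           # backing disk is a guess
        address   192.168.1.1:7788;    # replication address is a guess
        meta-disk internal;
      }
      on nodeB {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   192.168.1.2:7788;
        meta-disk internal;
      }
    }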