Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Shane,

I'm running some similar tests right now and I am wondering what your
ha.cf file looks like - do you mind sharing?

Thanks,
Dan
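Shane's ha.cf never made it into the thread. For readers finding this in
the archive: the log quoted below implies a two-node cluster (nodeA/nodeB)
with a serial heartbeat on /dev/ttyS0, a network heartbeat on eth1, ipfail
with two ping groups, and auto_failback off. A minimal ha.cf consistent
with that might look like the following sketch; the timings, ping
addresses, and log settings are illustrative guesses, not Shane's actual
values.

    # /etc/ha.d/ha.cf -- sketch consistent with the quoted log; timings
    # and ping addresses are guesses, not Shane's actual configuration
    logfacility   local0
    keepalive     2                 # seconds between heartbeats
    warntime      10
    deadtime      30                # declare the peer dead after 30s
    initdead      120               # allow extra time at boot
    udpport       694
    serial        /dev/ttyS0        # serial link seen in the log
    baud          19200
    bcast         eth1              # dedicated heartbeat NIC seen in the log
    auto_failback off               # as described in the post
    node          nodeA
    node          nodeB
    ping_group    group_one 192.168.0.1 192.168.0.2   # addresses are guesses
    ping_group    group_two 192.168.0.3 192.168.0.4   # addresses are guesses
    respawn       hacluster /usr/lib/heartbeat/ipfail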
> -----Original Message-----
> From: Shane Swartz [mailto:sswartz at ucar.edu]
> Sent: Friday, September 24, 2004 5:31 PM
> To: drbd-user at lists.linbit.com
> Subject: [DRBD-user] Failback problem with Heartbeat and DRBD
>
> I'm running DRBD 0.7.4 and Heartbeat 1.2.3 on a cluster running
> Debian 2.4. I see the problem when I simulate a failure on the
> primary node by rebooting it while data is being written to the file
> system controlled by DRBD. The failover node takes control of all
> resources during the reboot, as it should, but once the primary node
> comes back up, heartbeat on the failover node is killed, causing the
> primary node to take control again. I have auto_failback set to off.
>
> When nothing is being written to the DRBD-controlled file system on
> the primary node, the failover node keeps control of the resources
> once the primary comes back up. That is the behavior I would expect
> with auto_failback set to off.
>
> I have also run heartbeat standby on the primary while writing to the
> DRBD-controlled file system; the failover node takes control of the
> resources and keeps it when heartbeat on the primary node is
> restarted.
>
> I would like the failover node to keep control of all resources when
> the primary is rebooted, regardless of whether data was being written
> to a DRBD-controlled file system at the time. Are DRBD and heartbeat
> functioning as designed, or is there a way to make them behave the
> way I want?
>
> Below is a portion of the syslog from the failover node. The messages
> start at the point where the rebooted primary node is just coming
> back up.
>
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex
> nodeB ccm[25181]: info: ccm_joining_to_joined: cookie changed
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB heartbeat[25160]: WARN: node nodeA: is dead
> nodeB heartbeat[25160]: info: Dead node nodeA gave up resources.
> nodeB heartbeat[25160]: info: Link nodeA:/dev/ttyS0 dead.
> nodeB heartbeat[25160]: info: Link nodeA:eth1 dead.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA//dev/ttyS0 now has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status dead
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: Checking remote count of ping nodes.
> nodeB kernel: e1000: eth1 NIC Link is Down
> nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFConnection --> WFReportParams
> nodeB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
> nodeB kernel: drbd0: Connection established.
> nodeB kernel: drbd0: I am(P): 1:00000002:00000001:0000006e:00000002:10
> nodeB kernel: drbd0: Peer(S): 1:00000002:00000001:0000006d:00000002:01
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFReportParams --> WFBitMapS
> nodeB kernel: drbd0: Primary/Unknown --> Primary/Secondary
> nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFBitMapS --> SyncSource
> nodeB kernel: drbd0: Resync started as SyncSource (need to sync 20 KB [5 bits set]).
> nodeB kernel: drbd0: Resync done (total 2 sec; paused 0 sec; 8 K/sec)
> nodeB kernel: drbd0: drbd0_worker [25514]: cstate SyncSource --> Connected
> nodeB heartbeat[25160]: info: Heartbeat restart on node nodeA
> nodeB heartbeat[25160]: info: Link nodeA:eth1 up.
> nodeB heartbeat[25160]: info: Status update for node nodeA: status up
> nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status up
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status up
> nodeB heartbeat[25521]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25165]: ERROR: read_child send: RCs: 1 0
> nodeB heartbeat[25160]: info: Status update for node nodeA: status active
> nodeB heartbeat[25525]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
> nodeB ipfail[25182]: info: Status update: Node nodeA now has status active
> nodeB heartbeat[25160]: info: remote resource transition completed.
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB ipfail[25182]: debug: Other side is unstable.
> nodeB ipfail[25182]: debug: Other side is now stable.
> nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
> nodeB heartbeat[25160]: ERROR: Exiting HBREAD process 25165 killed by signal 11.
> nodeB heartbeat[25160]: ERROR: Core heartbeat process died! Restarting.
> nodeB heartbeat[25160]: info: Heartbeat shutdown in progress. (25160)
> nodeB ccm[25181]: debug: received message starting orig=nodeA
> nodeB heartbeat[25529]: info: Giving up all HA resources.
> nodeB ipfail[25182]: debug: Got join message from another ipfail client. (nodeA)
> nodeB ipfail[25182]: debug: Found ping node group_one!
> nodeB ipfail[25182]: debug: Found ping node group_two!
> nodeB ipfail[25182]: info: Asking other side for ping node count.
> nodeB ipfail[25182]: debug: Message [num_ping] sent.
> nodeB ipfail[25182]: info: No giveup timer to abort.
> nodeB ccm[25181]: debug: received message resource orig=nodeA
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB ccm[25181]: debug: received message resource orig=nodeA
> nodeB heartbeat: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
> nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
> nodeB ccm[25181]: debug: received message resource orig=nodeB
> nodeB heartbeat: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop done. RC=0
> nodeB heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r0 stop
> nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/drbddisk r0 stop
> nodeB kernel: drbd0: Primary/Secondary --> Secondary/Secondary
> nodeB heartbeat: debug: /etc/ha.d/resource.d/drbddisk r0 stop done. RC=0
> nodeB heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
> nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
> nodeB heartbeat: info: /sbin/route -n del -host 192.168.0.103
> nodeB heartbeat: info: /sbin/ifconfig eth0:0 down
> nodeB heartbeat: info: IP Address 192.168.0.103 released
> nodeB heartbeat: debug: /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop done. RC=0
> nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ccm process group 25181 with signal 15
> nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
> nodeB heartbeat[25529]: info: All HA resources relinquished.
> nodeB heartbeat[25160]: info: EOF from client pid 25181
> nodeB heartbeat[25160]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
> nodeB heartbeat[25160]: info: EOF from client pid 25182
> nodeB heartbeat[25160]: info: killing HBWRITE process 25168 with signal 15
> nodeB heartbeat[25160]: info: killing HBREAD process 25169 with signal 15
> nodeB heartbeat[25160]: info: killing HBWRITE process 25170 with signal 15
> nodeB heartbeat[25160]: info: killing HBREAD process 25171 with signal 15
> nodeB heartbeat[25160]: info: killing HBFIFO process 25163 with signal 15
> nodeB heartbeat[25160]: info: killing HBWRITE process 25164 with signal 15
> nodeB heartbeat[25160]: info: killing HBWRITE process 25166 with signal 15
> nodeB heartbeat[25160]: info: killing HBREAD process 25167 with signal 15
> nodeB heartbeat[25160]: info: Core process 25171 exited. 8 remaining
> nodeB heartbeat[25160]: info: Core process 25163 exited. 7 remaining
> nodeB heartbeat[25160]: info: Core process 25166 exited. 6 remaining
> nodeB heartbeat[25160]: info: Core process 25164 exited. 5 remaining
> nodeB heartbeat[25160]: info: Core process 25167 exited. 4 remaining
> nodeB heartbeat[25160]: info: Core process 25168 exited. 3 remaining
> nodeB heartbeat[25160]: info: Core process 25169 exited. 2 remaining
> nodeB heartbeat[25160]: info: Core process 25170 exited. 1 remaining
> nodeB heartbeat[25160]: info: Heartbeat shutdown complete.
> nodeB heartbeat[25160]: info: Heartbeat restart triggered.
> nodeB heartbeat[25160]: info: Restarting heartbeat.
> nodeB heartbeat[25160]: info: Performing heartbeat restart exec.
> nodeB kernel: drbd0: Secondary/Secondary --> Secondary/Primary
> nodeB heartbeat[25160]: info: **************************
> nodeB heartbeat[25160]: info: Configuration validated. Starting heartbeat 1.2.3
> nodeB heartbeat[26586]: info: heartbeat: version 1.2.3
> nodeB heartbeat[26586]: info: Heartbeat generation: 67
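The resource stop sequence in the quoted log (Filesystem, then drbddisk,
then IPaddr - heartbeat stops resources in the reverse of their start
order) implies a haresources line roughly like the sketch below. The
device, mount point, and IP details are taken from the log; that nodeA is
the preferred node is an assumption.

    # /etc/ha.d/haresources -- reconstructed from the stop sequence in
    # the log; resources start left-to-right and stop right-to-left
    nodeA IPaddr::192.168.0.103/24/eth0/192.168.0.255 \
          drbddisk::r0 \
          Filesystem::/dev/drbd0::/d1::ext3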
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
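For anyone reconstructing this setup from the archive: the DRBD side the
log refers to (resource r0 on device /dev/drbd0, DRBD 0.7.x) might look
roughly like the sketch below. Only r0 and /dev/drbd0 come from the
thread; the backing disks, replication addresses, and syncer rate are
illustrative guesses.

    # /etc/drbd.conf -- DRBD 0.7.x-style sketch; only the resource name
    # and device come from the thread, everything else is a guess
    resource r0 {
      protocol C;
      startup { degr-wfc-timeout 120; }
      disk    { on-io-error detach; }
      syncer  { rate 10M; }            # rate is a guess
      on nodeA {
        device    /dev/drbd0;          # device seen in the log
        disk      /dev/sda3;           # backing disk is a guess
        address   192.168.1.1:7788;    # replication address is a guess
        meta-disk internal;
      }
      on nodeB {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   192.168.1.2:7788;
        meta-disk internal;
      }
    }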