[DRBD-user] Failback problem with Heartbeat and DRBD

Shane Swartz sswartz at ucar.edu
Fri Sep 24 23:30:43 CEST 2004



I'm running DRBD 0.7.4 and Heartbeat 1.2.3 on a two-node Debian cluster 
with a 2.4 kernel.  I see the problem when I simulate a failure of the 
primary node by rebooting it while data is being written to the file 
system controlled by DRBD.  During the reboot the failover node takes 
control of all resources, as it should, but once the primary node comes 
back up, heartbeat on the failover node is killed, which causes the 
primary node to take control again.  I have auto_failback set to off.
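
To be concrete, "writing data while rebooting" means something along 
these lines on the primary node (the dd command and the test file name 
are only an illustration; /d1 is the mount point of the DRBD-backed file 
system, as shown in the log below):

  # keep a write in flight on the DRBD-backed mount point
  dd if=/dev/zero of=/d1/testfile bs=1M count=1000 &
  # simulate the failure while the write is still running
  reboot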

When there are no writes in progress to the DRBD-controlled file system 
on the primary node, the failover node keeps control of the resources 
after the primary comes back up.  That is the behavior I would expect 
with auto_failback set to off.
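
For reference, the relevant parts of my configuration look roughly like 
the excerpt below.  The node names, interfaces, ping groups and resources 
are taken from the log further down; the respawn user and the ping group 
members are placeholders, and the resource order in haresources is 
reconstructed from the stop messages in the log, so treat this as a 
sketch rather than an exact copy of my files.

  # /etc/ha.d/ha.cf (excerpt)
  serial /dev/ttyS0
  bcast eth1
  auto_failback off
  node nodeA
  node nodeB
  respawn hacluster /usr/lib/heartbeat/ipfail   # respawn user is a placeholder
  ping_group group_one ...                      # members omitted
  ping_group group_two ...

  # /etc/ha.d/haresources (single line)
  nodeA IPaddr::192.168.0.103/24/eth0/192.168.0.255 drbddisk::r0 Filesystem::/dev/drbd0::/d1::ext3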

I have also executed a heartbeat standby on the primary while writing to 
the DRBD-controlled file system; in that case the failover node takes 
over the resources and keeps them when heartbeat on the primary node is 
restarted.
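
(By "heartbeat standby" I mean the manual failover triggered with the 
tool shipped with heartbeat, i.e. something like the following, run on 
the primary node:

  /usr/lib/heartbeat/hb_standby
)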

I would like to know whether I can get the failover node to keep control 
of all resources when the primary is rebooted, regardless of whether data 
was being written to a DRBD-controlled file system at the time.  Are DRBD 
and heartbeat functioning as designed, or is there a way to make them 
behave the way I want?

Below is a portion of the syslog from the failover node.  The messages 
start at the point where the rebooted primary node is just coming back up.

nodeB kernel: e1000: eth1 NIC Link is Down
nodeB kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex
nodeB ccm[25181]: info: ccm_joining_to_joined: cookie changed
nodeB kernel: e1000: eth1 NIC Link is Down
nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
nodeB heartbeat[25160]: WARN: node nodeA: is dead
nodeB heartbeat[25160]: info: Dead node nodeA gave up resources.
nodeB heartbeat[25160]: info: Link nodeA:/dev/ttyS0 dead.
nodeB heartbeat[25160]: info: Link nodeA:eth1 dead.
nodeB ipfail[25182]: info: Link Status update: Link nodeA//dev/ttyS0 now has status dead
nodeB ipfail[25182]: debug: Found ping node group_one!
nodeB ipfail[25182]: debug: Found ping node group_two!
nodeB ipfail[25182]: info: Asking other side for ping node count.
nodeB ipfail[25182]: debug: Message [num_ping] sent.
nodeB ipfail[25182]: info: Checking remote count of ping nodes.
nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status dead
nodeB ipfail[25182]: debug: Found ping node group_one!
nodeB ipfail[25182]: debug: Found ping node group_two!
nodeB ipfail[25182]: info: Asking other side for ping node count.
nodeB ipfail[25182]: debug: Message [num_ping] sent.
nodeB ipfail[25182]: info: Checking remote count of ping nodes.
nodeB kernel: e1000: eth1 NIC Link is Down
nodeB kernel: e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex
nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFConnection --> WFReportParams
nodeB kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
nodeB kernel: drbd0: Connection established.
nodeB kernel: drbd0: I am(P): 1:00000002:00000001:0000006e:00000002:10
nodeB kernel: drbd0: Peer(S): 1:00000002:00000001:0000006d:00000002:01
nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFReportParams --> WFBitMapS
nodeB kernel: drbd0: Primary/Unknown --> Primary/Secondary
nodeB kernel: drbd0: drbd0_receiver [25086]: cstate WFBitMapS --> SyncSource
nodeB kernel: drbd0: Resync started as SyncSource (need to sync 20 KB [5 bits set]).
nodeB kernel: drbd0: Resync done (total 2 sec; paused 0 sec; 8 K/sec)
nodeB kernel: drbd0: drbd0_worker [25514]: cstate SyncSource --> Connected
nodeB heartbeat[25160]: info: Heartbeat restart on node nodeA
nodeB heartbeat[25160]: info: Link nodeA:eth1 up.
nodeB heartbeat[25160]: info: Status update for node nodeA: status up
nodeB ipfail[25182]: info: Link Status update: Link nodeA/eth1 now has status up
nodeB ipfail[25182]: info: Status update: Node nodeA now has status up
nodeB heartbeat[25521]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
nodeB heartbeat[25165]: ERROR: read_child send: RCs: 1 0
nodeB heartbeat[25160]: info: Status update for node nodeA: status active
nodeB heartbeat[25525]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
nodeB ipfail[25182]: info: Status update: Node nodeA now has status active
nodeB heartbeat[25160]: info: remote resource transition completed.
nodeB ccm[25181]: debug: received message resource orig=nodeB
nodeB ipfail[25182]: debug: Other side is unstable.
nodeB ipfail[25182]: debug: Other side is now stable.
nodeB heartbeat: info: Running /etc/ha.d/rc.d/status status
nodeB heartbeat[25160]: ERROR: Exiting HBREAD process 25165 killed by signal 11.
nodeB heartbeat[25160]: ERROR: Core heartbeat process died! Restarting.
nodeB heartbeat[25160]: info: Heartbeat shutdown in progress. (25160)
nodeB ccm[25181]: debug: received message starting orig=nodeA
nodeB heartbeat[25529]: info: Giving up all HA resources.
nodeB ipfail[25182]: debug: Got join message from another ipfail client. (nodeA)
nodeB ipfail[25182]: debug: Found ping node group_one!
nodeB ipfail[25182]: debug: Found ping node group_two!
nodeB ipfail[25182]: info: Asking other side for ping node count.
nodeB ipfail[25182]: debug: Message [num_ping] sent.
nodeB ipfail[25182]: info: No giveup timer to abort.
nodeB ccm[25181]: debug: received message resource orig=nodeA
nodeB ccm[25181]: debug: received message resource orig=nodeB
nodeB ccm[25181]: debug: received message resource orig=nodeA
nodeB heartbeat: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop
nodeB ccm[25181]: debug: received message resource orig=nodeB
nodeB heartbeat: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /d1 ext3 stop done. RC=0
nodeB heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r0 stop
nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/drbddisk r0 stop
nodeB kernel: drbd0: Primary/Secondary --> Secondary/Secondary
nodeB heartbeat: debug: /etc/ha.d/resource.d/drbddisk r0 stop done. RC=0
nodeB heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
nodeB heartbeat: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop
nodeB heartbeat: info: /sbin/route -n del -host 192.168.0.103
nodeB heartbeat: info: /sbin/ifconfig eth0:0 down
nodeB heartbeat: info: IP Address 192.168.0.103 released
nodeB heartbeat: debug: /etc/ha.d/resource.d/IPaddr 192.168.0.103/24/eth0/192.168.0.255 stop done. RC=0
nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ccm process group 25181 with signal 15
nodeB heartbeat[25529]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
nodeB heartbeat[25529]: info: All HA resources relinquished.
nodeB heartbeat[25160]: info: EOF from client pid 25181
nodeB heartbeat[25160]: info: killing /usr/lib/heartbeat/ipfail process group 25182 with signal 15
nodeB heartbeat[25160]: info: EOF from client pid 25182
nodeB heartbeat[25160]: info: killing HBWRITE process 25168 with signal 15
nodeB heartbeat[25160]: info: killing HBREAD process 25169 with signal 15
nodeB heartbeat[25160]: info: killing HBWRITE process 25170 with signal 15
nodeB heartbeat[25160]: info: killing HBREAD process 25171 with signal 15
nodeB heartbeat[25160]: info: killing HBFIFO process 25163 with signal 15
nodeB heartbeat[25160]: info: killing HBWRITE process 25164 with signal 15
nodeB heartbeat[25160]: info: killing HBWRITE process 25166 with signal 15
nodeB heartbeat[25160]: info: killing HBREAD process 25167 with signal 15
nodeB heartbeat[25160]: info: Core process 25171 exited. 8 remaining
nodeB heartbeat[25160]: info: Core process 25163 exited. 7 remaining
nodeB heartbeat[25160]: info: Core process 25166 exited. 6 remaining
nodeB heartbeat[25160]: info: Core process 25164 exited. 5 remaining
nodeB heartbeat[25160]: info: Core process 25167 exited. 4 remaining
nodeB heartbeat[25160]: info: Core process 25168 exited. 3 remaining
nodeB heartbeat[25160]: info: Core process 25169 exited. 2 remaining
nodeB heartbeat[25160]: info: Core process 25170 exited. 1 remaining
nodeB heartbeat[25160]: info: Heartbeat shutdown complete.
nodeB heartbeat[25160]: info: Heartbeat restart triggered.
nodeB heartbeat[25160]: info: Restarting heartbeat.
nodeB heartbeat[25160]: info: Performing heartbeat restart exec.
nodeB kernel: drbd0: Secondary/Secondary --> Secondary/Primary
nodeB heartbeat[25160]: info: **************************
nodeB heartbeat[25160]: info: Configuration validated. Starting heartbeat 1.2.3
nodeB heartbeat[26586]: info: heartbeat: version 1.2.3
nodeB heartbeat[26586]: info: Heartbeat generation: 67