[DRBD-user] receiver & asender dying after a stonith recovery

Dave Dykstra dwdha at drdykstra.us
Tue Mar 22 22:55:13 CET 2005


I've been working on getting heartbeat's stonith to function properly on
my cluster, which uses drbd.  I've got it to the point where I can unplug
both network connections on the live server (one is a direct connection
between the two servers, which drbd uses, and the other is the main company
network) and stonith will temporarily cut power to the live server.
I always plug the networks back in as soon as the power comes back on.
The problem is that almost every time that server comes back up, drbd on
the new live server does not re-establish communication, and its receiver
and asender threads are not running.  If I then manually run
'drbdadm adjust all' on the new live server, everything comes back up.
Below is /var/adm/messages from one such case.  Time 15:19:53 is when I
ran 'drbdadm adjust'.  Can anybody explain what's going on?  Am I supposed
to have heartbeat do something more so that 'drbdadm adjust' gets run?
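
The manual recovery step described above could be scripted along the
following lines.  This is only a sketch, not part of the setup in this post:
it assumes that /proc/drbd reports each device on a line of the form
" 0: cs:StandAlone st:Primary/Unknown ...", and the function names are
hypothetical.  It works around the symptom and says nothing about why the
connection was dropped.

```python
import re
import subprocess

# Assumption: each device line in /proc/drbd looks like
#   " 0: cs:StandAlone st:Primary/Unknown ld:Consistent"
CS_RE = re.compile(r"^[ \t]*(\d+):[ \t]+cs:(\S+)", re.MULTILINE)


def standalone_minors(proc_text):
    """Return the minor numbers of devices whose connection state is StandAlone."""
    return [int(minor) for minor, cs in CS_RE.findall(proc_text)
            if cs == "StandAlone"]


def adjust_if_standalone(proc_path="/proc/drbd"):
    """Run 'drbdadm adjust all' if any device is StandAlone.

    Returns the list of StandAlone minors that triggered the call.
    """
    with open(proc_path) as f:
        minors = standalone_minors(f.read())
    if minors:
        # Same command that was run by hand in the post.
        subprocess.call(["drbdadm", "adjust", "all"])
    return minors
```

Something like adjust_if_standalone() could be called periodically or from
a heartbeat hook, but re-connecting blindly may not be appropriate if drbd
dropped the connection on purpose.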

I'm running Debian's drbd 0.7.10-2 with kernel 2.6.10-ac9.  My drbd.conf is
    resource home {
	protocol C;
	syncer {
	    rate        50M;
	}
	on swfs1 {
	    device  /dev/drbd0;
	    disk    /dev/hda8;
	    address 192.168.1.1:7791;
	    meta-disk internal;
	}
	on swfs2 {
	    device  /dev/drbd0;
	    disk    /dev/hdb8;
	    address 192.168.1.2:7791;
	    meta-disk internal;
	}
    }
and my ha.cf is
    keepalive 1
    warntime 2 
    deadtime 10
    node swfs1 swfs2
    ucast eth0 172.18.30.26 172.18.30.27
    bcast eth1 
    ping 172.18.1.1 
    apiauth ipfail uid=hacluster
    respawn hacluster /usr/lib/heartbeat/ipfail
    auto_failback off
    stonith_host swfs2 apcsmart /dev/ttyUSB0 swfs1
and my haresources is
    swfs1  \
        drbddisk::home Filesystem::/dev/drbd0::/mnt/home::ext3 \
        ypserv:: nfs-kernel-server samba \
        172.18.30.28 192.168.1.3 172.18.1.3 bind9 \
        Restart::ssh::up StatusChange::

- Dave Dykstra



Mar 22 15:18:38 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate WFConnection --> WFReportParams
Mar 22 15:18:49 swfs2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 22 15:18:49 swfs2 kernel: drbd0: Connection established.
Mar 22 15:18:49 swfs2 kernel: drbd0: I am(P): 1:00000003:00000003:0000012e:00000027:10
Mar 22 15:18:49 swfs2 kernel: drbd0: Peer(S): 1:00000003:00000003:0000012f:00000026:10
Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate WFReportParams --> StandAlone
Mar 22 15:18:49 swfs2 kernel: drbd0: worker terminated
Mar 22 15:18:49 swfs2 kernel: drbd0: asender terminated
Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate StandAlone --> StandAlone
Mar 22 15:18:49 swfs2 kernel: drbd0: Connection lost.
Mar 22 15:18:49 swfs2 kernel: drbd0: receiver terminated
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Heartbeat restart on node swfs1
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Link swfs1:eth1 up.
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Status update for node swfs1: status up
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Link Status update: Link swfs1/eth1 now has status up
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Status update: Node swfs1 now has status up
Mar 22 15:18:53 swfs2 heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Status update for node swfs1: status active
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Status update: Node swfs1 now has status active
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: remote resource transition completed.
Mar 22 15:18:53 swfs2 heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Asking other side for ping node count.
Mar 22 15:18:53 swfs2 ipfail[19108]: info: No giveup timer to abort.
Mar 22 15:19:53 swfs2 kernel: drbd0: drbdsetup [26362]: cstate StandAlone --> Unconnected
Mar 22 15:19:53 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate Unconnected --> WFConnection
Mar 22 15:19:53 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate WFConnection --> WFReportParams
Mar 22 15:19:53 swfs2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 22 15:19:53 swfs2 kernel: drbd0: Connection established.
Mar 22 15:19:53 swfs2 kernel: drbd0: I am(P): 1:00000003:00000003:0000012f:00000027:10
Mar 22 15:19:53 swfs2 kernel: drbd0: Peer(S): 1:00000003:00000003:0000012f:00000026:00
Mar 22 15:19:53 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate WFReportParams --> WFBitMapS
Mar 22 15:19:53 swfs2 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Mar 22 15:19:54 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate WFBitMapS --> SyncSource
Mar 22 15:19:54 swfs2 kernel: drbd0: Resync started as SyncSource (need to sync 520200 KB [130050 bits set]).
Mar 22 15:20:15 swfs2 kernel: drbd0: Resync done (total 21 sec; paused 0 sec; 24768 K/sec)
Mar 22 15:20:15 swfs2 kernel: drbd0: drbd0_worker [26336]: cstate SyncSource --> Connected
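
To make the sequence of connection-state changes in a log like the one above
easier to follow, the "cstate A --> B" lines can be pulled out with a short
script.  This is a minimal sketch that assumes the syslog line format shown
in this post; the function names are hypothetical.

```python
import re

# Matches kernel lines such as:
#   "... kernel: drbd0: drbd0_receiver [21346]: cstate WFReportParams --> StandAlone"
CSTATE_RE = re.compile(r"(drbd\d+): (\S+) \[(\d+)\]: cstate (\S+) --> (\S+)")


def cstate_transitions(lines):
    """Return (timestamp, device, thread, old_state, new_state) per cstate line."""
    found = []
    for line in lines:
        m = CSTATE_RE.search(line)
        if m:
            dev, thread, _pid, old, new = m.groups()
            # The first 15 characters are the syslog timestamp, e.g. "Mar 22 15:18:49".
            found.append((line[:15], dev, thread, old, new))
    return found


def dropped_to_standalone(transitions):
    """Return only the transitions that enter StandAlone from another state."""
    return [t for t in transitions
            if t[4] == "StandAlone" and t[3] != "StandAlone"]
```

Run over the log above, this would single out the 15:18:49 transition from
WFReportParams to StandAlone, which is the point after which the receiver
and asender are reported as terminated.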



