I've been working on getting heartbeat's stonith to function properly on my cluster, which uses drbd. I've got it to the point where I can unplug the two network connections on the live server (one is a direct connection between the two servers, which drbd uses, and the other is the main company network) and stonith will temporarily remove power from the live server. I always plug the networks in again as soon as the power comes back up.

The problem I'm having is that almost every time that server comes back up, drbd on the new live server does not re-establish communication, and the receiver and asender are not running. If I then manually run 'drbdadm adjust all' on the new live server, everything comes back up.

Below is /var/adm/messages from one of the cases. Time 15:19:53 is when I ran 'drbdadm adjust'. Can anybody explain what's going on? Am I supposed to have heartbeat do something more so that 'drbdadm adjust' gets run? I'm running debian drbd 0.7.10-2 with kernel 2.6.10-ac9.
My drbd.conf is

    resource home {
      protocol C;
      syncer {
        rate 50M;
      }
      on swfs1 {
        device    /dev/drbd0;
        disk      /dev/hda8;
        address   192.168.1.1:7791;
        meta-disk internal;
      }
      on swfs2 {
        device    /dev/drbd0;
        disk      /dev/hdb8;
        address   192.168.1.2:7791;
        meta-disk internal;
      }
    }

and my ha.cf is

    keepalive 1
    warntime 2
    deadtime 10
    node swfs1 swfs2
    ucast eth0 172.18.30.26 172.18.30.27
    bcast eth1
    ping 172.18.1.1
    apiauth ipfail uid=hacluster
    respawn hacluster /usr/lib/heartbeat/ipfail
    auto_failback off
    stonith_host swfs2 apcsmart /dev/ttyUSB0 swfs1

and my haresources is

    swfs1 \
        drbddisk::home Filesystem::/dev/drbd0::/mnt/home::ext3 \
        ypserv:: nfs-kernel-server samba \
        172.18.30.28 192.168.1.3 172.18.1.3 bind9 \
        Restart::ssh::up StatusChange::

- Dave Dykstra

Mar 22 15:18:38 swfs2 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate WFConnection --> WFReportParams
Mar 22 15:18:49 swfs2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 22 15:18:49 swfs2 kernel: drbd0: Connection established.
Mar 22 15:18:49 swfs2 kernel: drbd0: I am(P): 1:00000003:00000003:0000012e:00000027:10
Mar 22 15:18:49 swfs2 kernel: drbd0: Peer(S): 1:00000003:00000003:0000012f:00000026:10
Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate WFReportParams --> StandAlone
Mar 22 15:18:49 swfs2 kernel: drbd0: worker terminated
Mar 22 15:18:49 swfs2 kernel: drbd0: asender terminated
Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate StandAlone --> StandAlone
Mar 22 15:18:49 swfs2 kernel: drbd0: Connection lost.
Mar 22 15:18:49 swfs2 kernel: drbd0: receiver terminated
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Heartbeat restart on node swfs1
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Link swfs1:eth1 up.
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Status update for node swfs1: status up
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Link Status update: Link swfs1/eth1 now has status up
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Status update: Node swfs1 now has status up
Mar 22 15:18:53 swfs2 heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: Status update for node swfs1: status active
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Status update: Node swfs1 now has status active
Mar 22 15:18:53 swfs2 heartbeat[19097]: info: remote resource transition completed.
Mar 22 15:18:53 swfs2 heartbeat: info: Running /etc/ha.d/rc.d/status status
Mar 22 15:18:53 swfs2 ipfail[19108]: info: Asking other side for ping node count.
Mar 22 15:18:53 swfs2 ipfail[19108]: info: No giveup timer to abort.
Mar 22 15:19:53 swfs2 kernel: drbd0: drbdsetup [26362]: cstate StandAlone --> Unconnected
Mar 22 15:19:53 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate Unconnected --> WFConnection
Mar 22 15:19:53 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate WFConnection --> WFReportParams
Mar 22 15:19:53 swfs2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Mar 22 15:19:53 swfs2 kernel: drbd0: Connection established.
Mar 22 15:19:53 swfs2 kernel: drbd0: I am(P): 1:00000003:00000003:0000012f:00000027:10
Mar 22 15:19:53 swfs2 kernel: drbd0: Peer(S): 1:00000003:00000003:0000012f:00000026:00
Mar 22 15:19:53 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate WFReportParams --> WFBitMapS
Mar 22 15:19:53 swfs2 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Mar 22 15:19:54 swfs2 kernel: drbd0: drbd0_receiver [26363]: cstate WFBitMapS --> SyncSource
Mar 22 15:19:54 swfs2 kernel: drbd0: Resync started as SyncSource (need to sync 520200 KB [130050 bits set]).
Mar 22 15:20:15 swfs2 kernel: drbd0: Resync done (total 21 sec; paused 0 sec; 24768 K/sec)
Mar 22 15:20:15 swfs2 kernel: drbd0: drbd0_worker [26336]: cstate SyncSource --> Connected
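
For anyone trying to spot the same pattern in their own logs, here is a rough Python sketch that pulls the drbd "cstate" transitions out of syslog lines and reports each time a device drops into StandAlone. The regular expression is only an assumption based on the message format shown above (drbd 0.7.10), and the function names are made up for illustration; it does not explain why the drop happens, it only locates it.

```python
import re

# Assumed format, taken from the log excerpt above, e.g.
# "Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate A --> B"
CSTATE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<dev>drbd\d+):.*\bcstate\s+(?P<old>\w+)\s+-->\s+(?P<new>\w+)"
)

def cstate_transitions(lines):
    """Yield (timestamp, device, old_state, new_state) for each cstate line."""
    for line in lines:
        m = CSTATE_RE.search(line.strip())
        if m:
            yield m.group("stamp"), m.group("dev"), m.group("old"), m.group("new")

def standalone_drops(lines):
    """Return transitions where a device entered StandAlone from another state."""
    return [t for t in cstate_transitions(lines)
            if t[3] == "StandAlone" and t[2] != "StandAlone"]

# A few lines copied from the excerpt above.
SAMPLE = [
    "Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate WFConnection --> WFReportParams",
    "Mar 22 15:18:49 swfs2 kernel: drbd0: Connection established.",
    "Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate WFReportParams --> StandAlone",
    "Mar 22 15:18:49 swfs2 kernel: drbd0: drbd0_receiver [21346]: cstate StandAlone --> StandAlone",
    "Mar 22 15:19:53 swfs2 kernel: drbd0: drbdsetup [26362]: cstate StandAlone --> Unconnected",
]

if __name__ == "__main__":
    for stamp, dev, old, new in standalone_drops(SAMPLE):
        print(f"{stamp}: {dev} went {old} --> {new}")
```

Feeding it the full messages file (for example with `open(path)` as the `lines` argument) should list only the moments where the connection was dropped, which makes it easier to line them up against the heartbeat entries.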