Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
2009/6/3 Ivars Strazdiņš <ivars.strazdins at gmail.com> > Heiko, > any hint from you about actual crash itself (stack trace) as well as OS > and software versions involved (xen/heartbeat/drbd etc.)? > BR, > Ivars > Hello, drbd is version 8.0.13 heartbeat is 2.1.3 xen is 3.0.3 this is the ha logfile from the 2nd server, but this one did also a reboot. heartbeat[14405]: 2009/06/02_13:59:42 WARN: node xen-a1.fra1.mailcluster: is dead heartbeat[14405]: 2009/06/02_13:59:42 WARN: No STONITH device configured. heartbeat[14405]: 2009/06/02_13:59:42 WARN: Shared disks are not protected. heartbeat[14405]: 2009/06/02_13:59:42 info: Resources being acquired from xen-a1.fra1.mailcluster. heartbeat[14405]: 2009/06/02_13:59:42 info: Link xen-a1.fra1.mailcluster:eth0 dead. harc[22620]: 2009/06/02_13:59:42 info: Running /etc/ha.d/rc.d/status status mach_down[22653]: 2009/06/02_13:59:42 info: Taking over resource group drbddisk::drbd_backend heartbeat[22621]: 2009/06/02_13:59:42 info: Local Resource acquisition completed. ResourceManager[22699]: 2009/06/02_13:59:43 info: Acquiring resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 ResourceManager[22699]: 2009/06/02_13:59:43 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend start ResourceManager[22699]: 2009/06/02_13:59:43 info: Running /etc/ha.d/resource.d/xen backend-A1 start mach_down[22653]: 2009/06/02_13:59:45 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired mach_down[22653]: 2009/06/02_13:59:45 info: mach_down takeover complete for node xen-a1.fra1.mailcluster. heartbeat[14405]: 2009/06/02_13:59:45 info: mach_down takeover complete. heartbeat[14405]: 2009/06/02_14:00:35 info: Heartbeat restart on node xen-a1.fra1.mailcluster heartbeat[14405]: 2009/06/02_14:00:35 info: Link xen-a1.fra1.mailcluster:eth0 up. heartbeat[14405]: 2009/06/02_14:00:35 info: Status update for node xen-a1.fra1.mailcluster: status init heartbeat[14405]: 2009/06/02_14:00:35 info: Status update for node xen-a1.fra1.mailcluster: status up harc[23036]: 2009/06/02_14:00:35 info: Running /etc/ha.d/rc.d/status status harc[23052]: 2009/06/02_14:00:35 info: Running /etc/ha.d/rc.d/status status heartbeat[14405]: 2009/06/02_14:00:37 info: Status update for node xen-a1.fra1.mailcluster: status active harc[23068]: 2009/06/02_14:00:37 info: Running /etc/ha.d/rc.d/status status heartbeat[14405]: 2009/06/02_14:00:38 info: remote resource transition completed. heartbeat[14405]: 2009/06/02_14:00:38 info: xen-b1.fra1.mailcluster wants to go standby [foreign] heartbeat[14405]: 2009/06/02_14:00:38 info: standby: xen-a1.fra1.mailcluster can take our foreign resources heartbeat[23084]: 2009/06/02_14:00:38 info: give up foreign HA resources (standby). ResourceManager[23097]: 2009/06/02_14:00:38 info: Releasing resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 ResourceManager[23097]: 2009/06/02_14:00:38 info: Running /etc/ha.d/resource.d/xen backend-A1 stop heartbeat[3024]: 2009/06/02_14:01:13 info: Version 2 support: false heartbeat[3024]: 2009/06/02_14:01:13 WARN: Logging daemon is disabled --enabling logging daemon is recommended heartbeat[3024]: 2009/06/02_14:01:13 info: ************************** heartbeat[3024]: 2009/06/02_14:01:13 info: Configuration validated. Starting heartbeat 2.1.3 heartbeat[3025]: 2009/06/02_14:01:13 info: heartbeat: version 2.1.3 heartbeat[3025]: 2009/06/02_14:01:13 info: Heartbeat generation: 1202824473 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound send socket to device: eth0 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: started on port 694 interface eth0 to 172.20.1.1 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound send socket to device: eth0 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: started on port 694 interface eth0 to 172.20.2.1 heartbeat[3025]: 2009/06/02_14:01:13 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[3025]: 2009/06/02_14:01:13 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[3025]: 2009/06/02_14:01:13 info: G_main_add_SignalHandler: Added signal handler for signal 17 heartbeat[3025]: 2009/06/02_14:01:13 info: Local status now set to: 'up' heartbeat[3064]: 2009/06/02_14:01:19 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:19 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:19 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:19 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:21 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:21 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:21 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:21 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:23 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:23 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:23 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:23 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:25 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:25 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:25 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:25 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:27 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:27 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:27 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:27 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:29 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:29 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:29 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:29 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:31 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:31 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:31 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:31 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:33 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:33 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:33 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:33 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:35 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:35 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:35 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:35 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:37 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3064]: 2009/06/02_14:01:37 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3064]: 2009/06/02_14:01:37 WARN: Temporarily Suppressing write error messages heartbeat[3064]: 2009/06/02_14:01:37 WARN: Is a cable unplugged on ucast eth0? heartbeat[3062]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:37 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3062]: 2009/06/02_14:01:37 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3062]: 2009/06/02_14:01:37 WARN: Temporarily Suppressing write error messages heartbeat[3062]: 2009/06/02_14:01:37 WARN: Is a cable unplugged on ucast eth0? heartbeat[3025]: 2009/06/02_14:03:14 WARN: node xen-a1.fra1.mailcluster: is dead heartbeat[3025]: 2009/06/02_14:03:14 info: Comm_now_up(): updating status to active heartbeat[3025]: 2009/06/02_14:03:14 info: Local status now set to: 'active' heartbeat[3025]: 2009/06/02_14:03:14 WARN: No STONITH device configured. heartbeat[3025]: 2009/06/02_14:03:14 WARN: Shared disks are not protected. heartbeat[3025]: 2009/06/02_14:03:14 info: Resources being acquired from xen-a1.fra1.mailcluster. harc[5874]: 2009/06/02_14:03:14 info: Running /etc/ha.d/rc.d/status status heartbeat[5875]: 2009/06/02_14:03:14 info: Local Resource acquisition completed. mach_down[5915]: 2009/06/02_14:03:14 info: Taking over resource group drbddisk::drbd_backend ResourceManager[5958]: 2009/06/02_14:03:14 info: Acquiring resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 ResourceManager[5958]: 2009/06/02_14:03:14 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend start ResourceManager[5958]: 2009/06/02_14:03:14 info: Running /etc/ha.d/resource.d/xen backend-A1 start mach_down[5915]: 2009/06/02_14:03:16 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired mach_down[5915]: 2009/06/02_14:03:16 info: mach_down takeover complete for node xen-a1.fra1.mailcluster. heartbeat[3025]: 2009/06/02_14:03:16 info: mach_down takeover complete. heartbeat[3025]: 2009/06/02_14:03:16 info: Initial resource acquisition complete (mach_down) harc[6221]: 2009/06/02_14:03:16 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp ip-request-resp[6221]: 2009/06/02_14:03:16 received ip-request-resp drbddisk::drbd_backend_2 OK yes ResourceManager[6242]: 2009/06/02_14:03:16 info: Acquiring resource group: xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1 ResourceManager[6242]: 2009/06/02_14:03:16 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend_2 start ResourceManager[6242]: 2009/06/02_14:03:16 info: Running /etc/ha.d/resource.d/xen backend-B1 start heartbeat[3025]: 2009/06/02_14:03:24 info: Local Resource acquisition completed. (none) heartbeat[3025]: 2009/06/02_14:03:24 info: local resource transition completed. heartbeat[3062]: 2009/06/02_14:32:27 ERROR: write_child: Exiting due to persistent errors: No such device heartbeat[3064]: 2009/06/02_14:32:27 ERROR: write_child: Exiting due to persistent errors: No such device heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBWRITE process 3062 exited with return code 1. heartbeat[3025]: 2009/06/02_14:32:27 ERROR: HBWRITE process died. Beginning communications restart process for comm channel 0. heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBREAD process 3063 killed by signal 9 [SIGKILL - Kill, unblockable]. heartbeat[3025]: 2009/06/02_14:32:27 ERROR: Both comm processes for channel 0 have died. Restarting. heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound send socket to device: eth0 heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: started on port 694 interface eth0 to 172.20.1.1 heartbeat[3025]: 2009/06/02_14:32:27 info: Communications restart succeeded. heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBWRITE process 3064 exited with return code 1. heartbeat[3025]: 2009/06/02_14:32:27 ERROR: HBWRITE process died. Beginning communications restart process for comm channel 1. heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBREAD process 3065 killed by signal 9 [SIGKILL - Kill, unblockable]. heartbeat[3025]: 2009/06/02_14:32:27 ERROR: Both comm processes for channel 1 have died. Restarting. heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound send socket to device: eth0 heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: started on port 694 interface eth0 to 172.20.2.1 heartbeat[3025]: 2009/06/02_14:32:27 info: Communications restart succeeded. heartbeat[3025]: 2009/06/02_14:32:27 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 100 ms (> 30 ms) (GSource: 0x14358388) heartbeat[3025]: 2009/06/02_14:32:28 info: Link xen-a1.fra1.mailcluster:eth0 up. heartbeat[3025]: 2009/06/02_14:32:28 info: Status update for node xen-a1.fra1.mailcluster: status active harc[13992]: 2009/06/02_14:32:28 info: Running /etc/ha.d/rc.d/status status heartbeat[3025]: 2009/06/02_14:32:31 info: all clients are now paused heartbeat[3025]: 2009/06/02_14:32:31 info: all clients are now resumed shortly before the /var/log/messages show me: Jun 2 13:58:48 xen-B1 kernel: drbd1: PingAck did not arrive in time. Jun 2 13:58:48 xen-B1 kernel: drbd1: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Jun 2 13:58:48 xen-B1 kernel: drbd1: asender terminated Jun 2 13:58:48 xen-B1 kernel: drbd1: Terminating asender thread Jun 2 13:58:48 xen-B1 kernel: drbd1: short read expecting header on sock: r=-512 Jun 2 13:58:48 xen-B1 kernel: drbd1: Creating new current UUID Jun 2 13:58:48 xen-B1 kernel: drbd1: Writing meta data super block now. Jun 2 13:58:48 xen-B1 kernel: drbd1: tl_clear() Jun 2 13:58:48 xen-B1 kernel: drbd1: Connection closed Jun 2 13:58:48 xen-B1 kernel: drbd1: conn( NetworkFailure -> Unconnected ) Jun 2 13:58:48 xen-B1 kernel: drbd1: receiver terminated Jun 2 13:58:48 xen-B1 kernel: drbd1: Restarting receiver thread Jun 2 13:58:48 xen-B1 kernel: drbd1: receiver (re)started Jun 2 13:58:48 xen-B1 kernel: drbd1: conn( Unconnected -> WFConnection ) Jun 2 13:58:54 xen-B1 kernel: drbd0: PingAck did not arrive in time. Jun 2 13:58:54 xen-B1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Jun 2 13:58:54 xen-B1 kernel: drbd0: asender terminated Jun 2 13:58:54 xen-B1 kernel: drbd0: Terminating asender thread Jun 2 13:58:54 xen-B1 kernel: drbd0: short read receiving data: read 2984 expected 4096 Jun 2 13:58:54 xen-B1 kernel: drbd0: error receiving Data, l: 20504! Jun 2 13:58:54 xen-B1 kernel: drbd0: Writing meta data super block now. Jun 2 13:58:54 xen-B1 kernel: drbd0: tl_clear() Jun 2 13:58:54 xen-B1 kernel: drbd0: Connection closed Jun 2 13:58:54 xen-B1 kernel: drbd0: conn( NetworkFailure -> Unconnected ) Jun 2 13:58:54 xen-B1 kernel: drbd0: receiver terminated Jun 2 13:58:54 xen-B1 kernel: drbd0: Restarting receiver thread Jun 2 13:58:54 xen-B1 kernel: drbd0: receiver (re)started Jun 2 13:58:54 xen-B1 kernel: drbd0: conn( Unconnected -> WFConnection ) Jun 2 13:58:58 xen-B1 kernel: drbd1: connect failed, err = -115 Jun 2 13:59:04 xen-B1 kernel: drbd0: connect failed, err = -115 Jun 2 13:59:08 xen-B1 kernel: drbd1: connect failed, err = -115 Jun 2 13:59:14 xen-B1 kernel: drbd0: connect failed, err = -115 Jun 2 13:59:18 xen-B1 kernel: drbd1: connect failed, err = -115 Jun 2 13:59:24 xen-B1 kernel: drbd0: connect failed, err = -115 Jun 2 13:59:38 xen-B1 kernel: drbd1: connect failed, err = -115 Jun 2 13:59:42 xen-B1 heartbeat: [14405]: WARN: node xen-a1.fra1.mailcluster: is dead and this from the 1st server: heartbeat[13948]: 2009/06/02_12:31:54 WARN: 1 lost packet(s) for [xen-b1.fra1.mailcluster] [600686:600688] heartbeat[13948]: 2009/06/02_12:31:54 info: No pkts missing from xen-b1.fra1.mailcluster! heartbeat[3032]: 2009/06/02_14:02:59 info: Version 2 support: false heartbeat[3032]: 2009/06/02_14:02:59 WARN: Logging daemon is disabled --enabling logging daemon is recommended heartbeat[3032]: 2009/06/02_14:02:59 info: ************************** heartbeat[3032]: 2009/06/02_14:02:59 info: Configuration validated. Starting heartbeat 2.1.3 heartbeat[3033]: 2009/06/02_14:02:59 info: heartbeat: version 2.1.3 heartbeat[3033]: 2009/06/02_14:02:59 info: Heartbeat generation: 1202824052 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound send socket to device: eth0 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: started on port 694 interface eth0 to 172.20.1.1 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound send socket to device: eth0 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: started on port 694 interface eth0 to 172.20.2.1 heartbeat[3033]: 2009/06/02_14:02:59 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[3033]: 2009/06/02_14:02:59 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[3033]: 2009/06/02_14:02:59 info: G_main_add_SignalHandler: Added signal handler for signal 17 heartbeat[3033]: 2009/06/02_14:02:59 info: Local status now set to: 'up' heartbeat[3033]: 2009/06/02_14:03:01 info: Link xen-b1.fra1.mailcluster:eth0 up. heartbeat[3033]: 2009/06/02_14:03:01 info: Status update for node xen-b1.fra1.mailcluster: status active heartbeat[3033]: 2009/06/02_14:03:02 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 440 ms (> 50 ms) (GSource: 0x1a65ca18) harc[3318]: 2009/06/02_14:03:02 info: Running /etc/ha.d/rc.d/status status heartbeat[3033]: 2009/06/02_14:03:02 info: Comm_now_up(): updating status to active heartbeat[3033]: 2009/06/02_14:03:02 info: Local status now set to: 'active' heartbeat[3033]: 2009/06/02_14:03:02 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 480 ms (> 50 ms) (GSource: 0x1a65ca18) heartbeat[3033]: 2009/06/02_14:03:03 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 690 ms (> 100 ms) before being called (GSource: 0x1a65c388) heartbeat[3033]: 2009/06/02_14:03:03 info: G_SIG_dispatch: started at 1718069668 should have started at 1718069599 heartbeat[3033]: 2009/06/02_14:03:03 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 280 ms (> 30 ms) (GSource: 0x1a65c388) heartbeat[3033]: 2009/06/02_14:03:03 info: remote resource transition completed. heartbeat[3033]: 2009/06/02_14:03:03 info: remote resource transition completed. heartbeat[3033]: 2009/06/02_14:03:03 info: Local Resource acquisition completed. (none) heartbeat[3033]: 2009/06/02_14:03:03 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 420 ms (> 50 ms) (GSource: 0x1a65ca18) heartbeat[3033]: 2009/06/02_14:03:04 info: xen-b1.fra1.mailcluster wants to go standby [foreign] heartbeat[3033]: 2009/06/02_14:03:04 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 60 ms (> 50 ms) (GSource: 0x1a65ca18) heartbeat[3070]: 2009/06/02_14:03:08 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:08 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:08 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:08 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:10 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:10 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:10 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:10 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:12 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:12 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:12 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:12 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:14 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:14 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:14 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:14 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:16 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:16 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:16 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:16 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:18 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:18 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:18 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:18 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:20 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:20 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:20 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:20 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:22 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:22 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:22 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:22 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:24 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:24 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:24 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:24 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:26 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3070]: 2009/06/02_14:03:26 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3070]: 2009/06/02_14:03:26 WARN: Temporarily Suppressing write error messages heartbeat[3070]: 2009/06/02_14:03:26 WARN: Is a cable unplugged on ucast eth0? heartbeat[3072]: 2009/06/02_14:03:26 ERROR: glib: Unable to send [-1] ucast packet: No such device heartbeat[3072]: 2009/06/02_14:03:26 ERROR: write_child: write failure on ucast eth0.: No such device heartbeat[3072]: 2009/06/02_14:03:26 WARN: Temporarily Suppressing write error messages heartbeat[3072]: 2009/06/02_14:03:26 WARN: Is a cable unplugged on ucast eth0? heartbeat[3033]: 2009/06/02_14:04:05 WARN: node xen-b1.fra1.mailcluster: is dead heartbeat[3033]: 2009/06/02_14:04:05 info: Cancelling pending standby operation heartbeat[3033]: 2009/06/02_14:04:05 WARN: No STONITH device configured. heartbeat[3033]: 2009/06/02_14:04:05 WARN: Shared disks are not protected. heartbeat[3033]: 2009/06/02_14:04:05 info: Resources being acquired from xen-b1.fra1.mailcluster. heartbeat[3033]: 2009/06/02_14:04:05 info: Link xen-b1.fra1.mailcluster:eth0 dead. harc[5354]: 2009/06/02_14:04:05 info: Running /etc/ha.d/rc.d/status status heartbeat[5355]: 2009/06/02_14:04:05 info: Local Resource acquisition completed. heartbeat[3033]: 2009/06/02_14:04:05 info: Initial resource acquisition complete (T_RESOURCES(us)) mach_down[5384]: 2009/06/02_14:04:05 info: Taking over resource group drbddisk::drbd_backend_2 ResourceManager[5438]: 2009/06/02_14:04:06 info: Acquiring resource group: xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1 ResourceManager[5438]: 2009/06/02_14:04:06 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend_2 start ResourceManager[5438]: 2009/06/02_14:04:07 info: Running /etc/ha.d/resource.d/xen backend-B1 start mach_down[5384]: 2009/06/02_14:04:09 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired mach_down[5384]: 2009/06/02_14:04:09 info: mach_down takeover complete for node xen-b1.fra1.mailcluster. heartbeat[3033]: 2009/06/02_14:04:09 info: mach_down takeover complete. harc[5796]: 2009/06/02_14:04:09 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp ip-request-resp[5796]: 2009/06/02_14:04:09 received ip-request-resp drbddisk::drbd_backend OK yes ResourceManager[5818]: 2009/06/02_14:04:09 info: Acquiring resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 ResourceManager[5818]: 2009/06/02_14:04:09 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend start ResourceManager[5818]: 2009/06/02_14:04:15 ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk ResourceManager[5818]: 2009/06/02_14:04:15 CRIT: Giving up resources due to failure of drbddisk::drbd_backend ResourceManager[5818]: 2009/06/02_14:04:16 info: Releasing resource group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1 ResourceManager[5818]: 2009/06/02_14:04:16 info: Running /etc/ha.d/resource.d/xen backend-A1 stop ResourceManager[5818]: 2009/06/02_14:04:16 info: Running /etc/ha.d/resource.d/drbddisk drbd_backend stop hb_standby[6175]: 2009/06/02_14:04:46 Going standby [foreign]. heartbeat[3033]: 2009/06/02_14:04:47 info: xen-a1.fra1.mailcluster wants to go standby [foreign] heartbeat[3033]: 2009/06/02_14:04:58 WARN: No reply to standby request. Standby request cancelled. heartbeat[3033]: 2009/06/02_14:03:48 info: Clock jumped backwards. Compensating. heartbeat[3070]: 2009/06/02_14:30:50 ERROR: write_child: Exiting due to persistent errors: No such device heartbeat[3072]: 2009/06/02_14:30:50 ERROR: write_child: Exiting due to persistent errors: No such device heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBWRITE process 3070 exited with return code 1. heartbeat[3033]: 2009/06/02_14:30:50 ERROR: HBWRITE process died. Beginning communications restart process for comm channel 0. heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBREAD process 3071 killed by signal 9 [SIGKILL - Kill, unblockable]. heartbeat[3033]: 2009/06/02_14:30:50 ERROR: Both comm processes for channel 0 have died. Restarting. heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound send socket to device: eth0 heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: started on port 694 interface eth0 to 172.20.1.1 heartbeat[3033]: 2009/06/02_14:30:50 info: Communications restart succeeded. heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBWRITE process 3072 exited with return code 1. heartbeat[3033]: 2009/06/02_14:30:50 ERROR: HBWRITE process died. Beginning communications restart process for comm channel 1. heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBREAD process 3073 killed by signal 9 [SIGKILL - Kill, unblockable]. heartbeat[3033]: 2009/06/02_14:30:50 ERROR: Both comm processes for channel 1 have died. Restarting. heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound send socket to device: eth0 heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound receive socket to device: eth0 heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: started on port 694 interface eth0 to 172.20.2.1 heartbeat[3033]: 2009/06/02_14:30:50 info: Communications restart succeeded. heartbeat[3033]: 2009/06/02_14:30:50 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 40 ms (> 30 ms) (GSource: 0x1a65c388) heartbeat[3033]: 2009/06/02_14:32:29 info: Heartbeat restart on node xen-b1.fra1.mailcluster heartbeat[3033]: 2009/06/02_14:32:29 info: Link xen-b1.fra1.mailcluster:eth0 up. heartbeat[3033]: 2009/06/02_14:32:29 info: Status update for node xen-b1.fra1.mailcluster: status active harc[12603]: 2009/06/02_14:32:29 info: Running /etc/ha.d/rc.d/status status heartbeat[3033]: 2009/06/02_14:32:32 info: all clients are now paused heartbeat[3033]: 2009/06/02_14:32:37 info: all clients are now resumed heartbeat[3033]: 2009/06/03_06:04:32 WARN: 1 lost packet(s) for [xen-b1.fra1.mailcluster] [28878:28880] heartbeat[3033]: 2009/06/03_06:04:32 info: No pkts missing from xen-b1.fra1.mailcluster! heartbeat[3033]: 2009/06/03_07:18:58 WARN: 1 lost packet(s) for [xen-b1.fra1.mailcluster] [31111:31113] heartbeat[3033]: 2009/06/03_07:18:58 info: No pkts missing from xen-b1.fra1.mailcluster! heartbeat[3033]: 2009/06/03_07:36:58 WARN: 1 lost packet(s) for [xen-b1.fra1.mailcluster] [31651:31653] heartbeat[3033]: 2009/06/03_07:36:58 info: No pkts missing from xen-b1.fra1.mailcluster! heartbeat[3033]: 2009/06/03_07:52:52 WARN: 1 lost packet(s) for [xen-b1.fra1.mailcluster] [32128:32130] heartbeat[3033]: 2009/06/03_07:52:52 info: No pkts missing from xen-b1.fra1.mailcluster! cheers