[DRBD-user] frequnet crashes/reboots on a drbd/ha/xen setup

Heiko rupertt at gmail.com
Wed Jun 3 11:59:35 CEST 2009


2009/6/3 Ivars Strazdiņš <ivars.strazdins at gmail.com>

> Heiko,
> any hint from you about actual crash itself (stack trace) as well as OS
> and software versions involved (xen/heartbeat/drbd etc.)?
> BR,
> Ivars
>

Hello,

drbd is version 8.0.13
heartbeat is 2.1.3
xen is 3.0.3

this is the ha logfile from the 2nd server, but this one did also a reboot.

heartbeat[14405]: 2009/06/02_13:59:42 WARN: node xen-a1.fra1.mailcluster: is
dead
heartbeat[14405]: 2009/06/02_13:59:42 WARN: No STONITH device configured.
heartbeat[14405]: 2009/06/02_13:59:42 WARN: Shared disks are not protected.
heartbeat[14405]: 2009/06/02_13:59:42 info: Resources being acquired from
xen-a1.fra1.mailcluster.
heartbeat[14405]: 2009/06/02_13:59:42 info: Link
xen-a1.fra1.mailcluster:eth0 dead.
harc[22620]:    2009/06/02_13:59:42 info: Running /etc/ha.d/rc.d/status
status
mach_down[22653]:    2009/06/02_13:59:42 info: Taking over resource group
drbddisk::drbd_backend
heartbeat[22621]: 2009/06/02_13:59:42 info: Local Resource acquisition
completed.
ResourceManager[22699]:    2009/06/02_13:59:43 info: Acquiring resource
group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
ResourceManager[22699]:    2009/06/02_13:59:43 info: Running
/etc/ha.d/resource.d/drbddisk drbd_backend start
ResourceManager[22699]:    2009/06/02_13:59:43 info: Running
/etc/ha.d/resource.d/xen backend-A1 start
mach_down[22653]:    2009/06/02_13:59:45 info: /usr/share/heartbeat/mach_down:
nice_failback: foreign resources acquired
mach_down[22653]:    2009/06/02_13:59:45 info: mach_down takeover complete
for node xen-a1.fra1.mailcluster.
heartbeat[14405]: 2009/06/02_13:59:45 info: mach_down takeover complete.
heartbeat[14405]: 2009/06/02_14:00:35 info: Heartbeat restart on node
xen-a1.fra1.mailcluster
heartbeat[14405]: 2009/06/02_14:00:35 info: Link
xen-a1.fra1.mailcluster:eth0 up.
heartbeat[14405]: 2009/06/02_14:00:35 info: Status update for node
xen-a1.fra1.mailcluster: status init
heartbeat[14405]: 2009/06/02_14:00:35 info: Status update for node
xen-a1.fra1.mailcluster: status up
harc[23036]:    2009/06/02_14:00:35 info: Running /etc/ha.d/rc.d/status
status
harc[23052]:    2009/06/02_14:00:35 info: Running /etc/ha.d/rc.d/status
status
heartbeat[14405]: 2009/06/02_14:00:37 info: Status update for node
xen-a1.fra1.mailcluster: status active
harc[23068]:    2009/06/02_14:00:37 info: Running /etc/ha.d/rc.d/status
status
heartbeat[14405]: 2009/06/02_14:00:38 info: remote resource transition
completed.
heartbeat[14405]: 2009/06/02_14:00:38 info: xen-b1.fra1.mailcluster wants to
go standby [foreign]
heartbeat[14405]: 2009/06/02_14:00:38 info: standby: xen-a1.fra1.mailcluster
can take our foreign resources
heartbeat[23084]: 2009/06/02_14:00:38 info: give up foreign HA resources
(standby).
ResourceManager[23097]:    2009/06/02_14:00:38 info: Releasing resource
group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
ResourceManager[23097]:    2009/06/02_14:00:38 info: Running
/etc/ha.d/resource.d/xen backend-A1 stop
heartbeat[3024]: 2009/06/02_14:01:13 info: Version 2 support: false
heartbeat[3024]: 2009/06/02_14:01:13 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[3024]: 2009/06/02_14:01:13 info: **************************
heartbeat[3024]: 2009/06/02_14:01:13 info: Configuration validated. Starting
heartbeat 2.1.3
heartbeat[3025]: 2009/06/02_14:01:13 info: heartbeat: version 2.1.3
heartbeat[3025]: 2009/06/02_14:01:13 info: Heartbeat generation: 1202824473
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: started on port 694
interface eth0 to 172.20.1.1
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3025]: 2009/06/02_14:01:13 info: glib: ucast: started on port 694
interface eth0 to 172.20.2.1
heartbeat[3025]: 2009/06/02_14:01:13 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[3025]: 2009/06/02_14:01:13 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[3025]: 2009/06/02_14:01:13 info: G_main_add_SignalHandler: Added
signal handler for signal 17
heartbeat[3025]: 2009/06/02_14:01:13 info: Local status now set to: 'up'
heartbeat[3064]: 2009/06/02_14:01:19 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:19 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:19 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:19 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:21 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:21 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:21 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:21 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:23 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:23 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:23 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:23 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:25 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:25 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:25 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:25 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:27 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:27 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:27 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:27 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:29 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:29 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:29 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:29 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:31 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:31 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:31 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:31 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:33 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:33 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:33 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:33 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:35 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:35 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:35 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:35 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:37 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3064]: 2009/06/02_14:01:37 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3064]: 2009/06/02_14:01:37 WARN: Temporarily Suppressing write
error messages
heartbeat[3064]: 2009/06/02_14:01:37 WARN: Is a cable unplugged on ucast
eth0?
heartbeat[3062]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:37 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:37 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3062]: 2009/06/02_14:01:37 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3062]: 2009/06/02_14:01:37 WARN: Temporarily Suppressing write
error messages
heartbeat[3062]: 2009/06/02_14:01:37 WARN: Is a cable unplugged on ucast
eth0?
heartbeat[3025]: 2009/06/02_14:03:14 WARN: node xen-a1.fra1.mailcluster: is
dead
heartbeat[3025]: 2009/06/02_14:03:14 info: Comm_now_up(): updating status to
active
heartbeat[3025]: 2009/06/02_14:03:14 info: Local status now set to: 'active'
heartbeat[3025]: 2009/06/02_14:03:14 WARN: No STONITH device configured.
heartbeat[3025]: 2009/06/02_14:03:14 WARN: Shared disks are not protected.
heartbeat[3025]: 2009/06/02_14:03:14 info: Resources being acquired from
xen-a1.fra1.mailcluster.
harc[5874]:    2009/06/02_14:03:14 info: Running /etc/ha.d/rc.d/status
status
heartbeat[5875]: 2009/06/02_14:03:14 info: Local Resource acquisition
completed.
mach_down[5915]:    2009/06/02_14:03:14 info: Taking over resource group
drbddisk::drbd_backend
ResourceManager[5958]:    2009/06/02_14:03:14 info: Acquiring resource
group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
ResourceManager[5958]:    2009/06/02_14:03:14 info: Running
/etc/ha.d/resource.d/drbddisk drbd_backend start
ResourceManager[5958]:    2009/06/02_14:03:14 info: Running
/etc/ha.d/resource.d/xen backend-A1 start
mach_down[5915]:    2009/06/02_14:03:16 info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[5915]:    2009/06/02_14:03:16 info: mach_down takeover complete
for node xen-a1.fra1.mailcluster.
heartbeat[3025]: 2009/06/02_14:03:16 info: mach_down takeover complete.
heartbeat[3025]: 2009/06/02_14:03:16 info: Initial resource acquisition
complete (mach_down)
harc[6221]:    2009/06/02_14:03:16 info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[6221]:    2009/06/02_14:03:16 received ip-request-resp
drbddisk::drbd_backend_2 OK yes
ResourceManager[6242]:    2009/06/02_14:03:16 info: Acquiring resource
group: xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
ResourceManager[6242]:    2009/06/02_14:03:16 info: Running
/etc/ha.d/resource.d/drbddisk drbd_backend_2 start
ResourceManager[6242]:    2009/06/02_14:03:16 info: Running
/etc/ha.d/resource.d/xen backend-B1 start
heartbeat[3025]: 2009/06/02_14:03:24 info: Local Resource acquisition
completed. (none)
heartbeat[3025]: 2009/06/02_14:03:24 info: local resource transition
completed.
heartbeat[3062]: 2009/06/02_14:32:27 ERROR: write_child: Exiting due to
persistent errors: No such device
heartbeat[3064]: 2009/06/02_14:32:27 ERROR: write_child: Exiting due to
persistent errors: No such device
heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBWRITE process 3062
exited with return code 1.
heartbeat[3025]: 2009/06/02_14:32:27 ERROR: HBWRITE process died.  Beginning
communications restart process for comm channel 0.
heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBREAD process 3063
killed by signal 9 [SIGKILL - Kill, unblockable].
heartbeat[3025]: 2009/06/02_14:32:27 ERROR: Both comm processes for channel
0 have died.  Restarting.
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: started on port 694
interface eth0 to 172.20.1.1
heartbeat[3025]: 2009/06/02_14:32:27 info: Communications restart succeeded.
heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBWRITE process 3064
exited with return code 1.
heartbeat[3025]: 2009/06/02_14:32:27 ERROR: HBWRITE process died.  Beginning
communications restart process for comm channel 1.
heartbeat[3025]: 2009/06/02_14:32:27 WARN: Managed HBREAD process 3065
killed by signal 9 [SIGKILL - Kill, unblockable].
heartbeat[3025]: 2009/06/02_14:32:27 ERROR: Both comm processes for channel
1 have died.  Restarting.
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3025]: 2009/06/02_14:32:27 info: glib: ucast: started on port 694
interface eth0 to 172.20.2.1
heartbeat[3025]: 2009/06/02_14:32:27 info: Communications restart succeeded.
heartbeat[3025]: 2009/06/02_14:32:27 WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD took too long to execute: 100 ms (> 30 ms) (GSource: 0x14358388)
heartbeat[3025]: 2009/06/02_14:32:28 info: Link xen-a1.fra1.mailcluster:eth0
up.
heartbeat[3025]: 2009/06/02_14:32:28 info: Status update for node
xen-a1.fra1.mailcluster: status active
harc[13992]:    2009/06/02_14:32:28 info: Running /etc/ha.d/rc.d/status
status
heartbeat[3025]: 2009/06/02_14:32:31 info: all clients are now paused
heartbeat[3025]: 2009/06/02_14:32:31 info: all clients are now resumed


shortly before the /var/log/messages show me:

Jun  2 13:58:48 xen-B1 kernel: drbd1: PingAck did not arrive in time.
Jun  2 13:58:48 xen-B1 kernel: drbd1: peer( Secondary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jun  2 13:58:48 xen-B1 kernel: drbd1: asender terminated
Jun  2 13:58:48 xen-B1 kernel: drbd1: Terminating asender thread
Jun  2 13:58:48 xen-B1 kernel: drbd1: short read expecting header on sock:
r=-512
Jun  2 13:58:48 xen-B1 kernel: drbd1: Creating new current UUID
Jun  2 13:58:48 xen-B1 kernel: drbd1: Writing meta data super block now.
Jun  2 13:58:48 xen-B1 kernel: drbd1: tl_clear()
Jun  2 13:58:48 xen-B1 kernel: drbd1: Connection closed
Jun  2 13:58:48 xen-B1 kernel: drbd1: conn( NetworkFailure -> Unconnected )
Jun  2 13:58:48 xen-B1 kernel: drbd1: receiver terminated
Jun  2 13:58:48 xen-B1 kernel: drbd1: Restarting receiver thread
Jun  2 13:58:48 xen-B1 kernel: drbd1: receiver (re)started
Jun  2 13:58:48 xen-B1 kernel: drbd1: conn( Unconnected -> WFConnection )
Jun  2 13:58:54 xen-B1 kernel: drbd0: PingAck did not arrive in time.
Jun  2 13:58:54 xen-B1 kernel: drbd0: peer( Primary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jun  2 13:58:54 xen-B1 kernel: drbd0: asender terminated
Jun  2 13:58:54 xen-B1 kernel: drbd0: Terminating asender thread
Jun  2 13:58:54 xen-B1 kernel: drbd0: short read receiving data: read 2984
expected 4096
Jun  2 13:58:54 xen-B1 kernel: drbd0: error receiving Data, l: 20504!
Jun  2 13:58:54 xen-B1 kernel: drbd0: Writing meta data super block now.
Jun  2 13:58:54 xen-B1 kernel: drbd0: tl_clear()
Jun  2 13:58:54 xen-B1 kernel: drbd0: Connection closed
Jun  2 13:58:54 xen-B1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jun  2 13:58:54 xen-B1 kernel: drbd0: receiver terminated
Jun  2 13:58:54 xen-B1 kernel: drbd0: Restarting receiver thread
Jun  2 13:58:54 xen-B1 kernel: drbd0: receiver (re)started
Jun  2 13:58:54 xen-B1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jun  2 13:58:58 xen-B1 kernel: drbd1: connect failed, err = -115
Jun  2 13:59:04 xen-B1 kernel: drbd0: connect failed, err = -115
Jun  2 13:59:08 xen-B1 kernel: drbd1: connect failed, err = -115
Jun  2 13:59:14 xen-B1 kernel: drbd0: connect failed, err = -115
Jun  2 13:59:18 xen-B1 kernel: drbd1: connect failed, err = -115
Jun  2 13:59:24 xen-B1 kernel: drbd0: connect failed, err = -115
Jun  2 13:59:38 xen-B1 kernel: drbd1: connect failed, err = -115
Jun  2 13:59:42 xen-B1 heartbeat: [14405]: WARN: node
xen-a1.fra1.mailcluster: is dead



and this from the 1st server:


heartbeat[13948]: 2009/06/02_12:31:54 WARN: 1 lost packet(s) for
[xen-b1.fra1.mailcluster] [600686:600688]
heartbeat[13948]: 2009/06/02_12:31:54 info: No pkts missing from
xen-b1.fra1.mailcluster!
heartbeat[3032]: 2009/06/02_14:02:59 info: Version 2 support: false
heartbeat[3032]: 2009/06/02_14:02:59 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[3032]: 2009/06/02_14:02:59 info: **************************
heartbeat[3032]: 2009/06/02_14:02:59 info: Configuration validated. Starting
heartbeat 2.1.3
heartbeat[3033]: 2009/06/02_14:02:59 info: heartbeat: version 2.1.3
heartbeat[3033]: 2009/06/02_14:02:59 info: Heartbeat generation: 1202824052
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: started on port 694
interface eth0 to 172.20.1.1
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3033]: 2009/06/02_14:02:59 info: glib: ucast: started on port 694
interface eth0 to 172.20.2.1
heartbeat[3033]: 2009/06/02_14:02:59 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[3033]: 2009/06/02_14:02:59 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[3033]: 2009/06/02_14:02:59 info: G_main_add_SignalHandler: Added
signal handler for signal 17
heartbeat[3033]: 2009/06/02_14:02:59 info: Local status now set to: 'up'
heartbeat[3033]: 2009/06/02_14:03:01 info: Link xen-b1.fra1.mailcluster:eth0
up.
heartbeat[3033]: 2009/06/02_14:03:01 info: Status update for node
xen-b1.fra1.mailcluster: status active
heartbeat[3033]: 2009/06/02_14:03:02 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 440 ms (> 50 ms) (GSource:
0x1a65ca18)
harc[3318]:    2009/06/02_14:03:02 info: Running /etc/ha.d/rc.d/status
status
heartbeat[3033]: 2009/06/02_14:03:02 info: Comm_now_up(): updating status to
active
heartbeat[3033]: 2009/06/02_14:03:02 info: Local status now set to: 'active'
heartbeat[3033]: 2009/06/02_14:03:02 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 480 ms (> 50 ms) (GSource:
0x1a65ca18)
heartbeat[3033]: 2009/06/02_14:03:03 WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD was delayed 690 ms (> 100 ms) before being called (GSource:
0x1a65c388)
heartbeat[3033]: 2009/06/02_14:03:03 info: G_SIG_dispatch: started at
1718069668 should have started at 1718069599
heartbeat[3033]: 2009/06/02_14:03:03 WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD took too long to execute: 280 ms (> 30 ms) (GSource: 0x1a65c388)
heartbeat[3033]: 2009/06/02_14:03:03 info: remote resource transition
completed.
heartbeat[3033]: 2009/06/02_14:03:03 info: remote resource transition
completed.
heartbeat[3033]: 2009/06/02_14:03:03 info: Local Resource acquisition
completed. (none)
heartbeat[3033]: 2009/06/02_14:03:03 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 420 ms (> 50 ms) (GSource:
0x1a65ca18)
heartbeat[3033]: 2009/06/02_14:03:04 info: xen-b1.fra1.mailcluster wants to
go standby [foreign]
heartbeat[3033]: 2009/06/02_14:03:04 WARN: G_CH_dispatch_int: Dispatch
function for read child took too long to execute: 60 ms (> 50 ms) (GSource:
0x1a65ca18)
heartbeat[3070]: 2009/06/02_14:03:08 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:08 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:08 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:08 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:10 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:10 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:10 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:10 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:12 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:12 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:12 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:12 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:14 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:14 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:14 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:14 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:16 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:16 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:16 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:16 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:16 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:18 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:18 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:18 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:18 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:20 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:20 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:20 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:20 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:22 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:22 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:22 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:22 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:24 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:24 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:24 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:24 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:26 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3070]: 2009/06/02_14:03:26 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3070]: 2009/06/02_14:03:26 WARN: Temporarily Suppressing write
error messages
heartbeat[3070]: 2009/06/02_14:03:26 WARN: Is a cable unplugged on ucast
eth0?
heartbeat[3072]: 2009/06/02_14:03:26 ERROR: glib: Unable to send [-1] ucast
packet: No such device
heartbeat[3072]: 2009/06/02_14:03:26 ERROR: write_child: write failure on
ucast eth0.: No such device
heartbeat[3072]: 2009/06/02_14:03:26 WARN: Temporarily Suppressing write
error messages
heartbeat[3072]: 2009/06/02_14:03:26 WARN: Is a cable unplugged on ucast
eth0?
heartbeat[3033]: 2009/06/02_14:04:05 WARN: node xen-b1.fra1.mailcluster: is
dead
heartbeat[3033]: 2009/06/02_14:04:05 info: Cancelling pending standby
operation
heartbeat[3033]: 2009/06/02_14:04:05 WARN: No STONITH device configured.
heartbeat[3033]: 2009/06/02_14:04:05 WARN: Shared disks are not protected.
heartbeat[3033]: 2009/06/02_14:04:05 info: Resources being acquired from
xen-b1.fra1.mailcluster.
heartbeat[3033]: 2009/06/02_14:04:05 info: Link xen-b1.fra1.mailcluster:eth0
dead.
harc[5354]:    2009/06/02_14:04:05 info: Running /etc/ha.d/rc.d/status
status
heartbeat[5355]: 2009/06/02_14:04:05 info: Local Resource acquisition
completed.
heartbeat[3033]: 2009/06/02_14:04:05 info: Initial resource acquisition
complete (T_RESOURCES(us))
mach_down[5384]:    2009/06/02_14:04:05 info: Taking over resource group
drbddisk::drbd_backend_2
ResourceManager[5438]:    2009/06/02_14:04:06 info: Acquiring resource
group: xen-b1.fra1.mailcluster drbddisk::drbd_backend_2 xen::backend-B1
ResourceManager[5438]:    2009/06/02_14:04:06 info: Running
/etc/ha.d/resource.d/drbddisk drbd_backend_2 start
ResourceManager[5438]:    2009/06/02_14:04:07 info: Running
/etc/ha.d/resource.d/xen backend-B1 start
mach_down[5384]:    2009/06/02_14:04:09 info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[5384]:    2009/06/02_14:04:09 info: mach_down takeover complete
for node xen-b1.fra1.mailcluster.
heartbeat[3033]: 2009/06/02_14:04:09 info: mach_down takeover complete.
harc[5796]:    2009/06/02_14:04:09 info: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[5796]:    2009/06/02_14:04:09 received ip-request-resp
drbddisk::drbd_backend OK yes
ResourceManager[5818]:    2009/06/02_14:04:09 info: Acquiring resource
group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
ResourceManager[5818]:    2009/06/02_14:04:09 info: Running
/etc/ha.d/resource.d/drbddisk drbd_backend start
ResourceManager[5818]:    2009/06/02_14:04:15 ERROR: Return code 1 from
/etc/ha.d/resource.d/drbddisk
ResourceManager[5818]:    2009/06/02_14:04:15 CRIT: Giving up resources due
to failure of drbddisk::drbd_backend
ResourceManager[5818]:    2009/06/02_14:04:16 info: Releasing resource
group: xen-a1.fra1.mailcluster drbddisk::drbd_backend xen::backend-A1
ResourceManager[5818]:    2009/06/02_14:04:16 info: Running
/etc/ha.d/resource.d/xen backend-A1 stop
ResourceManager[5818]:    2009/06/02_14:04:16 info: Running
/etc/ha.d/resource.d/drbddisk drbd_backend stop
hb_standby[6175]:    2009/06/02_14:04:46 Going standby [foreign].
heartbeat[3033]: 2009/06/02_14:04:47 info: xen-a1.fra1.mailcluster wants to
go standby [foreign]
heartbeat[3033]: 2009/06/02_14:04:58 WARN: No reply to standby request.
Standby request cancelled.
heartbeat[3033]: 2009/06/02_14:03:48 info: Clock jumped backwards.
Compensating.
heartbeat[3070]: 2009/06/02_14:30:50 ERROR: write_child: Exiting due to
persistent errors: No such device
heartbeat[3072]: 2009/06/02_14:30:50 ERROR: write_child: Exiting due to
persistent errors: No such device
heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBWRITE process 3070
exited with return code 1.
heartbeat[3033]: 2009/06/02_14:30:50 ERROR: HBWRITE process died.  Beginning
communications restart process for comm channel 0.
heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBREAD process 3071
killed by signal 9 [SIGKILL - Kill, unblockable].
heartbeat[3033]: 2009/06/02_14:30:50 ERROR: Both comm processes for channel
0 have died.  Restarting.
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: started on port 694
interface eth0 to 172.20.1.1
heartbeat[3033]: 2009/06/02_14:30:50 info: Communications restart succeeded.
heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBWRITE process 3072
exited with return code 1.
heartbeat[3033]: 2009/06/02_14:30:50 ERROR: HBWRITE process died.  Beginning
communications restart process for comm channel 1.
heartbeat[3033]: 2009/06/02_14:30:50 WARN: Managed HBREAD process 3073
killed by signal 9 [SIGKILL - Kill, unblockable].
heartbeat[3033]: 2009/06/02_14:30:50 ERROR: Both comm processes for channel
1 have died.  Restarting.
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth0
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound send socket to
device: eth0
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: bound receive socket
to device: eth0
heartbeat[3033]: 2009/06/02_14:30:50 info: glib: ucast: started on port 694
interface eth0 to 172.20.2.1
heartbeat[3033]: 2009/06/02_14:30:50 info: Communications restart succeeded.
heartbeat[3033]: 2009/06/02_14:30:50 WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD took too long to execute: 40 ms (> 30 ms) (GSource: 0x1a65c388)
heartbeat[3033]: 2009/06/02_14:32:29 info: Heartbeat restart on node
xen-b1.fra1.mailcluster
heartbeat[3033]: 2009/06/02_14:32:29 info: Link xen-b1.fra1.mailcluster:eth0
up.
heartbeat[3033]: 2009/06/02_14:32:29 info: Status update for node
xen-b1.fra1.mailcluster: status active
harc[12603]:    2009/06/02_14:32:29 info: Running /etc/ha.d/rc.d/status
status
heartbeat[3033]: 2009/06/02_14:32:32 info: all clients are now paused
heartbeat[3033]: 2009/06/02_14:32:37 info: all clients are now resumed
heartbeat[3033]: 2009/06/03_06:04:32 WARN: 1 lost packet(s) for
[xen-b1.fra1.mailcluster] [28878:28880]
heartbeat[3033]: 2009/06/03_06:04:32 info: No pkts missing from
xen-b1.fra1.mailcluster!
heartbeat[3033]: 2009/06/03_07:18:58 WARN: 1 lost packet(s) for
[xen-b1.fra1.mailcluster] [31111:31113]
heartbeat[3033]: 2009/06/03_07:18:58 info: No pkts missing from
xen-b1.fra1.mailcluster!
heartbeat[3033]: 2009/06/03_07:36:58 WARN: 1 lost packet(s) for
[xen-b1.fra1.mailcluster] [31651:31653]
heartbeat[3033]: 2009/06/03_07:36:58 info: No pkts missing from
xen-b1.fra1.mailcluster!
heartbeat[3033]: 2009/06/03_07:52:52 WARN: 1 lost packet(s) for
[xen-b1.fra1.mailcluster] [32128:32130]
heartbeat[3033]: 2009/06/03_07:52:52 info: No pkts missing from
xen-b1.fra1.mailcluster!


cheers


More information about the drbd-user mailing list