Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, at my site we have an ha-nfs server with drbd+heartbeat that
recently "failed to fail-over" to the secondary node. All the tests
prior to production and after this error didn't have the same behaviour.
What we guess had happened is that for some reason heartbeat
detected and initiated the fail-over process before drbd could go on a
WFconnection state.
From what we've understood drbd should by default detect a failure
within 6 to 16 seconds, but in our case it took around 30 seconds
(16:43:11 -> 16:43:41 there is some seconds of delay considering remote
syslogging).
Any comments, questions and enlightening are obviously more than
appreciated.
cheers,
Massimo
os: rhel-4.6
kernel: 2.6.9-55.0.2.ELsmp
drbd: 0.7.24
heartbeat: 2.1.2-3
server-0000:
eth0: 10.5.1.2/16
eth1: 192.168.0.1/24
server-0001:
eth0: 10.5.1.3/16
eth1: 192.168.0.2/24
fail-over IP: 10.5.2.1/16 (fs-cluster)
drbd.conf:
resource drbd-resource-0 {
protocol C;
incon-degr-cmd "killall heartbeat";
startup {
wfc-timeout 120;
degr-wfc-timeout 120; # 2 minutes.
}
disk {
on-io-error panic;
}
syncer {
rate 10M;
}
on server-0000 {
device /dev/drbd0;
disk /dev/vg00/nfs;
address 10.5.1.2:7789; meta-disk internal;
}
on server-0001 {
device /dev/drbd0;
disk /dev/vg00/nfs;
address 10.5.1.3:7789;
meta-disk internal;
}
}
# drbdsetup /dev/drbd0 show
Lower device: 253:04 (dm-4)
Meta device: internal
Disk options:
on-io-error = panic
Local address: 10.5.1.2:7789
Remote address: 10.5.1.3:7789
Wire protocol: C
Net options:
timeout = 6.0 sec (default)
connect-int = 10 sec (default)
ping-int = 10 sec (default)
max-epoch-size = 2048 (default)
max-buffers = 2048 (default)
unplug-watermark = 128 (default)
sndbuf-size = 131070 (default)
ko-count = 0 (default)
Syncer options:
rate = 10240 KB/sec
group = 0 (default)
al-extents = 127 (default)
ha.cf:
logfacility local0
keepalive 2
deadtime 30
warntime 10
initdead 120
udpport 694
ucast eth0 10.5.1.2
ucast eth1 192.168.0.1
ucast eth0 10.5.1.3
ucast eth1 192.168.0.2
auto_failback off
watchdog /dev/watchdog
node server-0000
node server-0001
debug 0
use_logd no
haresources:
server-0000 IPaddr::10.5.2.1/16/eth0 drbddisk::drbd-resource-0
Filesystem::/dev/drbd0::/nfs::ext3 killnfsd nfs nfslock
/var/log/messages:
Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Daily informational
memory statistics
Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats:
10/24560866 ms age 0 [pid3586/MST_CONTROL]
Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats:
821/803285302 141060/71222 [pid3586/MST_CONTROL]
Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: RealMalloc stats:
168756 total malloc bytes. pid [3586/MST_CONTROL]
Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Current arena value: 0
Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats: 0/1 ms
age 766170962 [pid3626/HBFIFO]
Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats:
447/512 46732/24700 [pid3626/HBFIFO]
Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats:
47972 total malloc bytes. pid [3626/HBFIFO]
Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena value: 0
Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms
age 765428326 [pid3631/HBWRITE]
Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: cl_malloc stats:
451/8668904 47580/25428 [pid3631/HBWRITE]
Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats:
56032 total malloc bytes. pid [3631/HBWRITE]
Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena value: 0
Feb 11 16:43:11 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms
age 765432156 [pid3632/HBREAD]
[...server-0000 dies...]
Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link
server-0000:eth0 dead.
Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link
server-0000:eth1 dead.
Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: node server-0000:
is dead
Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: No STONITH device
configured.
Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: Shared disks are
not protected.
Feb 11 16:43:29 server-0001 heartbeat: [3483]: info: Resources being
acquired from server-0000.
Feb 11 16:43:29 server-0001 harc[23487]: info: Running
/etc/ha.d/rc.d/status status
Feb 11 16:43:29 server-0001 heartbeat: [23488]: info: No local resources
[/usr/share/heartbeat/ResourceManager listkeys server-0001] to acquire.
Feb 11 16:43:29 server-0001 mach_down[23516]: info: Taking over resource
group IPaddr::10.5.2.1/16/eth0
Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Acquiring
resource group: server-0000 IPaddr::10.5.2.1/16/eth0
drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd
nfs nfslock
Feb 11 16:43:29 server-0001 IPaddr[23569]: INFO: Resource is stopped
Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running
/etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 start
Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: Using calculated
netmask for 10.5.2.1: 255.255.0.0
Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: eval ifconfig eth0:1
10.5.2.1 netmask 255.255.0.0 broadcast 10.5.255.255
Feb 11 16:43:29 server-0001 IPaddr[23638]: INFO: Success
Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running
/etc/ha.d/resource.d/drbddisk drbd-resource-0 start
Feb 11 16:43:35 server-0001 ResourceManager[23542]: ERROR: Return code
20 from /etc/ha.d/resource.d/drbddisk
Feb 11 16:43:35 server-0001 ResourceManager[23542]: CRIT: Giving up
resources due to failure of drbddisk::drbd-resource-0
Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Releasing
resource group: server-0000 IPaddr::10.5.2.1/16/eth0
drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd
nfs nfslock
Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
/etc/init.d/nfslock stop
Feb 11 16:43:35 server-0001 nfslock: rpc.statd shutdown failed
Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
/etc/init.d/nfs stop
Feb 11 16:43:35 server-0001 nfs: rpc.mountd shutdown failed
Feb 11 16:43:35 server-0001 nfs: nfsd shutdown failed
Feb 11 16:43:35 server-0001 nfs: rpc.rquotad shutdown failed
Feb 11 16:43:35 server-0001 exportfs[23925]: could not open
/var/lib/nfs/etab for locking
Feb 11 16:43:35 server-0001 exportfs[23925]: can't lock
/var/lib/nfs/etab for writing
Feb 11 16:43:35 server-0001 exportfs: exportfs: could not open
/var/lib/nfs/etab for locking
Feb 11 16:43:35 server-0001 exportfs: exportfs: can't lock
/var/lib/nfs/etab for writing
Feb 11 16:43:35 server-0001 nfs: Shutting down NFS services: succeeded
Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
/etc/ha.d/resource.d/killnfsd stop
Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
/etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs ext3 stop
Feb 11 16:43:35 server-0001 Filesystem[23984]: INFO: Running stop for
/dev/drbd0 on /nfs
Feb 11 16:43:35 server-0001 Filesystem[23973]: INFO: Success
Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
/etc/ha.d/resource.d/drbddisk drbd-resource-0 stop
Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
/etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 stop
Feb 11 16:43:35 server-0001 IPaddr[24111]: INFO: ifconfig eth0:1 down
Feb 11 16:43:35 server-0001 IPaddr[24082]: INFO: Success
Feb 11 16:43:35 server-0001 mach_down[23516]: info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Feb 11 16:43:35 server-0001 mach_down[23516]: info: mach_down takeover
complete for node server-0000.
Feb 11 16:43:35 server-0001 heartbeat: [3483]: info: mach_down takeover
complete.
Feb 11 16:43:36 server-0004 kernel: nfs: server fs-cluster not
responding, still trying
Feb 11 16:43:41 server-0001 kernel: drbd0: PingAck did not arrive in time.
Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_asender [27907]: cstate
Connected --> NetworkFailure
Feb 11 16:43:41 server-0001 kernel: drbd0: asender terminated
Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate
NetworkFailure --> BrokenPipe
Feb 11 16:43:41 server-0001 kernel: drbd0: short read expecting header
on sock: r=-512
Feb 11 16:43:41 server-0001 kernel: drbd0: worker terminated
Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate
BrokenPipe --> Unconnected
Feb 11 16:43:41 server-0001 kernel: drbd0: Connection lost.
Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate
Unconnected --> WFConnection
Feb 11 16:44:05 server-0001 hb_standby[24463]: Going standby [foreign].
Feb 11 16:44:06 server-0001 heartbeat: [3483]: info: server-0001 wants
to go standby [foreign]
Feb 11 16:44:16 server-0001 heartbeat: [3483]: WARN: No reply to standby
request. Standby request cancelled.
[...server-0000 restarts...]
Feb 11 16:49:23 server-0000 syslogd 1.4.1: restart (remote reception).
Feb 11 16:49:23 server-0000 syslog: syslogd startup succeeded
Feb 11 16:49:23 server-0000 kernel: klogd 1.4.1, log source = /proc/kmsg
started.
Feb 11 16:49:23 server-0000 kernel: Linux version 2.6.9-55.0.2.ELsmp
(brewbuilder at hs20-bc2-3.build.redhat.com) (gcc version 3.4.6 20060404
(Red Hat 3.4.6-8)) #1 SMP Tue Jun 12 17:59:08 EDT 2007
[...]
Feb 11 16:49:46 server-0000 network: Setting network parameters: succeeded
Feb 11 16:49:46 server-0000 network: Bringing up loopback interface:
succeeded
Feb 11 16:49:46 server-0000 network: Bringing up interface eth0: succeeded
Feb 11 16:49:46 server-0000 drbd: [
Feb 11 16:49:47 server-0000 network: Bringing up interface eth1: succeeded
Feb 11 16:49:47 server-0000 kernel: drbd0: resync bitmap: bits=26181632
words=818176
Feb 11 16:49:47 server-0000 drbd: d0
Feb 11 16:49:47 server-0000 kernel: drbd0: size = 99 GB (104726528 KB)
Feb 11 16:49:48 server-0000 kernel: drbd0: 0 KB marked out-of-sync by on
disk bit-map.
Feb 11 16:49:48 server-0000 kernel: drbd0: Found 4 transactions (192
active extents) in activity log.
Feb 11 16:49:48 server-0000 kernel: drbd0: Marked additional 508 MB as
out-of-sync based on AL.
Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3508]: cstate
Unconfigured --> StandAlone
Feb 11 16:49:48 server-0000 drbd: s0
Feb 11 16:49:48 server-0000 drbd: n0
Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3521]: cstate
StandAlone --> Unconnected
Feb 11 16:49:48 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate
Unconnected --> WFConnection
Feb 11 16:49:48 server-0000 drbd: ].
Feb 11 16:49:48 server-0000 drbd: WARN: stdin/stdout is not a TTY; using
/dev/console
Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate
WFConnection --> WFReportParams
Feb 11 16:49:51 server-0001 kernel: drbd0: Handshake successful: DRBD
Network Protocol version 74
Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate
WFConnection --> WFReportParams
Feb 11 16:49:51 server-0001 kernel: drbd0: Connection established.
Feb 11 16:49:51 server-0000 kernel: drbd0: Handshake successful: DRBD
Network Protocol version 74
Feb 11 16:49:51 server-0001 kernel: drbd0: I am(S):
1:00000002:00000001:00000043:00000001:01
Feb 11 16:49:51 server-0000 kernel: drbd0: Connection established.
Feb 11 16:49:51 server-0001 kernel: drbd0: Peer(S):
1:00000002:00000001:00000043:00000001:11
Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate
WFReportParams --> WFBitMapT
Feb 11 16:49:51 server-0000 kernel: drbd0: I am(S):
1:00000002:00000001:00000043:00000001:11
Feb 11 16:49:51 server-0001 kernel: drbd0: Secondary/Unknown -->
Secondary/Secondary
Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate
WFBitMapT --> SyncTarget
Feb 11 16:49:51 server-0000 kernel: drbd0: Peer(S):
1:00000002:00000001:00000043:00000001:01
Feb 11 16:49:51 server-0001 kernel: drbd0: Resync started as SyncTarget
(need to sync 520192 KB [130048 bits set]).
Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate
WFReportParams --> WFBitMapS
Feb 11 16:49:51 server-0000 rc: Starting drbd: succeeded
Feb 11 16:49:51 server-0000 kernel: drbd0: Secondary/Unknown -->
Secondary/Secondary
Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate
WFBitMapS --> SyncSource
Feb 11 16:49:51 server-0000 kernel: drbd0: Resync started as SyncSource
(need to sync 520192 KB [130048 bits set]).
Feb 11 16:49:51 server-0000 logd: [3547]: info: logd started with
default configuration.
Feb 11 16:49:52 server-0000 logd: [3555]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Feb 11 16:49:52 server-0000 logd: [3547]: info:
G_main_add_SignalHandler: Added signal handler for signal 15
Feb 11 16:49:52 server-0000 kernel: Software Watchdog Timer: 0.07
initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0)
Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is
disabled --enabling logging daemon is recommended
Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Version 2 support:
false
Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is
disabled --enabling logging daemon is recommended
Feb 11 16:49:52 server-0000 heartbeat: [3668]: info:
**************************
Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Configuration
validated. Starting heartbeat 2.1.2
Feb 11 16:49:52 server-0000 heartbeat: [3671]: info: heartbeat: version
2.1.2
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Heartbeat
generation: 1187008426
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info:
G_main_add_TriggerHandler: Added signal manual handler
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info:
G_main_add_TriggerHandler: Added signal manual handler
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: write
socket priority set to IPTOS_LOWDELAY on eth0
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound
send socket to device: eth0
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound
receive socket to device: eth0
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
started on port 694 interface eth0 to 10.5.1.2
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: write
socket priority set to IPTOS_LOWDELAY on eth1
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound
send socket to device: eth1
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound
receive socket to device: eth1
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
started on port 694 interface eth1 to 192.168.0.1
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: write
socket priority set to IPTOS_LOWDELAY on eth0
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound
send socket to device: eth0
Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound
receive socket to device: eth0
Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast:
started on port 694 interface eth0 to 10.5.1.3
Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: write
socket priority set to IPTOS_LOWDELAY on eth1
Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: bound
send socket to device: eth1
Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: bound
receive socket to device: eth1
Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Heartbeat restart
on node server-0000
Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link
server-0000:eth0 up.
Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast:
started on port 694 interface eth1 to 192.168.0.2
Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for
node server-0000: status init
Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for
node server-0000: status up
Feb 11 16:49:54 server-0000 heartbeat: [3671]: notice: Using watchdog
device: /dev/watchdog
Feb 11 16:49:54 server-0001 harc[28317]: info: Running
/etc/ha.d/rc.d/status status
Feb 11 16:49:54 server-0001 harc[28334]: info: Running
/etc/ha.d/rc.d/status status
Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: all clients are now
paused
Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link
server-0000:eth1 up.
Feb 11 16:49:55 server-0000 heartbeat: [3671]: info:
G_main_add_SignalHandler: Added signal handler for signal 17
Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Local status now
set to: 'up'
Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link
server-0001:eth0 up.
Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Status update for
node server-0001: status active
Feb 11 16:49:55 server-0000 harc[3772]: info: Running
/etc/ha.d/rc.d/status status
Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link
server-0001:eth1 up.
Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Comm_now_up():
updating status to active
Feb 11 16:49:55 server-0001 heartbeat: [3483]: info: Status update for
node server-0000: status active
Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local status now
set to: 'active'
Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN: G_CH_dispatch_int:
Dispatch function for read child took too long to execute: 620 ms (> 50
ms) (GSource: 0x982dc38)
Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN: G_CH_dispatch_int:
Dispatch function for read child was delayed 620 ms (> 500 ms) before
being called (GSource: 0x982dd68)
Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: G_CH_dispatch_int:
started at 429426259 should have started at 429426197
Feb 11 16:49:55 server-0001 harc[28350]: info: Running
/etc/ha.d/rc.d/status status
Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource
transition completed.
Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource
transition completed.
Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local Resource
acquisition completed. (none)
Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Initial resource
acquisition complete (T_RESOURCES(them))
Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: remote resource
transition completed.
Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: all clients are now
resumed