Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Anyone so kind to have a look at this?
cheers
Massimo
Massimo Mongardini wrote:
> Hi, at my site we have an ha-nfs server with drbd+heartbeat that
> recently "failed to fail-over" to the secondary node. All the tests
> prior to production and after this error didn't have the same behaviour.
> What we guess had happened is that for some reason heartbeat
> detected and initiated the fail-over process before drbd could go on a
> WFconnection state.
> From what we've understood drbd should by default detect a failure
> within 6 to 16 seconds, but in our case it took around 30 seconds
> (16:43:11 -> 16:43:41 there is some seconds of delay considering
> remote syslogging).
> Any comments, questions and enlightening are obviously more than
> appreciated.
>
> cheers,
> Massimo
>
> os: rhel-4.6
> kernel: 2.6.9-55.0.2.ELsmp
> drbd: 0.7.24
> heartbeat: 2.1.2-3
>
> server-0000:
> eth0: 10.5.1.2/16
> eth1: 192.168.0.1/24
>
> server-0001:
> eth0: 10.5.1.3/16
> eth1: 192.168.0.2/24
>
> fail-over IP: 10.5.2.1/16 (fs-cluster)
>
> drbd.conf:
>
> resource drbd-resource-0 {
> protocol C;
> incon-degr-cmd "killall heartbeat";
> startup {
> wfc-timeout 120;
> degr-wfc-timeout 120; # 2 minutes.
> }
> disk {
> on-io-error panic;
> }
> syncer {
> rate 10M;
> }
> on server-0000 {
> device /dev/drbd0;
> disk /dev/vg00/nfs;
> address 10.5.1.2:7789; meta-disk internal;
> }
> on server-0001 {
> device /dev/drbd0;
> disk /dev/vg00/nfs;
> address 10.5.1.3:7789;
> meta-disk internal;
> }
> }
>
> # drbdsetup /dev/drbd0 show
> Lower device: 253:04 (dm-4)
> Meta device: internal
> Disk options:
> on-io-error = panic
> Local address: 10.5.1.2:7789
> Remote address: 10.5.1.3:7789
> Wire protocol: C
> Net options:
> timeout = 6.0 sec (default)
> connect-int = 10 sec (default)
> ping-int = 10 sec (default)
> max-epoch-size = 2048 (default)
> max-buffers = 2048 (default)
> unplug-watermark = 128 (default)
> sndbuf-size = 131070 (default)
> ko-count = 0 (default)
> Syncer options:
> rate = 10240 KB/sec
> group = 0 (default)
> al-extents = 127 (default)
>
>
> ha.cf:
>
> logfacility local0
> keepalive 2
> deadtime 30
> warntime 10
> initdead 120
> udpport 694
> ucast eth0 10.5.1.2
> ucast eth1 192.168.0.1
> ucast eth0 10.5.1.3
> ucast eth1 192.168.0.2
> auto_failback off
> watchdog /dev/watchdog
> node server-0000
> node server-0001
> debug 0
> use_logd no
>
> haresources:
>
> server-0000 IPaddr::10.5.2.1/16/eth0 drbddisk::drbd-resource-0
> Filesystem::/dev/drbd0::/nfs::ext3 killnfsd nfs nfslock
>
>
> /var/log/messages:
>
> Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Daily
> informational memory statistics
> Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats:
> 10/24560866 ms age 0 [pid3586/MST_CONTROL]
> Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats:
> 821/803285302 141060/71222 [pid3586/MST_CONTROL]
> Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: RealMalloc stats:
> 168756 total malloc bytes. pid [3586/MST_CONTROL]
> Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Current arena
> value: 0
> Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats: 0/1 ms
> age 766170962 [pid3626/HBFIFO]
> Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats:
> 447/512 46732/24700 [pid3626/HBFIFO]
> Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats:
> 47972 total malloc bytes. pid [3626/HBFIFO]
> Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena
> value: 0
> Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms
> age 765428326 [pid3631/HBWRITE]
> Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: cl_malloc stats:
> 451/8668904 47580/25428 [pid3631/HBWRITE]
> Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats:
> 56032 total malloc bytes. pid [3631/HBWRITE]
> Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena
> value: 0
> Feb 11 16:43:11 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms
> age 765432156 [pid3632/HBREAD]
>
> [...server-0000 dies...]
>
> Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link
> server-0000:eth0 dead.
> Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link
> server-0000:eth1 dead.
> Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: node server-0000:
> is dead
> Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: No STONITH device
> configured.
> Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: Shared disks are
> not protected.
> Feb 11 16:43:29 server-0001 heartbeat: [3483]: info: Resources being
> acquired from server-0000.
> Feb 11 16:43:29 server-0001 harc[23487]: info: Running
> /etc/ha.d/rc.d/status status
> Feb 11 16:43:29 server-0001 heartbeat: [23488]: info: No local
> resources [/usr/share/heartbeat/ResourceManager listkeys server-0001]
> to acquire.
> Feb 11 16:43:29 server-0001 mach_down[23516]: info: Taking over
> resource group IPaddr::10.5.2.1/16/eth0
> Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Acquiring
> resource group: server-0000 IPaddr::10.5.2.1/16/eth0
> drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd
> nfs nfslock
> Feb 11 16:43:29 server-0001 IPaddr[23569]: INFO: Resource is stopped
> Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running
> /etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 start
> Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: Using calculated
> netmask for 10.5.2.1: 255.255.0.0
> Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: eval ifconfig eth0:1
> 10.5.2.1 netmask 255.255.0.0 broadcast 10.5.255.255
> Feb 11 16:43:29 server-0001 IPaddr[23638]: INFO: Success
> Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running
> /etc/ha.d/resource.d/drbddisk drbd-resource-0 start
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: ERROR: Return code
> 20 from /etc/ha.d/resource.d/drbddisk
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: CRIT: Giving up
> resources due to failure of drbddisk::drbd-resource-0
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Releasing
> resource group: server-0000 IPaddr::10.5.2.1/16/eth0
> drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd
> nfs nfslock
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
> /etc/init.d/nfslock stop
> Feb 11 16:43:35 server-0001 nfslock: rpc.statd shutdown failed
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
> /etc/init.d/nfs stop
> Feb 11 16:43:35 server-0001 nfs: rpc.mountd shutdown failed
> Feb 11 16:43:35 server-0001 nfs: nfsd shutdown failed
> Feb 11 16:43:35 server-0001 nfs: rpc.rquotad shutdown failed
> Feb 11 16:43:35 server-0001 exportfs[23925]: could not open
> /var/lib/nfs/etab for locking
> Feb 11 16:43:35 server-0001 exportfs[23925]: can't lock
> /var/lib/nfs/etab for writing
> Feb 11 16:43:35 server-0001 exportfs: exportfs: could not open
> /var/lib/nfs/etab for locking
> Feb 11 16:43:35 server-0001 exportfs: exportfs: can't lock
> /var/lib/nfs/etab for writing
> Feb 11 16:43:35 server-0001 nfs: Shutting down NFS services: succeeded
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
> /etc/ha.d/resource.d/killnfsd stop
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
> /etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs ext3 stop
> Feb 11 16:43:35 server-0001 Filesystem[23984]: INFO: Running stop for
> /dev/drbd0 on /nfs
> Feb 11 16:43:35 server-0001 Filesystem[23973]: INFO: Success
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
> /etc/ha.d/resource.d/drbddisk drbd-resource-0 stop
> Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running
> /etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 stop
> Feb 11 16:43:35 server-0001 IPaddr[24111]: INFO: ifconfig eth0:1 down
> Feb 11 16:43:35 server-0001 IPaddr[24082]: INFO: Success
> Feb 11 16:43:35 server-0001 mach_down[23516]: info:
> /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
> Feb 11 16:43:35 server-0001 mach_down[23516]: info: mach_down takeover
> complete for node server-0000.
> Feb 11 16:43:35 server-0001 heartbeat: [3483]: info: mach_down
> takeover complete.
> Feb 11 16:43:36 server-0004 kernel: nfs: server fs-cluster not
> responding, still trying
> Feb 11 16:43:41 server-0001 kernel: drbd0: PingAck did not arrive in
> time.
> Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_asender [27907]:
> cstate Connected --> NetworkFailure
> Feb 11 16:43:41 server-0001 kernel: drbd0: asender terminated
> Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]:
> cstate NetworkFailure --> BrokenPipe
> Feb 11 16:43:41 server-0001 kernel: drbd0: short read expecting header
> on sock: r=-512
> Feb 11 16:43:41 server-0001 kernel: drbd0: worker terminated
> Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]:
> cstate BrokenPipe --> Unconnected
> Feb 11 16:43:41 server-0001 kernel: drbd0: Connection lost.
> Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]:
> cstate Unconnected --> WFConnection
> Feb 11 16:44:05 server-0001 hb_standby[24463]: Going standby [foreign].
> Feb 11 16:44:06 server-0001 heartbeat: [3483]: info: server-0001 wants
> to go standby [foreign]
> Feb 11 16:44:16 server-0001 heartbeat: [3483]: WARN: No reply to
> standby request. Standby request cancelled.
>
> [...server-0000 restarts...]
>
> Feb 11 16:49:23 server-0000 syslogd 1.4.1: restart (remote reception).
> Feb 11 16:49:23 server-0000 syslog: syslogd startup succeeded
> Feb 11 16:49:23 server-0000 kernel: klogd 1.4.1, log source =
> /proc/kmsg started.
> Feb 11 16:49:23 server-0000 kernel: Linux version 2.6.9-55.0.2.ELsmp
> (brewbuilder at hs20-bc2-3.build.redhat.com) (gcc version 3.4.6 20060404
> (Red Hat 3.4.6-8)) #1 SMP Tue Jun 12 17:59:08 EDT 2007
>
> [...]
>
> Feb 11 16:49:46 server-0000 network: Setting network parameters:
> succeeded
> Feb 11 16:49:46 server-0000 network: Bringing up loopback interface:
> succeeded
> Feb 11 16:49:46 server-0000 network: Bringing up interface eth0:
> succeeded
> Feb 11 16:49:46 server-0000 drbd: [
> Feb 11 16:49:47 server-0000 network: Bringing up interface eth1:
> succeeded
> Feb 11 16:49:47 server-0000 kernel: drbd0: resync bitmap:
> bits=26181632 words=818176
> Feb 11 16:49:47 server-0000 drbd: d0
> Feb 11 16:49:47 server-0000 kernel: drbd0: size = 99 GB (104726528 KB)
> Feb 11 16:49:48 server-0000 kernel: drbd0: 0 KB marked out-of-sync by
> on disk bit-map.
> Feb 11 16:49:48 server-0000 kernel: drbd0: Found 4 transactions (192
> active extents) in activity log.
> Feb 11 16:49:48 server-0000 kernel: drbd0: Marked additional 508 MB as
> out-of-sync based on AL.
> Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3508]: cstate
> Unconfigured --> StandAlone
> Feb 11 16:49:48 server-0000 drbd: s0
> Feb 11 16:49:48 server-0000 drbd: n0
> Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3521]: cstate
> StandAlone --> Unconnected
> Feb 11 16:49:48 server-0000 kernel: drbd0: drbd0_receiver [3522]:
> cstate Unconnected --> WFConnection
> Feb 11 16:49:48 server-0000 drbd: ].
> Feb 11 16:49:48 server-0000 drbd: WARN: stdin/stdout is not a TTY;
> using /dev/console
> Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]:
> cstate WFConnection --> WFReportParams
> Feb 11 16:49:51 server-0001 kernel: drbd0: Handshake successful: DRBD
> Network Protocol version 74
> Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]:
> cstate WFConnection --> WFReportParams
> Feb 11 16:49:51 server-0001 kernel: drbd0: Connection established.
> Feb 11 16:49:51 server-0000 kernel: drbd0: Handshake successful: DRBD
> Network Protocol version 74
> Feb 11 16:49:51 server-0001 kernel: drbd0: I am(S):
> 1:00000002:00000001:00000043:00000001:01
> Feb 11 16:49:51 server-0000 kernel: drbd0: Connection established.
> Feb 11 16:49:51 server-0001 kernel: drbd0: Peer(S):
> 1:00000002:00000001:00000043:00000001:11
> Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]:
> cstate WFReportParams --> WFBitMapT
> Feb 11 16:49:51 server-0000 kernel: drbd0: I am(S):
> 1:00000002:00000001:00000043:00000001:11
> Feb 11 16:49:51 server-0001 kernel: drbd0: Secondary/Unknown -->
> Secondary/Secondary
> Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]:
> cstate WFBitMapT --> SyncTarget
> Feb 11 16:49:51 server-0000 kernel: drbd0: Peer(S):
> 1:00000002:00000001:00000043:00000001:01
> Feb 11 16:49:51 server-0001 kernel: drbd0: Resync started as
> SyncTarget (need to sync 520192 KB [130048 bits set]).
> Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]:
> cstate WFReportParams --> WFBitMapS
> Feb 11 16:49:51 server-0000 rc: Starting drbd: succeeded
> Feb 11 16:49:51 server-0000 kernel: drbd0: Secondary/Unknown -->
> Secondary/Secondary
> Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]:
> cstate WFBitMapS --> SyncSource
> Feb 11 16:49:51 server-0000 kernel: drbd0: Resync started as
> SyncSource (need to sync 520192 KB [130048 bits set]).
> Feb 11 16:49:51 server-0000 logd: [3547]: info: logd started with
> default configuration.
> Feb 11 16:49:52 server-0000 logd: [3555]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Feb 11 16:49:52 server-0000 logd: [3547]: info:
> G_main_add_SignalHandler: Added signal handler for signal 15
> Feb 11 16:49:52 server-0000 kernel: Software Watchdog Timer: 0.07
> initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0)
> Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is
> disabled --enabling logging daemon is recommended
> Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Version 2
> support: false
> Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is
> disabled --enabling logging daemon is recommended
> Feb 11 16:49:52 server-0000 heartbeat: [3668]: info:
> **************************
> Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Configuration
> validated. Starting heartbeat 2.1.2
> Feb 11 16:49:52 server-0000 heartbeat: [3671]: info: heartbeat:
> version 2.1.2
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Heartbeat
> generation: 1187008426
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Removing
> /var/run/heartbeat/rsctmp failed, recreating.
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> write socket priority set to IPTOS_LOWDELAY on eth0
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound send socket to device: eth0
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound receive socket to device: eth0
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> started on port 694 interface eth0 to 10.5.1.2
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> write socket priority set to IPTOS_LOWDELAY on eth1
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound send socket to device: eth1
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound receive socket to device: eth1
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> started on port 694 interface eth1 to 192.168.0.1
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> write socket priority set to IPTOS_LOWDELAY on eth0
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound send socket to device: eth0
> Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound receive socket to device: eth0
> Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast:
> started on port 694 interface eth0 to 10.5.1.3
> Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast:
> write socket priority set to IPTOS_LOWDELAY on eth1
> Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound send socket to device: eth1
> Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast:
> bound receive socket to device: eth1
> Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Heartbeat restart
> on node server-0000
> Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link
> server-0000:eth0 up.
> Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast:
> started on port 694 interface eth1 to 192.168.0.2
> Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for
> node server-0000: status init
> Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for
> node server-0000: status up
> Feb 11 16:49:54 server-0000 heartbeat: [3671]: notice: Using watchdog
> device: /dev/watchdog
> Feb 11 16:49:54 server-0001 harc[28317]: info: Running
> /etc/ha.d/rc.d/status status
> Feb 11 16:49:54 server-0001 harc[28334]: info: Running
> /etc/ha.d/rc.d/status status
> Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: all clients are
> now paused
> Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link
> server-0000:eth1 up.
> Feb 11 16:49:55 server-0000 heartbeat: [3671]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Local status now
> set to: 'up'
> Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link
> server-0001:eth0 up.
> Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Status update for
> node server-0001: status active
> Feb 11 16:49:55 server-0000 harc[3772]: info: Running
> /etc/ha.d/rc.d/status status
> Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link
> server-0001:eth1 up.
> Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Comm_now_up():
> updating status to active
> Feb 11 16:49:55 server-0001 heartbeat: [3483]: info: Status update for
> node server-0000: status active
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local status now
> set to: 'active'
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN:
> G_CH_dispatch_int: Dispatch function for read child took too long to
> execute: 620 ms (> 50 ms) (GSource: 0x982dc38)
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN:
> G_CH_dispatch_int: Dispatch function for read child was delayed 620 ms
> (> 500 ms) before being called (GSource: 0x982dd68)
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: info:
> G_CH_dispatch_int: started at 429426259 should have started at 429426197
> Feb 11 16:49:55 server-0001 harc[28350]: info: Running
> /etc/ha.d/rc.d/status status
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource
> transition completed.
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource
> transition completed.
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local Resource
> acquisition completed. (none)
> Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Initial resource
> acquisition complete (T_RESOURCES(them))
> Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: remote resource
> transition completed.
> Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: all clients are
> now resumed
>
>