Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, at my site we have an ha-nfs server with drbd+heartbeat that recently "failed to fail-over" to the secondary node. All the tests prior to production and after this error didn't have the same behaviour. What we guess had happened is that for some reason heartbeat detected and initiated the fail-over process before drbd could go on a WFconnection state. From what we've understood drbd should by default detect a failure within 6 to 16 seconds, but in our case it took around 30 seconds (16:43:11 -> 16:43:41 there is some seconds of delay considering remote syslogging). Any comments, questions and enlightening are obviously more than appreciated. cheers, Massimo os: rhel-4.6 kernel: 2.6.9-55.0.2.ELsmp drbd: 0.7.24 heartbeat: 2.1.2-3 server-0000: eth0: 10.5.1.2/16 eth1: 192.168.0.1/24 server-0001: eth0: 10.5.1.3/16 eth1: 192.168.0.2/24 fail-over IP: 10.5.2.1/16 (fs-cluster) drbd.conf: resource drbd-resource-0 { protocol C; incon-degr-cmd "killall heartbeat"; startup { wfc-timeout 120; degr-wfc-timeout 120; # 2 minutes. } disk { on-io-error panic; } syncer { rate 10M; } on server-0000 { device /dev/drbd0; disk /dev/vg00/nfs; address 10.5.1.2:7789; meta-disk internal; } on server-0001 { device /dev/drbd0; disk /dev/vg00/nfs; address 10.5.1.3:7789; meta-disk internal; } } # drbdsetup /dev/drbd0 show Lower device: 253:04 (dm-4) Meta device: internal Disk options: on-io-error = panic Local address: 10.5.1.2:7789 Remote address: 10.5.1.3:7789 Wire protocol: C Net options: timeout = 6.0 sec (default) connect-int = 10 sec (default) ping-int = 10 sec (default) max-epoch-size = 2048 (default) max-buffers = 2048 (default) unplug-watermark = 128 (default) sndbuf-size = 131070 (default) ko-count = 0 (default) Syncer options: rate = 10240 KB/sec group = 0 (default) al-extents = 127 (default) ha.cf: logfacility local0 keepalive 2 deadtime 30 warntime 10 initdead 120 udpport 694 ucast eth0 10.5.1.2 ucast eth1 192.168.0.1 ucast eth0 10.5.1.3 ucast eth1 192.168.0.2 auto_failback off watchdog /dev/watchdog node server-0000 node server-0001 debug 0 use_logd no haresources: server-0000 IPaddr::10.5.2.1/16/eth0 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd nfs nfslock /var/log/messages: Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Daily informational memory statistics Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats: 10/24560866 ms age 0 [pid3586/MST_CONTROL] Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats: 821/803285302 141060/71222 [pid3586/MST_CONTROL] Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: RealMalloc stats: 168756 total malloc bytes. pid [3586/MST_CONTROL] Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Current arena value: 0 Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats: 0/1 ms age 766170962 [pid3626/HBFIFO] Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats: 447/512 46732/24700 [pid3626/HBFIFO] Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats: 47972 total malloc bytes. pid [3626/HBFIFO] Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena value: 0 Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms age 765428326 [pid3631/HBWRITE] Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: cl_malloc stats: 451/8668904 47580/25428 [pid3631/HBWRITE] Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats: 56032 total malloc bytes. pid [3631/HBWRITE] Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena value: 0 Feb 11 16:43:11 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms age 765432156 [pid3632/HBREAD] [...server-0000 dies...] Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link server-0000:eth0 dead. Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link server-0000:eth1 dead. Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: node server-0000: is dead Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: No STONITH device configured. Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: Shared disks are not protected. Feb 11 16:43:29 server-0001 heartbeat: [3483]: info: Resources being acquired from server-0000. Feb 11 16:43:29 server-0001 harc[23487]: info: Running /etc/ha.d/rc.d/status status Feb 11 16:43:29 server-0001 heartbeat: [23488]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys server-0001] to acquire. Feb 11 16:43:29 server-0001 mach_down[23516]: info: Taking over resource group IPaddr::10.5.2.1/16/eth0 Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Acquiring resource group: server-0000 IPaddr::10.5.2.1/16/eth0 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd nfs nfslock Feb 11 16:43:29 server-0001 IPaddr[23569]: INFO: Resource is stopped Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running /etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 start Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: Using calculated netmask for 10.5.2.1: 255.255.0.0 Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: eval ifconfig eth0:1 10.5.2.1 netmask 255.255.0.0 broadcast 10.5.255.255 Feb 11 16:43:29 server-0001 IPaddr[23638]: INFO: Success Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running /etc/ha.d/resource.d/drbddisk drbd-resource-0 start Feb 11 16:43:35 server-0001 ResourceManager[23542]: ERROR: Return code 20 from /etc/ha.d/resource.d/drbddisk Feb 11 16:43:35 server-0001 ResourceManager[23542]: CRIT: Giving up resources due to failure of drbddisk::drbd-resource-0 Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Releasing resource group: server-0000 IPaddr::10.5.2.1/16/eth0 drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd nfs nfslock Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running /etc/init.d/nfslock stop Feb 11 16:43:35 server-0001 nfslock: rpc.statd shutdown failed Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running /etc/init.d/nfs stop Feb 11 16:43:35 server-0001 nfs: rpc.mountd shutdown failed Feb 11 16:43:35 server-0001 nfs: nfsd shutdown failed Feb 11 16:43:35 server-0001 nfs: rpc.rquotad shutdown failed Feb 11 16:43:35 server-0001 exportfs[23925]: could not open /var/lib/nfs/etab for locking Feb 11 16:43:35 server-0001 exportfs[23925]: can't lock /var/lib/nfs/etab for writing Feb 11 16:43:35 server-0001 exportfs: exportfs: could not open /var/lib/nfs/etab for locking Feb 11 16:43:35 server-0001 exportfs: exportfs: can't lock /var/lib/nfs/etab for writing Feb 11 16:43:35 server-0001 nfs: Shutting down NFS services: succeeded Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running /etc/ha.d/resource.d/killnfsd stop Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs ext3 stop Feb 11 16:43:35 server-0001 Filesystem[23984]: INFO: Running stop for /dev/drbd0 on /nfs Feb 11 16:43:35 server-0001 Filesystem[23973]: INFO: Success Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running /etc/ha.d/resource.d/drbddisk drbd-resource-0 stop Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running /etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 stop Feb 11 16:43:35 server-0001 IPaddr[24111]: INFO: ifconfig eth0:1 down Feb 11 16:43:35 server-0001 IPaddr[24082]: INFO: Success Feb 11 16:43:35 server-0001 mach_down[23516]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired Feb 11 16:43:35 server-0001 mach_down[23516]: info: mach_down takeover complete for node server-0000. Feb 11 16:43:35 server-0001 heartbeat: [3483]: info: mach_down takeover complete. Feb 11 16:43:36 server-0004 kernel: nfs: server fs-cluster not responding, still trying Feb 11 16:43:41 server-0001 kernel: drbd0: PingAck did not arrive in time. Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_asender [27907]: cstate Connected --> NetworkFailure Feb 11 16:43:41 server-0001 kernel: drbd0: asender terminated Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate NetworkFailure --> BrokenPipe Feb 11 16:43:41 server-0001 kernel: drbd0: short read expecting header on sock: r=-512 Feb 11 16:43:41 server-0001 kernel: drbd0: worker terminated Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate BrokenPipe --> Unconnected Feb 11 16:43:41 server-0001 kernel: drbd0: Connection lost. Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate Unconnected --> WFConnection Feb 11 16:44:05 server-0001 hb_standby[24463]: Going standby [foreign]. Feb 11 16:44:06 server-0001 heartbeat: [3483]: info: server-0001 wants to go standby [foreign] Feb 11 16:44:16 server-0001 heartbeat: [3483]: WARN: No reply to standby request. Standby request cancelled. [...server-0000 restarts...] Feb 11 16:49:23 server-0000 syslogd 1.4.1: restart (remote reception). Feb 11 16:49:23 server-0000 syslog: syslogd startup succeeded Feb 11 16:49:23 server-0000 kernel: klogd 1.4.1, log source = /proc/kmsg started. Feb 11 16:49:23 server-0000 kernel: Linux version 2.6.9-55.0.2.ELsmp (brewbuilder at hs20-bc2-3.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-8)) #1 SMP Tue Jun 12 17:59:08 EDT 2007 [...] Feb 11 16:49:46 server-0000 network: Setting network parameters: succeeded Feb 11 16:49:46 server-0000 network: Bringing up loopback interface: succeeded Feb 11 16:49:46 server-0000 network: Bringing up interface eth0: succeeded Feb 11 16:49:46 server-0000 drbd: [ Feb 11 16:49:47 server-0000 network: Bringing up interface eth1: succeeded Feb 11 16:49:47 server-0000 kernel: drbd0: resync bitmap: bits=26181632 words=818176 Feb 11 16:49:47 server-0000 drbd: d0 Feb 11 16:49:47 server-0000 kernel: drbd0: size = 99 GB (104726528 KB) Feb 11 16:49:48 server-0000 kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map. Feb 11 16:49:48 server-0000 kernel: drbd0: Found 4 transactions (192 active extents) in activity log. Feb 11 16:49:48 server-0000 kernel: drbd0: Marked additional 508 MB as out-of-sync based on AL. Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3508]: cstate Unconfigured --> StandAlone Feb 11 16:49:48 server-0000 drbd: s0 Feb 11 16:49:48 server-0000 drbd: n0 Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3521]: cstate StandAlone --> Unconnected Feb 11 16:49:48 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate Unconnected --> WFConnection Feb 11 16:49:48 server-0000 drbd: ]. Feb 11 16:49:48 server-0000 drbd: WARN: stdin/stdout is not a TTY; using /dev/console Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate WFConnection --> WFReportParams Feb 11 16:49:51 server-0001 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate WFConnection --> WFReportParams Feb 11 16:49:51 server-0001 kernel: drbd0: Connection established. Feb 11 16:49:51 server-0000 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74 Feb 11 16:49:51 server-0001 kernel: drbd0: I am(S): 1:00000002:00000001:00000043:00000001:01 Feb 11 16:49:51 server-0000 kernel: drbd0: Connection established. Feb 11 16:49:51 server-0001 kernel: drbd0: Peer(S): 1:00000002:00000001:00000043:00000001:11 Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate WFReportParams --> WFBitMapT Feb 11 16:49:51 server-0000 kernel: drbd0: I am(S): 1:00000002:00000001:00000043:00000001:11 Feb 11 16:49:51 server-0001 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: cstate WFBitMapT --> SyncTarget Feb 11 16:49:51 server-0000 kernel: drbd0: Peer(S): 1:00000002:00000001:00000043:00000001:01 Feb 11 16:49:51 server-0001 kernel: drbd0: Resync started as SyncTarget (need to sync 520192 KB [130048 bits set]). Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate WFReportParams --> WFBitMapS Feb 11 16:49:51 server-0000 rc: Starting drbd: succeeded Feb 11 16:49:51 server-0000 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: cstate WFBitMapS --> SyncSource Feb 11 16:49:51 server-0000 kernel: drbd0: Resync started as SyncSource (need to sync 520192 KB [130048 bits set]). Feb 11 16:49:51 server-0000 logd: [3547]: info: logd started with default configuration. Feb 11 16:49:52 server-0000 logd: [3555]: info: G_main_add_SignalHandler: Added signal handler for signal 15 Feb 11 16:49:52 server-0000 logd: [3547]: info: G_main_add_SignalHandler: Added signal handler for signal 15 Feb 11 16:49:52 server-0000 kernel: Software Watchdog Timer: 0.07 initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0) Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is disabled --enabling logging daemon is recommended Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Version 2 support: false Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is disabled --enabling logging daemon is recommended Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: ************************** Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Configuration validated. Starting heartbeat 2.1.2 Feb 11 16:49:52 server-0000 heartbeat: [3671]: info: heartbeat: version 2.1.2 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Heartbeat generation: 1187008426 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: G_main_add_TriggerHandler: Added signal manual handler Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: G_main_add_TriggerHandler: Added signal manual handler Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Removing /var/run/heartbeat/rsctmp failed, recreating. Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound send socket to device: eth0 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound receive socket to device: eth0 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: started on port 694 interface eth0 to 10.5.1.2 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound send socket to device: eth1 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound receive socket to device: eth1 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: started on port 694 interface eth1 to 192.168.0.1 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound send socket to device: eth0 Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: bound receive socket to device: eth0 Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: started on port 694 interface eth0 to 10.5.1.3 Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1 Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: bound send socket to device: eth1 Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: bound receive socket to device: eth1 Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Heartbeat restart on node server-0000 Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link server-0000:eth0 up. Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: started on port 694 interface eth1 to 192.168.0.2 Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for node server-0000: status init Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for node server-0000: status up Feb 11 16:49:54 server-0000 heartbeat: [3671]: notice: Using watchdog device: /dev/watchdog Feb 11 16:49:54 server-0001 harc[28317]: info: Running /etc/ha.d/rc.d/status status Feb 11 16:49:54 server-0001 harc[28334]: info: Running /etc/ha.d/rc.d/status status Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: all clients are now paused Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link server-0000:eth1 up. Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: G_main_add_SignalHandler: Added signal handler for signal 17 Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Local status now set to: 'up' Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link server-0001:eth0 up. Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Status update for node server-0001: status active Feb 11 16:49:55 server-0000 harc[3772]: info: Running /etc/ha.d/rc.d/status status Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link server-0001:eth1 up. Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Comm_now_up(): updating status to active Feb 11 16:49:55 server-0001 heartbeat: [3483]: info: Status update for node server-0000: status active Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local status now set to: 'active' Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 620 ms (> 50 ms) (GSource: 0x982dc38) Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN: G_CH_dispatch_int: Dispatch function for read child was delayed 620 ms (> 500 ms) before being called (GSource: 0x982dd68) Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: G_CH_dispatch_int: started at 429426259 should have started at 429426197 Feb 11 16:49:55 server-0001 harc[28350]: info: Running /etc/ha.d/rc.d/status status Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource transition completed. Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource transition completed. Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local Resource acquisition completed. (none) Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Initial resource acquisition complete (T_RESOURCES(them)) Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: remote resource transition completed. Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: all clients are now resumed