Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Anyone so kind to have a look at this? cheers Massimo Massimo Mongardini wrote: > Hi, at my site we have an ha-nfs server with drbd+heartbeat that > recently "failed to fail-over" to the secondary node. All the tests > prior to production and after this error didn't have the same behaviour. > What we guess had happened is that for some reason heartbeat > detected and initiated the fail-over process before drbd could go on a > WFconnection state. > From what we've understood drbd should by default detect a failure > within 6 to 16 seconds, but in our case it took around 30 seconds > (16:43:11 -> 16:43:41 there is some seconds of delay considering > remote syslogging). > Any comments, questions and enlightening are obviously more than > appreciated. > > cheers, > Massimo > > os: rhel-4.6 > kernel: 2.6.9-55.0.2.ELsmp > drbd: 0.7.24 > heartbeat: 2.1.2-3 > > server-0000: > eth0: 10.5.1.2/16 > eth1: 192.168.0.1/24 > > server-0001: > eth0: 10.5.1.3/16 > eth1: 192.168.0.2/24 > > fail-over IP: 10.5.2.1/16 (fs-cluster) > > drbd.conf: > > resource drbd-resource-0 { > protocol C; > incon-degr-cmd "killall heartbeat"; > startup { > wfc-timeout 120; > degr-wfc-timeout 120; # 2 minutes. > } > disk { > on-io-error panic; > } > syncer { > rate 10M; > } > on server-0000 { > device /dev/drbd0; > disk /dev/vg00/nfs; > address 10.5.1.2:7789; meta-disk internal; > } > on server-0001 { > device /dev/drbd0; > disk /dev/vg00/nfs; > address 10.5.1.3:7789; > meta-disk internal; > } > } > > # drbdsetup /dev/drbd0 show > Lower device: 253:04 (dm-4) > Meta device: internal > Disk options: > on-io-error = panic > Local address: 10.5.1.2:7789 > Remote address: 10.5.1.3:7789 > Wire protocol: C > Net options: > timeout = 6.0 sec (default) > connect-int = 10 sec (default) > ping-int = 10 sec (default) > max-epoch-size = 2048 (default) > max-buffers = 2048 (default) > unplug-watermark = 128 (default) > sndbuf-size = 131070 (default) > ko-count = 0 (default) > Syncer options: > rate = 10240 KB/sec > group = 0 (default) > al-extents = 127 (default) > > > ha.cf: > > logfacility local0 > keepalive 2 > deadtime 30 > warntime 10 > initdead 120 > udpport 694 > ucast eth0 10.5.1.2 > ucast eth1 192.168.0.1 > ucast eth0 10.5.1.3 > ucast eth1 192.168.0.2 > auto_failback off > watchdog /dev/watchdog > node server-0000 > node server-0001 > debug 0 > use_logd no > > haresources: > > server-0000 IPaddr::10.5.2.1/16/eth0 drbddisk::drbd-resource-0 > Filesystem::/dev/drbd0::/nfs::ext3 killnfsd nfs nfslock > > > /var/log/messages: > > Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Daily > informational memory statistics > Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats: > 10/24560866 ms age 0 [pid3586/MST_CONTROL] > Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats: > 821/803285302 141060/71222 [pid3586/MST_CONTROL] > Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: RealMalloc stats: > 168756 total malloc bytes. pid [3586/MST_CONTROL] > Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: Current arena > value: 0 > Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: MSG stats: 0/1 ms > age 766170962 [pid3626/HBFIFO] > Feb 11 16:43:01 server-0000 heartbeat: [3586]: info: cl_malloc stats: > 447/512 46732/24700 [pid3626/HBFIFO] > Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats: > 47972 total malloc bytes. pid [3626/HBFIFO] > Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena > value: 0 > Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms > age 765428326 [pid3631/HBWRITE] > Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: cl_malloc stats: > 451/8668904 47580/25428 [pid3631/HBWRITE] > Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: RealMalloc stats: > 56032 total malloc bytes. pid [3631/HBWRITE] > Feb 11 16:43:06 server-0000 heartbeat: [3586]: info: Current arena > value: 0 > Feb 11 16:43:11 server-0000 heartbeat: [3586]: info: MSG stats: 0/0 ms > age 765432156 [pid3632/HBREAD] > > [...server-0000 dies...] > > Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link > server-0000:eth0 dead. > Feb 11 16:43:27 server-0001 heartbeat: [3483]: info: Link > server-0000:eth1 dead. > Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: node server-0000: > is dead > Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: No STONITH device > configured. > Feb 11 16:43:29 server-0001 heartbeat: [3483]: WARN: Shared disks are > not protected. > Feb 11 16:43:29 server-0001 heartbeat: [3483]: info: Resources being > acquired from server-0000. > Feb 11 16:43:29 server-0001 harc[23487]: info: Running > /etc/ha.d/rc.d/status status > Feb 11 16:43:29 server-0001 heartbeat: [23488]: info: No local > resources [/usr/share/heartbeat/ResourceManager listkeys server-0001] > to acquire. > Feb 11 16:43:29 server-0001 mach_down[23516]: info: Taking over > resource group IPaddr::10.5.2.1/16/eth0 > Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Acquiring > resource group: server-0000 IPaddr::10.5.2.1/16/eth0 > drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd > nfs nfslock > Feb 11 16:43:29 server-0001 IPaddr[23569]: INFO: Resource is stopped > Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running > /etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 start > Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: Using calculated > netmask for 10.5.2.1: 255.255.0.0 > Feb 11 16:43:29 server-0001 IPaddr[23667]: INFO: eval ifconfig eth0:1 > 10.5.2.1 netmask 255.255.0.0 broadcast 10.5.255.255 > Feb 11 16:43:29 server-0001 IPaddr[23638]: INFO: Success > Feb 11 16:43:29 server-0001 ResourceManager[23542]: info: Running > /etc/ha.d/resource.d/drbddisk drbd-resource-0 start > Feb 11 16:43:35 server-0001 ResourceManager[23542]: ERROR: Return code > 20 from /etc/ha.d/resource.d/drbddisk > Feb 11 16:43:35 server-0001 ResourceManager[23542]: CRIT: Giving up > resources due to failure of drbddisk::drbd-resource-0 > Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Releasing > resource group: server-0000 IPaddr::10.5.2.1/16/eth0 > drbddisk::drbd-resource-0 Filesystem::/dev/drbd0::/nfs::ext3 killnfsd > nfs nfslock > Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running > /etc/init.d/nfslock stop > Feb 11 16:43:35 server-0001 nfslock: rpc.statd shutdown failed > Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running > /etc/init.d/nfs stop > Feb 11 16:43:35 server-0001 nfs: rpc.mountd shutdown failed > Feb 11 16:43:35 server-0001 nfs: nfsd shutdown failed > Feb 11 16:43:35 server-0001 nfs: rpc.rquotad shutdown failed > Feb 11 16:43:35 server-0001 exportfs[23925]: could not open > /var/lib/nfs/etab for locking > Feb 11 16:43:35 server-0001 exportfs[23925]: can't lock > /var/lib/nfs/etab for writing > Feb 11 16:43:35 server-0001 exportfs: exportfs: could not open > /var/lib/nfs/etab for locking > Feb 11 16:43:35 server-0001 exportfs: exportfs: can't lock > /var/lib/nfs/etab for writing > Feb 11 16:43:35 server-0001 nfs: Shutting down NFS services: succeeded > Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running > /etc/ha.d/resource.d/killnfsd stop > Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running > /etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs ext3 stop > Feb 11 16:43:35 server-0001 Filesystem[23984]: INFO: Running stop for > /dev/drbd0 on /nfs > Feb 11 16:43:35 server-0001 Filesystem[23973]: INFO: Success > Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running > /etc/ha.d/resource.d/drbddisk drbd-resource-0 stop > Feb 11 16:43:35 server-0001 ResourceManager[23542]: info: Running > /etc/ha.d/resource.d/IPaddr 10.5.2.1/16/eth0 stop > Feb 11 16:43:35 server-0001 IPaddr[24111]: INFO: ifconfig eth0:1 down > Feb 11 16:43:35 server-0001 IPaddr[24082]: INFO: Success > Feb 11 16:43:35 server-0001 mach_down[23516]: info: > /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired > Feb 11 16:43:35 server-0001 mach_down[23516]: info: mach_down takeover > complete for node server-0000. > Feb 11 16:43:35 server-0001 heartbeat: [3483]: info: mach_down > takeover complete. > Feb 11 16:43:36 server-0004 kernel: nfs: server fs-cluster not > responding, still trying > Feb 11 16:43:41 server-0001 kernel: drbd0: PingAck did not arrive in > time. > Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_asender [27907]: > cstate Connected --> NetworkFailure > Feb 11 16:43:41 server-0001 kernel: drbd0: asender terminated > Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: > cstate NetworkFailure --> BrokenPipe > Feb 11 16:43:41 server-0001 kernel: drbd0: short read expecting header > on sock: r=-512 > Feb 11 16:43:41 server-0001 kernel: drbd0: worker terminated > Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: > cstate BrokenPipe --> Unconnected > Feb 11 16:43:41 server-0001 kernel: drbd0: Connection lost. > Feb 11 16:43:41 server-0001 kernel: drbd0: drbd0_receiver [3339]: > cstate Unconnected --> WFConnection > Feb 11 16:44:05 server-0001 hb_standby[24463]: Going standby [foreign]. > Feb 11 16:44:06 server-0001 heartbeat: [3483]: info: server-0001 wants > to go standby [foreign] > Feb 11 16:44:16 server-0001 heartbeat: [3483]: WARN: No reply to > standby request. Standby request cancelled. > > [...server-0000 restarts...] > > Feb 11 16:49:23 server-0000 syslogd 1.4.1: restart (remote reception). > Feb 11 16:49:23 server-0000 syslog: syslogd startup succeeded > Feb 11 16:49:23 server-0000 kernel: klogd 1.4.1, log source = > /proc/kmsg started. > Feb 11 16:49:23 server-0000 kernel: Linux version 2.6.9-55.0.2.ELsmp > (brewbuilder at hs20-bc2-3.build.redhat.com) (gcc version 3.4.6 20060404 > (Red Hat 3.4.6-8)) #1 SMP Tue Jun 12 17:59:08 EDT 2007 > > [...] > > Feb 11 16:49:46 server-0000 network: Setting network parameters: > succeeded > Feb 11 16:49:46 server-0000 network: Bringing up loopback interface: > succeeded > Feb 11 16:49:46 server-0000 network: Bringing up interface eth0: > succeeded > Feb 11 16:49:46 server-0000 drbd: [ > Feb 11 16:49:47 server-0000 network: Bringing up interface eth1: > succeeded > Feb 11 16:49:47 server-0000 kernel: drbd0: resync bitmap: > bits=26181632 words=818176 > Feb 11 16:49:47 server-0000 drbd: d0 > Feb 11 16:49:47 server-0000 kernel: drbd0: size = 99 GB (104726528 KB) > Feb 11 16:49:48 server-0000 kernel: drbd0: 0 KB marked out-of-sync by > on disk bit-map. > Feb 11 16:49:48 server-0000 kernel: drbd0: Found 4 transactions (192 > active extents) in activity log. > Feb 11 16:49:48 server-0000 kernel: drbd0: Marked additional 508 MB as > out-of-sync based on AL. > Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3508]: cstate > Unconfigured --> StandAlone > Feb 11 16:49:48 server-0000 drbd: s0 > Feb 11 16:49:48 server-0000 drbd: n0 > Feb 11 16:49:48 server-0000 kernel: drbd0: drbdsetup [3521]: cstate > StandAlone --> Unconnected > Feb 11 16:49:48 server-0000 kernel: drbd0: drbd0_receiver [3522]: > cstate Unconnected --> WFConnection > Feb 11 16:49:48 server-0000 drbd: ]. > Feb 11 16:49:48 server-0000 drbd: WARN: stdin/stdout is not a TTY; > using /dev/console > Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: > cstate WFConnection --> WFReportParams > Feb 11 16:49:51 server-0001 kernel: drbd0: Handshake successful: DRBD > Network Protocol version 74 > Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: > cstate WFConnection --> WFReportParams > Feb 11 16:49:51 server-0001 kernel: drbd0: Connection established. > Feb 11 16:49:51 server-0000 kernel: drbd0: Handshake successful: DRBD > Network Protocol version 74 > Feb 11 16:49:51 server-0001 kernel: drbd0: I am(S): > 1:00000002:00000001:00000043:00000001:01 > Feb 11 16:49:51 server-0000 kernel: drbd0: Connection established. > Feb 11 16:49:51 server-0001 kernel: drbd0: Peer(S): > 1:00000002:00000001:00000043:00000001:11 > Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: > cstate WFReportParams --> WFBitMapT > Feb 11 16:49:51 server-0000 kernel: drbd0: I am(S): > 1:00000002:00000001:00000043:00000001:11 > Feb 11 16:49:51 server-0001 kernel: drbd0: Secondary/Unknown --> > Secondary/Secondary > Feb 11 16:49:51 server-0001 kernel: drbd0: drbd0_receiver [3339]: > cstate WFBitMapT --> SyncTarget > Feb 11 16:49:51 server-0000 kernel: drbd0: Peer(S): > 1:00000002:00000001:00000043:00000001:01 > Feb 11 16:49:51 server-0001 kernel: drbd0: Resync started as > SyncTarget (need to sync 520192 KB [130048 bits set]). > Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: > cstate WFReportParams --> WFBitMapS > Feb 11 16:49:51 server-0000 rc: Starting drbd: succeeded > Feb 11 16:49:51 server-0000 kernel: drbd0: Secondary/Unknown --> > Secondary/Secondary > Feb 11 16:49:51 server-0000 kernel: drbd0: drbd0_receiver [3522]: > cstate WFBitMapS --> SyncSource > Feb 11 16:49:51 server-0000 kernel: drbd0: Resync started as > SyncSource (need to sync 520192 KB [130048 bits set]). > Feb 11 16:49:51 server-0000 logd: [3547]: info: logd started with > default configuration. > Feb 11 16:49:52 server-0000 logd: [3555]: info: > G_main_add_SignalHandler: Added signal handler for signal 15 > Feb 11 16:49:52 server-0000 logd: [3547]: info: > G_main_add_SignalHandler: Added signal handler for signal 15 > Feb 11 16:49:52 server-0000 kernel: Software Watchdog Timer: 0.07 > initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0) > Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is > disabled --enabling logging daemon is recommended > Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Version 2 > support: false > Feb 11 16:49:52 server-0000 heartbeat: [3668]: WARN: Logging daemon is > disabled --enabling logging daemon is recommended > Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: > ************************** > Feb 11 16:49:52 server-0000 heartbeat: [3668]: info: Configuration > validated. Starting heartbeat 2.1.2 > Feb 11 16:49:52 server-0000 heartbeat: [3671]: info: heartbeat: > version 2.1.2 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Heartbeat > generation: 1187008426 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: > G_main_add_TriggerHandler: Added signal manual handler > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: > G_main_add_TriggerHandler: Added signal manual handler > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: Removing > /var/run/heartbeat/rsctmp failed, recreating. > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > write socket priority set to IPTOS_LOWDELAY on eth0 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > bound send socket to device: eth0 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > bound receive socket to device: eth0 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > started on port 694 interface eth0 to 10.5.1.2 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > write socket priority set to IPTOS_LOWDELAY on eth1 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > bound send socket to device: eth1 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > bound receive socket to device: eth1 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > started on port 694 interface eth1 to 192.168.0.1 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > write socket priority set to IPTOS_LOWDELAY on eth0 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > bound send socket to device: eth0 > Feb 11 16:49:53 server-0000 heartbeat: [3671]: info: glib: ucast: > bound receive socket to device: eth0 > Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: > started on port 694 interface eth0 to 10.5.1.3 > Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: > write socket priority set to IPTOS_LOWDELAY on eth1 > Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: > bound send socket to device: eth1 > Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: > bound receive socket to device: eth1 > Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Heartbeat restart > on node server-0000 > Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link > server-0000:eth0 up. > Feb 11 16:49:54 server-0000 heartbeat: [3671]: info: glib: ucast: > started on port 694 interface eth1 to 192.168.0.2 > Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for > node server-0000: status init > Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Status update for > node server-0000: status up > Feb 11 16:49:54 server-0000 heartbeat: [3671]: notice: Using watchdog > device: /dev/watchdog > Feb 11 16:49:54 server-0001 harc[28317]: info: Running > /etc/ha.d/rc.d/status status > Feb 11 16:49:54 server-0001 harc[28334]: info: Running > /etc/ha.d/rc.d/status status > Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: all clients are > now paused > Feb 11 16:49:54 server-0001 heartbeat: [3483]: info: Link > server-0000:eth1 up. > Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: > G_main_add_SignalHandler: Added signal handler for signal 17 > Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Local status now > set to: 'up' > Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link > server-0001:eth0 up. > Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Status update for > node server-0001: status active > Feb 11 16:49:55 server-0000 harc[3772]: info: Running > /etc/ha.d/rc.d/status status > Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Link > server-0001:eth1 up. > Feb 11 16:49:55 server-0000 heartbeat: [3671]: info: Comm_now_up(): > updating status to active > Feb 11 16:49:55 server-0001 heartbeat: [3483]: info: Status update for > node server-0000: status active > Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local status now > set to: 'active' > Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN: > G_CH_dispatch_int: Dispatch function for read child took too long to > execute: 620 ms (> 50 ms) (GSource: 0x982dc38) > Feb 11 16:49:56 server-0000 heartbeat: [3671]: WARN: > G_CH_dispatch_int: Dispatch function for read child was delayed 620 ms > (> 500 ms) before being called (GSource: 0x982dd68) > Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: > G_CH_dispatch_int: started at 429426259 should have started at 429426197 > Feb 11 16:49:55 server-0001 harc[28350]: info: Running > /etc/ha.d/rc.d/status status > Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource > transition completed. > Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: remote resource > transition completed. > Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Local Resource > acquisition completed. (none) > Feb 11 16:49:56 server-0000 heartbeat: [3671]: info: Initial resource > acquisition complete (T_RESOURCES(them)) > Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: remote resource > transition completed. > Feb 11 16:49:56 server-0001 heartbeat: [3483]: info: all clients are > now resumed > >