[DRBD-user] Apps cannot write for 15 seconds during failover

Lars Ellenberg lars.ellenberg at linbit.com
Thu Jan 3 00:40:14 CET 2008



On Wed, Jan 02, 2008 at 03:41:27PM +0000, Michael Toler wrote:
> (I buried this question in another post before the holidays and it didn't
> seem to get any response, so I'm trying again.)
> 
> System:
>        2  IBM Xeon blades as NFS/DRBD log servers set up as
>           Primary/Secondary.
>       20+ Blades running various packages (JBoss, Jabber, 
>           Postgres, etc) logging 
>           to the NFS log server.
>       All blades running RH 4 ES.
> 
> Right now I have all of my processes connecting to the Primary 
> side of my DRBD setup.  I'm using this setup as a log server for 
> processes on multiple blades on a blade server.  
> 
> When I do the HA switchover, my processes lock for about 15 seconds 
> (when using soft mounted NFS directories) before continuing to write 
> to their log files on the secondary server.  I had thought
>  that since I moved the /var/lib/nfs directory to my drbd shared 
> directory that the failover would be pretty much instantaneous.  
> Do I have a problem in my setup somewhere or is 15 seconds for 
> failover fairly standard?

your problem is most likely not the switchover time of drbd
(that should be virtually instantaneous in such a setup),
nor that of the nfs server,
but rather an ip switchover side effect,
something like an arp table update timeout on the clients,
or "mac spoofing protection" on the switch.

or it could be due to some nfs retry timeout magic,
which in turn depends on the nfs versions in use.
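
if it is the arp issue, the usual fix is to make sure a gratuitous arp
is sent for the service ip right after takeover (heartbeat's IPaddr
resource agent should already do the equivalent via send_arp).
a minimal sketch of doing it by hand, with interface and ip as
placeholders:

   # announce the new location of the service ip to clients and switch
   # (eth0 and 10.0.0.10 are placeholders for your interface and service ip)
   arping -U -I eth0 -c 3 10.0.0.10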

btw, "soft" nfs mounts is a _really_ bad idea, imho.
nice summary descriptions of the reasons can be found in
http://nfs.sourceforge.net/#faq_e4	and
http://unix.derkeiler.com/Newsgroups/comp.unix.solaris/2005-10/1566.html

I suggest using hard,intr (if you really want the intr, that is),
and playing with the timeo and retrans parameters.
also see whether switching between tcp and udp helps meet your expectations.
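
for illustration, such a mount could look like the /etc/fstab line
below.  server name, export path, mount point and the exact
timeo/retrans values are placeholders, not a recommendation:

   # hard, interruptible nfs mount over tcp with explicit retry tuning
   # (loghost:/export/logs, /var/log/remote and the numbers are examples only)
   loghost:/export/logs  /var/log/remote  nfs  hard,intr,tcp,timeo=600,retrans=2  0 0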

for "normal nfs failover" btw, 15 seconds is not that bad.

with proper configuration and hardware, readily achievable *switchover*
times with nfs on drbd should be <= 5 seconds. depending on actual io
load, vfs tuning, amount of dirty pages at the exact time of the
switchover (influencing time to umount), and possibly other tuning,
maybe even less.

*FAILOVER* is a different matter, since a failover is only triggered
after your cluster manager decides that a node failure has happened,
which is normally done when cluster heartbeats cease to be answered for
a certain deadtime timeout, typically in the range of 10 to 30 seconds.
failover actions -- potentially stonith, starting of services, necessary
service data recovery like journal replays... -- are only taken after
this deadtime, so any *failover* time will be noticeably larger than
any *switchover* time, where services are intentionally and cleanly
stopped/switched/started.
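
for illustration only: if your cluster manager is heartbeat, the
relevant knobs live in /etc/ha.d/ha.cf and could look roughly like
this (the values just show the trade-off, they are not a recommendation):

   # sketch of heartbeat timing parameters
   keepalive 1     # send a cluster heartbeat every second
   warntime  5     # log a warning when heartbeats come in late
   deadtime 10     # declare the peer dead after 10 seconds of silence
   initdead 60     # be more patient right after cluster start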

switchover time:
   time to stop services
      unconfig ip (instantaneous),
      unconfig nfs (instantaneous (?)),
      umount fs (few ms to some seconds),
      drbd secondary (normally almost instantaneous,
         under certain circumstances a few seconds)
 + time to start services
      drbd primary (instantaneous in this scenario),
      mount fs (few ms in this scenario),
      config nfs (instantaneous),
      config ip
 + time for clients to notice the ip change and "reconnect"
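
to make the switchover steps above concrete, what the resource scripts
do amounts to roughly the following commands.  resource name, device,
mount point and ip are placeholders in this sketch:

   # on the node giving up the service
   ip addr del 10.0.0.10/24 dev eth0     # unconfig service ip
   service nfs stop                      # unconfig nfs
   umount /srv/nfs                       # writes out dirty pages, may take a moment
   drbdadm secondary r0                  # demote the drbd resource

   # on the node taking over
   drbdadm primary r0                    # promote the drbd resource
   mount /dev/drbd0 /srv/nfs             # mount the filesystem
   service nfs start                     # re-export (statd/lock state is on the shared fs)
   ip addr add 10.0.0.10/24 dev eth0     # config service ip; clients reconnect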

failover time:
   time to notice node failure (typically several seconds)
 + optionally time for any fencing operations
 + time to start services
      the same as above,
       + recovery time during mount,
         which is very hard to predict,
         but depends on fs type and size

(failover time) - (switchover time)
	- time to stop services
		which should be negligible in this case
	+ time to notice node failure
		which is configurable, but several seconds
	+ time to do fencing
	+ time to do data recovery
		both durations are difficult to guess

so my guess is that for *failover*,
you have to look into your cluster manager's deadtime configuration,
and also tune the various other timeouts involved, both in your cluster
manager and in drbd, to be as short as possible but as long as necessary.
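
on the drbd side, the relevant timeouts sit in the net section of
drbd.conf; a sketch with ballpark values (option names and defaults may
differ slightly between drbd versions, check drbd.conf(5) for yours):

   net {
       timeout      60;   # peer must answer within 6 seconds (unit is 0.1s)
       connect-int  10;   # seconds between connection retries
       ping-int     10;   # seconds between keep-alive pings
   }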

But don't fall into the trap of configuring timeouts
as short as you would wish them to be!  :)
or you will see spurious "node dead, trying to take over" events,
which is annoying at best.

-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.
