[DRBD-user] Stale NFS file handle vs. NFS-Server-README.txt

Wed Jun 9 09:38:50 CEST 2004

Jens,
I had similar problems in a different HA environment (no heartbeat!),
but the mechanisms should be the same. In my first approach I used
ifconfig to control the IP address and exportfs to control the NFS export.
Removing the export immediately after the ifconfig call that removes the
IP address returns leads to these "Stale NFS file handle" problems. A
three-second delay solved this, but it significantly slows down the failover.
This points to asynchronous behavior in ifconfig: apparently there is still
some traffic on this IP address even after ifconfig has already shut down
the interface.
Following some hints found via Google, I added a step that directly
blocks the NFS port using iptables. The procedure for a graceful failover
then follows these steps:
 NodeA: ifconfig down ...
        iptables -D --dport 2049 ...
        exportfs -u ...
 NodeB: exportfs ...
        iptables -A --dport 2049 ...
        ifconfig up ...
Using these steps it works without delay.
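
The ordering above can be sketched as a small shell script. This is only a
sketch under assumptions, not the exact commands from my setup: the interface
alias, the service IP, the export and the iptables rule are hypothetical
placeholders, and it assumes the INPUT policy drops whatever is not explicitly
accepted, so that deleting the ACCEPT rule is what blocks port 2049. By
default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the graceful failover order: release on NodeA, takeover on NodeB.
# All values below are hypothetical placeholders, not from the original setup.
SERVICE_IF="${SERVICE_IF:-eth0:0}"       # interface alias carrying the service IP
SERVICE_IP="${SERVICE_IP:-192.0.2.10}"   # service IP address
EXPORT="${EXPORT:-*:/export/data}"       # client:/path as understood by exportfs
# ACCEPT rule for NFS. Assumes the INPUT policy drops everything else.
# "-p tcp" is an assumption; NFS over UDP would need a matching udp rule.
NFS_RULE="${NFS_RULE:-INPUT -p tcp --dport 2049 -j ACCEPT}"
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# NodeA: take the IP down, block the NFS port, then remove the export.
release() {
    run ifconfig "$SERVICE_IF" down
    run iptables -D $NFS_RULE      # unquoted on purpose: rule splits into words
    run exportfs -u "$EXPORT"
}

# NodeB: same steps in reverse order.
takeover() {
    run exportfs -o rw "$EXPORT"
    run iptables -A $NFS_RULE
    run ifconfig "$SERVICE_IF" "$SERVICE_IP" up
}

case "${1:-}" in
    release)  release ;;
    takeover) takeover ;;
esac
```

Saved under a name such as failover.sh (hypothetical), it would be called with
"release" on NodeA and then "takeover" on NodeB. Setting DRY_RUN=0 executes
the commands instead of printing them.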

/Wolfram


> -----Original Message-----
> From: Jens Dreger [mailto:jens.dreger at physik.fu-berlin.de]
> Sent: Tuesday, June 8, 2004 22:10
> To: drbd-user at linbit.com
> Subject: [DRBD-user] Stale NFS file handle vs. NFS-Server-README.txt
> 
> 
> Hi!
> 
> I'm trying to set up a drbd+heartbeat NFS server. Most things work
> fine, but if I write to the NFS storage during failover, I get a Stale
> NFS file handle error:
> 
> root:~> cp /tmp/large_file /mnt
> cp: writing `/mnt/large_file': Stale NFS file handle
> cp: closing `/mnt/large_file': Stale NFS file handle
> 
> I can get rid of this error if I insert a short delay before
> taking over the IP:
> 
> [/etc/ha.d/haresources]
>   node1 datadisk::drbd0 nfs-kernel-server nfs-common \
>       wait_n_seconds::5 IPaddr::160.45.32.173
> 
> (wait_n_seconds::5 just sleeps for 5 seconds). Putting the ip in front
> as suggested in 
> http://www.slackworks.com/~dkrovich/DRBD/heartbeat.html
> doesn't work at all.
> 
> drbd/documentation/NFS-Server-README.txt suggests removing any
> "exportfs -au" from the NFS init scripts. But then heartbeat can no
> longer unmount the drbd device on failover, which leads to a reboot
> of the primary node followed by a re-sync.
> 
>   node1 datadisk: ===> datadisk drbd0 stop <===
>   node1 datadisk: 'drbd0' /dev/nbd/0 is mounted on /drbd/0, trying to unmount
>   node1 datadisk: 'drbd0' trying to kill users of /dev/nbd/0
>   node1 datadisk: fuser -k -m /dev/nbd/0
>   node1 datadisk: umount -v /dev/nbd/0
>   node1 heartbeat: CRIT: Resource STOP failure. Reboot required!
>   node1 heartbeat: CRIT: Killing heartbeat ungracefully!
> 
> This behaviour can be reproduced by:
> 
>   root:~> mount /dev/hda3 /mountpoint
>   root:~> exportfs -vi node1:/mountpoint
>   exporting node1.physik.fu-berlin.de:/mountpoint
>   exporting node1.physik.fu-berlin.de:/mountpoint to kernel
>   root:~> umount /mountpoint
>   umount: /mountpoint: device is busy
>   umount: /mountpoint: device is busy
>   root:~> fuser -k -m /mountpoint
>   root:~>				[NO OUTPUT]
> 
> After issuing an "exportfs -au" the filesystem can be unmounted:
> 
>   root:~> exportfs -au
>   root:~> umount /mountpoint
>   root:~>				[WORKS]
> 
> Thus I cannot understand how the advice given in
> NFS-Server-README.txt could have worked.
> 
> /var/lib/nfs is on the shared device and as I said, everything works
> fine (no data corruption whatsoever), iff I insert the small delay.
> So this is not a big problem, but I would like to understand why
> no one else seems to have this problem.
> 
> Using:
> 	drbd-0.6.12
> 	heartbeat-1.2.2
> 	kernel 2.4.26
> 	debian woody
> 
> Any help is greatly appreciated,
> 
> Jens.
> 
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>