[DRBD-user] Stale NFS file handle vs. NFS-Server-README.txt

Tue Jun 8 22:58:21 CEST 2004

Hello Jens,

> I'm trying to set up a drbd+heartbeat NFS-server. Most things work fine,
> but if I write to the NFS storage during failover, I get a Stale NFS
> file handle error:
>
> root:~> cp /tmp/large_file /mnt
> cp: writing `/mnt/large_file': Stale NFS file handle
> cp: closing `/mnt/large_file': Stale NFS file handle
>
> I can get rid of this error, if I insert a small amount of time before
> taking over the IP:
>
> [/etc/ha.d/haresources]
>   node1 datadisk::drbd0 nfs-kernel-server nfs-common \
>       wait_n_seconds::5 IPaddr::160.45.32.173
>

[snip]

> /var/lib/nfs is on the shared device and as I said, everything works
> fine (no data corruption whatsoever), iff I insert the small delay.
> So this is not a big problem, but I would like to understand, why
> noone else seems to have this problem.
>

I had the same problem, but solved it in another way. Actually I think this is 
not a drbd issue at all, but rather a heartbeat/nfs/debian problem, so I 
reported it about 2 or 3 weeks ago to the heartbeat and nfs MLs and to the 
debian nfs maintainer, unfortunately without any answer.
I still don't understand what's causing this, but I'm pretty sure that the 
debian nfs-kernel-server script cannot stop the nfs server when nfs is started 
from heartbeat. Just check it yourself: after running '/etc/init.d/heartbeat 
stop', a 'ps ax' should still show running nfs daemons on this system. Those 
nfsd processes can only be killed with 'killall -9 nfsd'. So I think that after 
nfs is stopped, the nfs daemons survive and cause a stale file handle when 
drbd is stopped; probably they also ignore the 'exportfs -au' command.
I'm also still wondering about the way the debian script stops nfs. I 
checked the scripts of several distributions, and there the nfsd's are either 
stopped immediately with signal 9, or first with signal 15 and, after a short 
break, with signal 9. The debian script, however, only stops the daemons with 
signal 1.
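
The check described above (are any nfsd processes left after heartbeat has 
been stopped?) could be scripted roughly as follows. This is only a sketch: 
the function name count_nfsd is made up for this example, and it assumes the 
input is a list of command names, one per line, in which the nfs daemons 
appear as exactly "nfsd" (as they should with 'ps -e -o comm=').

```shell
#!/bin/sh
# Hypothetical helper (not part of any distribution script):
# count the lines on stdin that are exactly "nfsd".
# Intended input: one command name per line, e.g. from `ps -e -o comm=`.
count_nfsd() {
    # grep -c prints the number of matching lines. It exits non-zero
    # when that number is 0, so the status is masked to keep callers
    # that run under `set -e` alive.
    grep -c -x 'nfsd' || true
}
```

After '/etc/init.d/heartbeat stop', running 'ps -e -o comm= | count_nfsd' 
should then print 0 if the daemons were really stopped, and something larger 
if they survived.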

Currently our server is down (trouble with the transtec 5008 ide/scsi raid 
system), so I can't send you my modified nfs-kernel-server script, but I 
think I only added a sleep and nfsd stop signal 9 after the first stop 
signal, so something like this:

...
        start-stop-daemon --stop --oknodo --quiet \
            --name nfsd --user 0 --signal 2

        sleep 5

        start-stop-daemon --stop --oknodo --quiet \
            --name nfsd --user 0 --signal 9
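
The same "gentle signal, short break, then signal 9" idea can be written as a 
generic shell function. This is a sketch of the pattern only, not the Debian 
script and not my modified one: the name stop_with_escalation and its 
arguments are invented for this example, and it works on a single PID, 
whereas the init script above selects the daemons by name with 
start-stop-daemon.

```shell
#!/bin/sh
# Hypothetical helper: stop one process, escalating to SIGKILL.
#   $1 = PID, $2 = first signal (default TERM), $3 = seconds to wait (default 5)
# Note: plain sh has no `local`, so these variables are global.
stop_with_escalation() {
    pid=$1
    sig=${2:-TERM}
    wait_s=${3:-5}

    # Send the gentle signal; if that fails, the process is already gone.
    kill -s "$sig" "$pid" 2>/dev/null || return 0

    # Poll once per second until the process disappears or time runs out.
    i=0
    while [ "$i" -lt "$wait_s" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # Still there after the break: force it with signal 9.
    kill -s KILL "$pid" 2>/dev/null || true
}
```

For example, 'stop_with_escalation 1234 INT 5' would send signal 2 to PID 
1234, wait up to 5 seconds, and then send signal 9 if the process is still 
there.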

However, now I don't understand why it works if you add your 5s break...
Could you please confirm the nfs-stop problem?

Cheers,
	Bernd