Jens Dreger
Tue Jun 8 22:10:02 CEST 2004

I'm trying to set up a drbd+heartbeat NFS-server. Most things work fine,
but if I write to the NFS storage during failover, I get a Stale NFS
file handle error:

root:~> cp /tmp/large_file /mnt
cp: writing `/mnt/large_file': Stale NFS file handle
cp: closing `/mnt/large_file': Stale NFS file handle

I can get rid of this error if I insert a short delay before
taking over the IP:

  node1 datadisk::drbd0 nfs-kernel-server nfs-common \
      wait_n_seconds::5 IPaddr::

(wait_n_seconds::5 just sleeps for 5 seconds.) Putting the IP address
first, as suggested in
http://www.slackworks.com/~dkrovich/DRBD/heartbeat.html, doesn't work
at all.
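
A delay resource of this kind might look roughly like the sketch
below. It is not the actual script used here. It assumes heartbeat
calls resource scripts as "script <arg> start|stop|status", and the
"stopped" status reply and the install location (for example
/etc/ha.d/resource.d/wait_n_seconds) are assumptions as well.

```shell
#!/bin/sh
# Sketch of a heartbeat resource script that only delays startup.
# Assumed calling convention: wait_n_seconds <seconds> start|stop|status

wait_n_seconds() {
    secs=$1
    action=$2

    # Accept only a non-negative integer number of seconds.
    case "$secs" in
        ''|*[!0-9]*)
            echo "usage: wait_n_seconds <seconds> start|stop|status" >&2
            return 1
            ;;
    esac

    case "$action" in
        start)
            # Hold back the resources listed after this one (e.g. the IP).
            sleep "$secs"
            ;;
        stop)
            # Nothing to release on stop.
            ;;
        status)
            # Assumption: always report "stopped" so start is never skipped.
            echo "stopped"
            ;;
        *)
            echo "usage: wait_n_seconds <seconds> start|stop|status" >&2
            return 1
            ;;
    esac
}

# Run only when called with arguments, as heartbeat would do.
if [ "$#" -gt 0 ]; then
    wait_n_seconds "$@"
fi
```

With the haresources line above, heartbeat would run this after the
NFS services are up and before the IP address is taken over.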

drbd/documentation/NFS-Server-README.txt suggests removing any
"exportfs -au" from the NFS init scripts. But then heartbeat can no
longer unmount the drbd device on failover, which leads to a reboot
of the primary node followed by a re-sync:

  node1 datadisk: ===> datadisk drbd0 stop <===
  node1 datadisk: 'drbd0' /dev/nbd/0 is mounted on /drbd/0, trying to unmount
  node1 datadisk: 'drbd0' trying to kill users of /dev/nbd/0
  node1 datadisk: fuser -k -m /dev/nbd/0
  node1 datadisk: umount -v /dev/nbd/0
  node1 heartbeat: CRIT: Resource STOP failure. Reboot required!
  node1 heartbeat: CRIT: Killing heartbeat ungracefully!

This behaviour can be reproduced by:

  root:~> mount /dev/hda3 /mountpoint
  root:~> exportfs -vi node1:/mountpoint
  exporting node1.physik.fu-berlin.de:/mountpoint
  exporting node1.physik.fu-berlin.de:/mountpoint to kernel
  root:~> umount /mountpoint
  umount: /mountpoint: device is busy
  umount: /mountpoint: device is busy
  root:~> fuser -k -m /mountpoint
  root:~>				[NO OUTPUT]

After issuing an "exportfs -au" the filesystem can be unmounted:

  root:~> exportfs -au
  root:~> umount /mountpoint
  root:~>				[WORKS]
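
The sequence above can be wrapped in a small helper, sketched below.
The function name stop_export_and_umount is made up for illustration.
The sketch assumes, as the transcript suggests, that the export is
held by the kernel NFS server rather than by a process, so that
"fuser -k" has nothing to kill and only "exportfs -au" releases the
filesystem.

```shell
#!/bin/sh
# Sketch: unmount a filesystem that may still be exported via kernel NFS.
# stop_export_and_umount is a hypothetical helper, not part of drbd
# or heartbeat.

stop_export_and_umount() {
    mountpoint=$1

    if [ -z "$mountpoint" ]; then
        echo "usage: stop_export_and_umount <mountpoint>" >&2
        return 2
    fi

    # First attempt: succeeds if nothing holds the filesystem.
    if umount "$mountpoint" 2>/dev/null; then
        return 0
    fi

    # Still busy: unexport everything (as in the transcript), then retry.
    # Note that -au drops ALL exports, not only the one on this mountpoint.
    exportfs -au || return 1
    umount "$mountpoint"
}
```

For example, "stop_export_and_umount /mountpoint" would first try a
plain umount and fall back to "exportfs -au" only if that fails.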

So I cannot understand how the advice given in
NFS-Server-README.txt could have worked.

/var/lib/nfs is on the shared device and, as I said, everything works
fine (no data corruption whatsoever) iff I insert the small delay.
So this is not a big problem, but I would like to understand why
no one else seems to have this problem.

	kernel 2.4.26
	debian woody

Any help is greatly appreciated,


