[DRBD-user] Heartbeat Filesystem resource agent (on top of DRBD RA) unmounting failure
lars.ellenberg at linbit.com
Thu Dec 27 20:43:14 CET 2007
On Thu, Dec 27, 2007 at 12:29:56PM -0000, Rodrigo Pereira wrote:
> My cluster had a hiccup today. The primary node was manually soft-rebooted by
> someone, and DRBD on the secondary node was stuck in a loop trying to start.
> I browsed the Filesystem RA, and I see it uses fuser to try to remove
> processes attached to the fs. Obviously this didn't work, as fuser does not
> return 0. So I ask, what kind of known or hypothetical situation could
> cause this problem? I believe I had a bash shell sitting on the fs, but that
> shouldn't be a problem for fuser -k. I'm also sending this to the DRBD list;
> maybe it's something DRBD related.
> Logs showed this when the cluster was shutting down on the primary node:
> Filesystem: 2007/12/27_09:52:27 INFO: Running stop for
> /dev/drbd0 on /drbd0
> Filesystem: 2007/12/27_09:52:27 INFO: Trying to unmount /drbd0
> lrmd: 2007/12/27_09:52:27 info: RA output: (fs0:stop:stderr) umount:
> /drbd0: device is busy
> umount: /drbd0: device is busy
> [... trying to umount several times with fuser and SIGTERM/KILL signals ...]
> Filesystem: 2007/12/27_09:52:32 ERROR: Couldn't unmount /drbd0;
> trying cleanup with SIGKILL
> Filesystem: 2007/12/27_09:52:32 INFO: No processes on /drbd0
> were signalled
> Filesystem: 2007/12/27_09:52:33 ERROR: Couldn't unmount /drbd0,
> giving up!
> lrmd: 2007/12/27_09:52:34 WARN: Exiting fs0:stop process 10970
> returned rc 1.
> crmd: 2007/12/27_09:52:34 ERROR: process_lrm_event: LRM operation
> fs0_stop_0 (call=83, rc=1) Error unknown error
unlikely, but possible:
 - some strangely misconfigured udev stuff
 - fast respawning processes which use the fs on drbd
 - some devicemapper stuff (device mappings of parts of drbd)
 - some virtualization stuff (different mount name spaces, this fs still
   visible in some of the guests)
 - some nfs stuff (lockd still holding references)
 - some unix domain socket with relative path name still open within the
   fs on drbd
to understand the last one, here is an easy way to reproduce it
($MPT being the mount point of the fs on drbd):
  cd "$MPT" && ssh-agent -a ./demo-socket-undetectable-by-fuser
  cd / && fuser -vm "$MPT"
you will then only be able to unmount $MPT once you kill that ssh-agent,
even though it does not show up in the fuser output.
any program that holds open a unix-domain socket with a path name within
that fs can trigger this.
typically, programs like "postfix" still have their cwd on the same fs,
and are thus detected by fuser. but if they chdir to /, and hold no other
fd references to that fs, there is no tool available to detect which
process is still using the file system.
apart from looking at the full process list, and "knowing" which
processes might have used the fs in question,
nothing can help here short of a rewrite/extension of
the unix-domain socket code.
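
a rough sketch that may help with the "looking at the process list" part, not
something from the setup discussed here: on linux, /proc/net/unix lists bound
unix-domain sockets with the path name as it was passed to bind(), so relative
names stand out. the two helper names below are made up for illustration. this
only narrows down the candidates; it cannot tell which file system a relative
name belongs to, so it is no substitute for the detection tool that is missing.

```shell
# hypothetical helper: print "inode path" for every unix-domain socket whose
# bound path name is relative (neither absolute "/..." nor abstract "@...").
# reads /proc/net/unix by default; a file in the same format can be passed
# instead. assumption: path names contain no whitespace (awk splits on it).
list_relative_unix_sockets() {
    awk 'NR > 1 && NF >= 8 && $8 !~ "^[/@]" { print $7, $8 }' \
        "${1:-/proc/net/unix}"
}

# hypothetical helper: print the pids that hold an fd on the socket with the
# given inode number. needs root to look into other users' processes.
holders_of_socket_inode() {
    for fd in /proc/[0-9]*/fd/*; do
        if [ "$(readlink "$fd" 2>/dev/null)" = "socket:[$1]" ]; then
            echo "$fd" | cut -d/ -f3
        fi
    done | sort -u
}
```

usage would be something like running list_relative_unix_sockets first, then
holders_of_socket_inode with each inode it prints, and checking by hand
whether those processes could have been started from within $MPT.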
: Lars Ellenberg Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :