[DRBD-user] Heartbeat Filesystem resource agent (on top of DRBD RA) unmounting failure

Lars Ellenberg lars.ellenberg at linbit.com
Thu Dec 27 20:43:14 CET 2007


On Thu, Dec 27, 2007 at 12:29:56PM -0000, Rodrigo Pereira wrote:
> Hello,
> 
> My cluster had a hiccup today. Primary node was manually soft-rebooted by
> someone, and DRBD on secondary node was in a loop trying to start.
> 
> I browsed the Filesystem RA, and I see it uses fuser to try and remove
> processes attached to the fs. Obviously this didn't work, as fuser does not
> return 0. So I ask, what kind of known or hypothetical situation could
> originate this problem? I believe I had a bash planted on the fs, but that
> shouldn't be a problem for fuser -k. I'm also sending this to DRBD list,
> maybe it's something just DRBD related.
> 
> Logs showed this when the cluster was shutting down on the primary node:
> 
> Filesystem[10970]:      2007/12/27_09:52:27 INFO: Running stop for
> /dev/drbd0 on /drbd0
> Filesystem[10970]:      2007/12/27_09:52:27 INFO: Trying to unmount /drbd0
> lrmd[4900]: 2007/12/27_09:52:27 info: RA output: (fs0:stop:stderr) umount:
> /drbd0: device is busy
> umount: /drbd0: device is busy
> 
> [... trying to umount several times with fuser and SIGTERM/KILL signals ...]
> 
> Filesystem[10970]:      2007/12/27_09:52:32 ERROR: Couldn't unmount /drbd0;
> trying cleanup with SIGKILL
> Filesystem[10970]:      2007/12/27_09:52:32 INFO: No processes on /drbd0
> were signalled
> Filesystem[10970]:      2007/12/27_09:52:33 ERROR: Couldn't unmount /drbd0,
> giving up!
> lrmd[4900]: 2007/12/27_09:52:34 WARN: Exiting fs0:stop process 10970
> returned rc 1.
> crmd[4903]: 2007/12/27_09:52:34 ERROR: process_lrm_event: LRM operation
> fs0_stop_0 (call=83, rc=1) Error unknown error

could be
 unlikely, but possible:
 some strangely misconfigured udev stuff
 fast respawning processes which use the fs on drbd
 some devicemapper stuff (device mappings of parts of drbd)

 more likely:
 some virtualization stuff (different mount name spaces, this fs still
 visible in some of the guests)
 some nfs stuff (lockd still holding references)
 some unix domain socket bound to a relative path name, still open within the fs on drbd

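to understand the last one, here is an easy way to reproduce it:
  MPT=/drbd0
  # bind a unix-domain socket via a relative path inside the mounted fs
  cd "$MPT" && ssh-agent -a ./demo-socket-undetectable-by-fuser
  # leave the fs, then list the processes fuser considers to be using it
  cd / && fuser -vm "$MPT"

you will then only be able to unmount $MPT once you kill that ssh-agent,
even though it does not show up in fuser.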

any program that holds open a unix-domain socket bound this way can trigger it.
typically, programs like "postfix" still have their cwd on the same fs,
and are thus detected by fuser. but if they chdir /, and hold no other
fd references to that fs, there is no tool available to detect which
process is still using the file system.

apart from looking at the full process list, and "knowing" which
processes might have used the fs in question,
there is nothing that could help here short of a rewrite/extension of
the unix-domain socket code.

 :)
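
to make the "looking at the process list" part a bit less painful on linux,
one heuristic is to list the sockets in /proc/net/unix that were bound to a
relative path, then look up which pids hold that socket inode. this is only
a sketch under assumptions: the function names relative_unix_sockets and
pids_for_socket_inode are made up for this example, it assumes a linux
/proc, a posix shell with awk and readlink, socket paths without spaces,
and root privileges to see other users' fds. it cannot tell you which file
system a relative path belongs to, so the "knowing" part remains.

```shell
# list "inode path" for unix-domain sockets bound to a relative path.
# input is in /proc/net/unix format:
#   Num RefCount Protocol Flags Type St Inode Path
# (hypothetical helper name, not an existing tool)
relative_unix_sockets() {
    # skip the header line; keep rows that have a Path field which is
    # neither absolute ("/...") nor in the abstract namespace ("@...")
    awk 'NR > 1 && NF >= 8 {
             c = substr($8, 1, 1)
             if (c != "/" && c != "@") print $7, $8
         }' "${1:-/proc/net/unix}"
}

# print the pids that hold a given socket inode, by scanning /proc/<pid>/fd.
# (hypothetical helper name; run as root to see all processes)
pids_for_socket_inode() {
    for fd in /proc/[0-9]*/fd/*; do
        [ "$(readlink "$fd" 2>/dev/null)" = "socket:[$1]" ] || continue
        pid=${fd#/proc/}
        echo "${pid%%/*}"
    done | sort -un
}

# example use (commented out so that sourcing this file has no side effects):
# relative_unix_sockets | while read -r inode path; do
#     echo "inode=$inode path=$path pids=$(pids_for_socket_inode "$inode" | tr '\n' ' ')"
# done
```

with the ssh-agent example from above running, the socket should be listed
with its path as given to bind, i.e. ./demo-socket-undetectable-by-fuser,
and the pid lookup should point at the agent.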

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :


