On Thu, Dec 27, 2007 at 12:29:56PM -0000, Rodrigo Pereira wrote:
> Hello,
>
> My cluster had a hiccup today. Primary node was manually soft-rebooted by
> someone, and DRBD on secondary node was on a loop trying to start.
>
> I browsed the Filesystem RA, and I see it uses fuser to try and remove
> processes attached to the fs. Obviously this didn't work, as fuser does not
> return 0. So I ask, what kind of known or hypothetical situation could
> originate this problem? I believe I had a bash planted on the fs, but that
> shouldn't be a problem for fuser -k. I'm also sending this to DRBD list,
> maybe it's something just DRBD related.
>
> Logs showed this when the cluster was shutting down on the primary node:
>
> Filesystem[10970]: 2007/12/27_09:52:27 INFO: Running stop for
> /dev/drbd0 on /drbd0
> Filesystem[10970]: 2007/12/27_09:52:27 INFO: Trying to unmount /drbd0
> lrmd[4900]: 2007/12/27_09:52:27 info: RA output: (fs0:stop:stderr) umount:
> /drbd0: device is busy
> umount: /drbd0: device is busy
>
> [... trying to umount several times with fuser and SIGTERM/KILL signals ...]
>
> Filesystem[10970]: 2007/12/27_09:52:32 ERROR: Couldn't unmount /drbd0;
> trying cleanup with SIGKILL
> Filesystem[10970]: 2007/12/27_09:52:32 INFO: No processes on /drbd0
> were signalled
> Filesystem[10970]: 2007/12/27_09:52:33 ERROR: Couldn't unmount /drbd0,
> giving up!
> lrmd[4900]: 2007/12/27_09:52:34 WARN: Exiting fs0:stop process 10970
> returned rc 1.
> crmd[4903]: 2007/12/27_09:52:34 ERROR: process_lrm_event: LRM operation
> fs0_stop_0 (call=83, rc=1) Error unknown error

It could be any of the following.

Unlikely, but possible:
  some strangely misconfigured udev stuff
  fast respawning processes which use the fs on drbd
  some devicemapper stuff (device mappings of parts of drbd)

More likely:
  some virtualization stuff
    (different mount name spaces, this fs still visible in some of the guests)
  some nfs stuff (lockd still holding references)
  some unix domain socket with relative path name
    still open within the fs on drbd

To understand the last one, here is an easy way to reproduce it:

  MPT=/drbd0
  cd "$MPT"; ssh-agent -a ./demo-socket-undetectable-by-fuser
  cd / ; fuser -vm "$MPT"

You will then only be able to unmount $MPT if you kill that ssh-agent,
even though it will not show up in fuser.

Any program that holds open a unix-domain socket can trigger this.
Typically, programs like "postfix" still have their cwd on the same fs,
and are thus detected by fuser. But if they chdir to /, and hold no other
fd references to that fs, there is no tool available to detect which
process is still using the file system.

Apart from looking at the full process list, and "knowing" which processes
might have used the fs in question, there is nothing that could help here
short of a rewrite/extension of the unix-domain socket code. :)

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
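
Editor's sketch, not part of the original message: the manual search through
the process list described above can be narrowed somewhat on Linux. This
assumes /proc/net/unix reports a socket's path exactly as it was given at
bind time, so that a socket bound with a relative path shows up with a name
that does not start with "/". The function names below are hypothetical
helpers, not part of fuser, DRBD, or the Filesystem RA. The result is only a
list of candidates: a relative path does not say which file system it was
bound on, and reading other users' /proc/PID/fd entries needs root.

```shell
#!/bin/sh
# Hypothetical helpers: list processes holding unix-domain sockets that were
# bound with a relative path. These are candidates only, not proof that a
# process pins a particular mount point.

# Print "inode path" for every socket bound to a relative path.
# $1: file in /proc/net/unix format (defaults to /proc/net/unix).
# Abstract sockets (shown with a leading "@") and unnamed sockets are skipped.
# Limitation: a path containing spaces is cut at the first space.
relative_unix_sockets() {
    awk 'NR > 1 && NF >= 8 {
             first = substr($8, 1, 1)
             if (first != "/" && first != "@") print $7, $8
         }' "${1:-/proc/net/unix}"
}

# Print the PIDs that hold an open fd on the socket with inode $1.
pids_for_socket_inode() {
    for fd in /proc/[0-9]*/fd/*; do
        if [ "$(readlink "$fd" 2>/dev/null)" = "socket:[$1]" ]; then
            pid=${fd#/proc/}
            echo "${pid%%/*}"
        fi
    done | sort -un
}

# Print "pid path" for every process holding such a socket.
list_relative_socket_holders() {
    relative_unix_sockets "$@" | while read -r inode path; do
        for pid in $(pids_for_socket_inode "$inode"); do
            echo "$pid $path"
        done
    done
}
```

After sourcing the file, running list_relative_socket_holders as root should,
under the assumption above, list the ssh-agent from the reproduction together
with ./demo-socket-undetectable-by-fuser. Whether a listed process really is
the one blocking the unmount still has to be judged by hand.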