[DRBD-user] Re: Unable to make DRBD Resource Secondary

Mon Nov 14 15:32:46 CET 2005

Lars Ellenberg wrote:
> 
> / 2005-11-10 15:55:20 +0100
> \ Lars Marowsky-Bree:
> > On 2005-11-10T09:36:33, Todd Denniston <Todd.Denniston at ssa.crane.navy.mil> wrote:
> >
> > > I really hate when I am attempting to manually do a failover[1] between
> > > machines and it fails when the original primary can't seem to let go of a
> > > drbd resource because of something in the kernel holding on (aka lsof &
> > > fuser can't find anything).
> > >
> > > [1] issue `service heartbeat stop`  on redhat/fedora.
> >
> > That really shouldn't happen.
> >
> > I trust that you are quite aware of how to use fuser/lsof and were
> > looking for the right things in there... fuser w/ and w/o -m on the drbd
> > block device _ought_, in theory, to list all the files. And if it is not
> > mounted (check via /proc/mounts), NFS shouldn't be able to have any
> > hidden references to it either.
> >
> > If all these predicates are right and it still can't set the device to
> > secondary mode claiming something has the device opened, that would be a
> > bug.
> >
> 
> well. "should". "ought".
> in the real world, facts are sometimes different.
> 
> we had occasionally the case that neither fuser nor lsof list anything.
> it had been unmounted. still it refused to become secondary.
> 
> stopping nfs-kernel-server (and related statd, lockd or whatever)
> made it possible, though. so there are cases where nfs (or related
> daemons) in kernel space hold references to a device, that user space
> tools won't see.
> 

My haresources controls the NFS services, i.e., the nfs server and the
nfslock server. On heartbeat stop, it should stop nfs[0][1] and then
nfslock[0][2]. So, from what I am reading here, I would think that the NFS
servers should have released the devices by the time datadisk gets a chance
to call umount. Or have I misunderstood what you were writing?

Also, as Tom asked: what debug information should we supply to help you
debug the problem, and how do we trap that data the next time it happens?
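
One way to trap that data might look like the sketch below. It is only an
illustration under assumptions: the device name /dev/drbd0, the log path,
and the helper names (is_mounted, run_if_present, capture_state) are made up
for this example, and the set of commands is a guess at what could be
useful, not a list the DRBD developers have asked for.

```shell
#!/bin/sh
# Sketch: snapshot the state around a DRBD device that refuses to go secondary.
# Device and log names used here are illustrative, not taken from this thread.

# is_mounted DEV [MOUNTS_FILE]: succeed if DEV is the first field of any line.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# run_if_present LABEL CMD [ARGS...]: log the command's output, or note that
# the tool is missing; one failing command never aborts the capture.
run_if_present() {
    label=$1; shift
    echo "=== $label ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "($1 not found)"
    fi
}

# capture_state DEV OUT: append one timestamped snapshot to the file OUT.
capture_state() {
    dev=$1; out=$2
    {
        echo "##### $(date) device=$dev"
        if is_mounted "$dev"; then echo "mounted: yes"; else echo "mounted: no"; fi
        run_if_present "drbd status"     cat /proc/drbd
        run_if_present "mount table"     cat /proc/mounts
        run_if_present "fuser on device" fuser -v "$dev"
        run_if_present "fuser -m"        fuser -vm "$dev"
        run_if_present "lsof on device"  lsof "$dev"
        run_if_present "nfs exports"     exportfs -v
        run_if_present "process list"    ps ax
    } >>"$out"
}

# Example: capture_state /dev/drbd0 /var/tmp/drbd-stuck.log
```

Running capture_state right after a failed attempt to make the resource
secondary, and again after stopping the NFS services by hand, would give two
snapshots to compare.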

[0] These are the names Red Hat/Fedora uses to control the NFS services and,
from what I could see when I set up the machines, they matched what the SUSE
nfs script did.

[1] nfs service takedown is:
killproc rpc.mountd
killproc nfsd
rm -f /var/lock/subsys/nfs

[2] nfslock service takedown is:
killproc rpc.statd
rm -f /var/lock/subsys/nfslock
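
To check whether the two takedowns above really leave nothing NFS-related
running, something like the following sketch could be run before datadisk
calls umount. The function name leftover_nfs_procs is hypothetical, and the
process names it looks for (nfsd, lockd, rpc.mountd, rpc.statd, with or
without the square brackets some ps formats put around kernel threads) are
an assumption about what shows up in the process list.

```shell
#!/bin/sh
# leftover_nfs_procs: read process names (one per line) on stdin and print
# the ones that look NFS-related. Succeeds only if at least one is found.
leftover_nfs_procs() {
    grep -E '^\[?(nfsd|lockd|rpc\.mountd|rpc\.statd)\]?$'
}

# Example, after `service nfs stop` and `service nfslock stop`:
#   ps -e -o comm= | leftover_nfs_procs && echo "NFS pieces still running"
```

An empty result would only show that no such process is listed; it would
not by itself prove that the kernel has dropped every reference to the
device.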

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter