[DRBD-user] State change failed: (-12) Device is held open by someone

Wed Aug 13 22:43:54 CEST 2008

Hello,

I am not running NFS, but I am seeing the same problem as described
below. In my case the problem is intermittent: most failovers are
successful, but every so often the "device is held open" error occurs.

As mentioned below, heartbeat will reboot the active machine when this
problem occurs and proceed with the failover to the standby. Everything
works fine after the reboot; the machine comes back up in secondary
state as expected. However, I'd like to fix the problem and thus prevent
the reboot from occurring.

I am able to reproduce the problem in an environment where the machine
does not reboot (it stays in the "device is held open" state). At that
point I run "lsof", "fuser -mv", and "ps", and look at /proc/mounts,
but I cannot figure out what is holding the DRBD device open. I am
running LVM on top of DRBD. As far as I can tell, my application has
shut down cleanly, the file systems have been unmounted, and
the LVM volume group has been deactivated.
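
As far as I understand, lsof and fuser only report user-space processes,
so a holder inside the kernel (for example a device-mapper mapping left
behind by LVM) would not appear in their output. Below is a minimal
sketch of one extra check. It assumes the kernel exposes
/sys/block/<device>/holders and that the DRBD device is named drbd1.
list_holders and SYSFS_ROOT are hypothetical names used only for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: list kernel-level holders of a block device via sysfs.
# SYSFS_ROOT can be overridden so the function can be tried on a fake tree.
list_holders() {
    dev="$1"                          # device name, e.g. drbd1 (assumed)
    root="${SYSFS_ROOT:-/sys}"
    dir="$root/block/$dev/holders"
    [ -d "$dir" ] || { echo "no holders directory for $dev" >&2; return 1; }
    for h in "$dir"/*; do
        [ -e "$h" ] || continue       # empty directory: the glob matched nothing
        name=$(basename "$h")
        # Newer kernels expose the device-mapper mapping name in dm/name.
        if [ -r "$h/dm/name" ]; then
            echo "$name ($(cat "$h/dm/name"))"
        else
            echo "$name"
        fi
    done
}

# Example call on the affected node (device name is an assumption):
#   list_holders drbd1
```

If this prints a dm-N entry after the volume group is supposedly
deactivated, "dmsetup ls" and "dmsetup info -c" should show which
mapping is still present and what its open count is.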

I looked at the DRBD source code in an attempt to understand how it
determines that the device is held open. It looks like it is based on an
internal count of open devices. 

Any tips or suggestions on how to debug the problem further would be
greatly appreciated.

I am running DRBD version 8.0.11 and heartbeat version 2.1.3.

Thanks,
Chris 

> 
> Hi Guys,
> 
> I'm attempting to set up high availability nfs with heartbeat and DRBD
> as the block device. My problem arises when attempting to fail over
> from master->slave.
> 
> The process is as follows:
> 
> Kill nfsd with signal 9 (to prevent state/lock file deletion).
> Unmount the drbd block device
> Change drbd state to secondary
> 
> It is in this last step that I'm facing issues, and receiving this
> error:
> 
> hanfs1:~# drbdadm secondary export
> /dev/drbd1: State change failed: (-12) Device is held open by someone
> Command 'drbdsetup /dev/drbd1 secondary' terminated with exit code 11
> 
> There are no other services running on these machines, and looking at
> lsof output, the only things that are accessing anything drbd related
> are the worker, receiver and asender.
> 
> hanfs1:~# lsof | grep drbd
> drbd1_wor 2797 root  cwd       DIR  254,0    4096  128 /
> drbd1_wor 2797 root  rtd       DIR  254,0    4096  128 /
> drbd1_wor 2797 root  txt   unknown                     /proc/2797/exe
> drbd1_rec 2802 root  cwd       DIR  254,0    4096  128 /
> drbd1_rec 2802 root  rtd       DIR  254,0    4096  128 /
> drbd1_rec 2802 root  txt   unknown                     /proc/2802/exe
> drbd1_ase 2808 root  cwd       DIR  254,0    4096  128 /
> drbd1_ase 2808 root  rtd       DIR  254,0    4096  128 /
> drbd1_ase 2808 root  txt   unknown                     /proc/2808/exe
> 
> Is there any way to work around this locking - forcefully or otherwise?
> Heartbeat is resorting to hard rebooting the machine when failing over
> due to this message. Alternatively, has anyone successfully resolved
> this issue when utilizing drbd in a ha-nfs setting?
> 
> Many thanks
> 
> --
> Tony Dodd
> Last.fm | http://www.last.fm
> Karen House 1-11 Baches Street
> London N1 6DL
> 
> check out my music taste at:
> http://www.last.fm/user/hawkeviper
>