[DRBD-user] Pacemaker could not switch drbd nodes

Tue Mar 27 07:35:45 CEST 2018

Hi,

On Fri, Mar 23, 2018 at 9:01 AM, Lozenkov Sergei <slozenkov at gmail.com>
wrote:

> Hello.
> I have two Debian 9 servers running Corosync, Pacemaker and DRBD.
> Everything worked well for a month.
> After some server issues (with reboots), Pacemaker can no longer switch
> the DRBD node and logs errors like these:
>
> Mar 16 06:25:11 [877] nfs01-az-eus.tech-corps.com       lrmd:   notice:
> operation_finished:     drbd_nfs_stop_0:3667:stderr [ 1: State change
> failed: (-12) Device is held open by someone ]
>
> Mar 16 06:25:11 [877] nfs01-az-eus.tech-corps.com       lrmd:   notice:
> operation_finished:     drbd_nfs_stop_0:3667:stderr [ Command 'drbdsetup-84
> secondary 1' terminated with exit code 11 ]
>
> Mar 16 06:25:11 [877] nfs01-az-eus.tech-corps.com       lrmd:     info:
> log_finished:   finished - rsc:drbd_nfs action:stop call_id:47 pid:3667
> exit-code:1 exec-time:20002ms queue-time:0ms
>
> Mar 16 06:25:11 [880] nfs01-az-eus.tech-corps.com       crmd:    error:
> process_lrm_event:      Result of stop operation for drbd_nfs on
> nfs01-az-eus.tech-corps.com: Timed Out | call=47 key=drbd_nfs_stop_0
> timeout=20000ms
>
> Mar 16 06:25:11 [880] nfs01-az-eus.tech-corps.com       crmd:   notice:
> process_lrm_event:      nfs01-az-eus.tech-corps.com-drbd_nfs_stop_0:47 [
> 1: State change failed: (-12) Device is held open by someone\nCommand
> 'drbdsetup-84 secondary 1' terminated with exit code 11\n1: State change
> failed: (-12) Device is held open by someone\nCommand 'drbdsetup-84
> secondary 1' terminated with exit code 11\n1: State change failed: (-12)
> Device is held open by someone\nCommand 'drbdsetup-84 secondary 1'
> terminated with exit
>
> I tried to resolve the issue with many recipes found via Google, but all
> attempts were unsuccessful.
> I also have another two-node cluster with exactly the same
> configuration, and it works without any issues.
>
> For now I have placed the nodes in standby mode and started all services
> manually.
> Could you please help me analyze and solve the problem?
> Thanks
>
> Here are my configuration files:
> --- CRM CONFIG ---
> crm configure show
> node 171049224: nfs01-az-eus.tech-corps.com \
>         attributes standby=off
> node 171049225: nfs02-az-eus.tech-corps.com \
>         attributes standby=on
> primitive drbd_nfs ocf:linbit:drbd \
>         params drbd_resource=nfs \
>         op monitor interval=29s role=Master \
>         op monitor interval=31s role=Slave
> primitive fs_nfs Filesystem \
>         params device="/dev/drbd1" directory="/data" fstype=ext4 \
>         meta is-managed=true
> primitive nfs lsb:nfs-kernel-server \
>         op monitor interval=5s
> primitive nmbd lsb:nmbd \
>         op monitor interval=5s
> primitive smbd lsb:smbd \
>         op monitor interval=5s
> group NFS fs_nfs nfs nmbd smbd
> ms ms_drbd_nfs drbd_nfs \
>         meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
> order fs-nfs-before-nfs inf: fs_nfs:start nfs:start
> order fs-nfs-before-nmbd inf: fs_nfs:start nmbd:start
> order fs-nfs-before-smbd inf: fs_nfs:start smbd:start
> order ms-drbd-nfs-before-fs-nfs inf: ms_drbd_nfs:promote fs_nfs:start
> colocation ms-drbd-nfs-with-ha inf: ms_drbd_nfs:Master NFS
> order nmbd-before-smbd inf: nmbd:start smbd:start
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df \
>         cluster-infrastructure=corosync \
>         cluster-name=debian \
>         stonith-enabled=false \
>         no-quorum-policy=ignore
>
>
>
> --- DRBD GLOBAL ---
> cat /etc/drbd.d/global_common.conf | grep -v '#'
>
> global {
>         usage-count no;
> }
>
> common {
>         protocol C;
>
>         handlers {
>
>         }
>
>         startup {
>         }
>
>         options {
>         }
>
>         disk {
>         }
>
>         net {
>         }
> }
>
>
> --- DRBD RESOURCE ---
> cat /etc/drbd.d/nfs.res | grep -v '#'
> resource nfs{
>   meta-disk internal;
>   device /dev/drbd1;
>   syncer {
>     verify-alg sha1;
>         rate 100M;
>   }
>
>   net{
>     max-buffers 8000;
>     max-epoch-size 8000;
>     unplug-watermark 16;
>     sndbuf-size 0;
>   }
>
>   disk{
>     disk-barrier no;
>     disk-flushes no;
>   }
>
>   on nfs01-az-eus.tech-corps.com{
>     disk /dev/sdc1;
>     address 10.50.1.8:7789;
>   }
>
>   on nfs02-az-eus.tech-corps.com{
>     disk /dev/sdc1;
>     address 10.50.1.9:7789;
>   }
> }
>
>
>
>
> --
> Segey L
>
Did you check with fuser which process is keeping the device or filesystem busy?
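
A minimal sketch of that check, assuming the device (/dev/drbd1) and mount point (/data) from the configuration quoted above. The function name report_holders is only illustrative. It expects fuser (from the psmisc package) and should be run as root on the node where the stop operation times out:

```shell
#!/bin/sh
# Device and mount point are taken from the quoted configuration;
# override them via the environment if yours differ.
DEV=${DEV:-/dev/drbd1}
MNT=${MNT:-/data}

# Report what may be keeping the DRBD device busy.
# report_holders is a hypothetical helper name, not part of DRBD or Pacemaker.
report_holders() {
    dev=$1
    mnt=$2

    # Is the filesystem still mounted? /proc/mounts starts each line
    # with the device name followed by a space.
    if grep -q "^$dev " /proc/mounts 2>/dev/null; then
        echo "$dev is still mounted"
        # Processes with open files or a working directory on that filesystem.
        fuser -vm "$mnt" || echo "fuser found no process using $mnt"
    else
        echo "$dev is not mounted"
    fi

    # Processes that have the block device itself open.
    if [ -e "$dev" ]; then
        fuser -v "$dev" || echo "fuser found no process holding $dev"
    fi

    # The kernel NFS server runs as kernel threads, so an active export
    # may not show up in fuser output; list the current exports as well.
    if command -v exportfs >/dev/null 2>&1; then
        exportfs -v
    fi
}

report_holders "$DEV" "$MNT"
```

If fuser reports nothing while the device is still mounted or exported, the holder is probably not an ordinary user-space process, and the mount and export state are the next things to look at.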