[DRBD-user] pacemaker + drbd9 and prefers location constraint

Salatiel Filho salatiel.filho at gmail.com
Mon Dec 11 20:51:00 CET 2023


Hello, everyone. I am seeing a behaviour that I cannot quite understand
when I have DRBD managed by Pacemaker together with a "prefers"
location constraint.

These are my resources:

Full List of Resources:
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * Promoted: [ pcs01 ]
    * Unpromoted: [ pcs02 pcs03 ]
  * Resource Group: nfs:
    * portblock_on_nfs    (ocf:heartbeat:portblock):     Started pcs01
    * vip_nfs    (ocf:heartbeat:IPaddr2):     Started pcs01
    * drbd_fs    (ocf:heartbeat:Filesystem):     Started pcs01
    * nfsd    (ocf:heartbeat:nfsserver):     Started pcs01
    * exportnfs    (ocf:heartbeat:exportfs):     Started pcs01
    * portblock_off_nfs    (ocf:heartbeat:portblock):     Started pcs01
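
(For context: the nfs group is tied to the promoted side of the DRBD
clone with the usual colocation/order pair, roughly along these lines;
the exact syntax depends on the pcs version.)

  # roughly: colocate the group with the Promoted clone and start it only after promotion
  pcs constraint colocation add nfs with Promoted DRBDData-clone INFINITY
  pcs constraint order promote DRBDData-clone then start nfs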

And I have a location preference for the DRBDData-clone resource on pcs01:
resource 'DRBDData-clone' prefers node 'pcs01' with score INFINITY
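
That is, a constraint created with something like:

  # more or less how the constraint above was created
  pcs constraint location DRBDData-clone prefers pcs01=INFINITY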

# drbdadm status
exports role:Primary
  disk:UpToDate
  pcs02.lan role:Secondary
    peer-disk:UpToDate
  pcs03.lan role:Secondary
    peer-disk:UpToDate


Now, while pcs01 is providing the resources, I mount the NFS export on
a client and start copying a 15 GB random file.
After about 5 GB has been copied I pull the plug on the pcs01 node. After
a few seconds pcs02 is promoted and the copy resumes.
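
(Roughly, the client side of the test looks like this; the VIP address,
export path and file name below are just placeholders for my actual values.)

  # on an NFS client; <vip>, /exports and random-15G.img are placeholders
  mount -t nfs <vip>:/exports /mnt/test
  cp random-15G.img /mnt/test/    # pcs01 gets unplugged about 5 GB into this copy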

Output of drbdadm status (now on pcs02):

exports role:Primary
  disk:UpToDate
  pcs01.lan connection:Connecting
  pcs03.lan role:Secondary congested:yes ap-in-flight:1032 rs-in-flight:0
    peer-disk:UpToDate

Output of pcs status:

Node List:
  * Online: [ pcs02 pcs03 ]
  * OFFLINE: [ pcs01 ]

Full List of Resources:
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * Promoted: [ pcs02 ]
    * Unpromoted: [ pcs03 ]
    * Stopped: [ pcs01 ]
  * Resource Group: nfs:
    * portblock_on_nfs    (ocf:heartbeat:portblock):     Started pcs02
    * vip_nfs    (ocf:heartbeat:IPaddr2):     Started pcs02
    * drbd_fs    (ocf:heartbeat:Filesystem):     Started pcs02
    * nfsd    (ocf:heartbeat:nfsserver):     Started pcs02
    * exportnfs    (ocf:heartbeat:exportfs):     Started pcs02
    * portblock_off_nfs    (ocf:heartbeat:portblock):     Started pcs02



Now, when the 15 GB file is at roughly 14 GB copied (so at least 9 GB was
written while pcs01 was down and will need to be resynced once it is back
online), I power pcs01 on again.

Since pcs01 is the preferred node in Pacemaker, as soon as Pacemaker
detects it the services move back there.
The question is: how can an "inconsistent/degraded" replica become
Primary before the resync has completed? This is drbdadm status on pcs01
right after it rejoins:

# drbdadm status
exports role:Primary
  disk:Inconsistent
  pcs02.lan role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:79.16
  pcs03.lan role:Secondary
    replication:PausedSyncT peer-disk:UpToDate done:78.24 resync-suspended:dependency

The service moved back to pcs01:
Node List:
  * Online: [ pcs01 pcs02 pcs03 ]

Full List of Resources:
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * Promoted: [ pcs01 ]
    * Unpromoted: [ pcs02 pcs03 ]
  * Resource Group: nfs:
    * portblock_on_nfs    (ocf:heartbeat:portblock):     Started pcs01
    * vip_nfs    (ocf:heartbeat:IPaddr2):     Started pcs01
    * drbd_fs    (ocf:heartbeat:Filesystem):     Started pcs01
    * nfsd    (ocf:heartbeat:nfsserver):     Started pcs01
    * exportnfs    (ocf:heartbeat:exportfs):     Started pcs01
    * portblock_off_nfs    (ocf:heartbeat:portblock):     Started pcs01


# drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ fd0904f7bf256ecd380e1c19ec73c712f3855d40\
build\ by\ mockbuild@42fe748df8a24339966f712147eb3bfd\,\ 2023-11-01\
01:47:26
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090111
DRBD_KERNEL_VERSION=9.1.17
DRBDADM_VERSION_CODE=0x091a00
DRBDADM_VERSION=9.26.0
# cat /etc/redhat-release
AlmaLinux release 9.3 (Shamrock Pampas Cat)


Is that a bug? Shouldn't that corrupt the filesystem?




Atenciosamente/Kind regards,
Salatiel

