[DRBD-user] Pacemaker/DRBD9 fail-over issue Centos8

Brent Jensen jeneral9 at gmail.com
Thu Jan 14 22:27:50 CET 2021

Problem: When I run "pcs node standby" on the current master, that node
demotes fine, but the slave does not promote to master. The slave keeps
looping on the same errors, including "Refusing to be Primary while peer
is not outdated" and "Could not connect to the CIB." At this point the
old master has already unloaded DRBD. The only way to recover is to
start DRBD on the standby node (e.g. drbdadm up r0). The logs below are
from the node trying to become master.
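
For reference, the test and the manual recovery described above could be
sketched roughly as follows. This is only a sketch under assumptions:
nfs6 is taken to be the current master (it is the peer named in the
kernel log lines further down), the resource is r0, and the run helper
and DRY_RUN variable are made up for illustration. By default the script
only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the fail-over test and the manual recovery described above.
# Assumptions: nfs6 is the current master (inferred from the logs), and
# run/DRY_RUN are hypothetical helpers that only print commands by default.
OLD_MASTER="${OLD_MASTER:-nfs6}"
RES="${RES:-r0}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    run pcs node standby "$OLD_MASTER"   # demote the current master
    run pcs status                       # check whether the peer was promoted
    run drbdadm status "$RES"            # DRBD's own view of roles and disks
}

recover() {
    # Run on the old master: bring DRBD back up so the peers reconnect.
    run drbdadm up "$RES"
}
```

Calling reproduce on either cluster node and then recover on the old
master mirrors the steps in the problem description.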

I have done this on DRBD 9/CentOS 7 without any problems, so I don't know
where the issue is (crm-fence-peer.9.sh? DRBD? newer Pacemaker?). DRBD
itself seems to work fine; it is unclear whether I need some additional
configuration. There are some slight pcs config changes between CentOS 7
and 8.

Appreciate any help!


Basic Config (Centos 8 packages):
2 Node Master/Slave
OS: Centos8
Pacemaker: pacemaker-2.0.4-6.el8_3.1
Corosync: corosync-3.0.3-4.el8

DRBD config:
resource r0 {
        protocol C;

        disk {
                on-io-error             detach;
                no-disk-flushes;
                c-plan-ahead            10;
                c-fill-target           24M;
                c-min-rate              10M;
                c-max-rate              1000M;
        }

        net {
                fencing                 resource-only;

                # max-epoch-size        20000;
                max-buffers             36k;
                sndbuf-size             1024k;
                rcvbuf-size             2048k;
        }

        handlers {
                # these handlers are necessary for drbd 9.0 + pacemaker
                fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh --timeout 30 --dc-timeout 60";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
        }

        options {
                auto-promote yes;
        }

        on nfs5 {
                node-id   0;
                device    /dev/drbd0;
                disk      /dev/sdb1;
                meta-disk internal;
        }

        on nfs6 {
                node-id   1;
                device    /dev/drbd0;
                disk      /dev/sdb1;
                meta-disk internal;
        }
}

Pacemaker Config
Cluster Name: nfs
Corosync Nodes:
 nfs5 nfs6
Pacemaker Nodes:
 nfs5 nfs6

 Group: cluster_group
  Resource: fs_drbd (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd0 directory=/data/ fstype=xfs
   Meta Attrs: target-role=Started
   Operations: monitor interval=20s timeout=40s
               start interval=0 timeout=60 (fs_drbd-start-interval-0)
               stop interval=0 timeout=60 (fs_drbd-stop-interval-0)
 Clone: drbd0-clone
  Meta Attrs: clone-max=2 clone-node-max=1 notify=true promotable=true promoted-max=1 promoted-node-max=1
  Resource: drbd0 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=r0
   Operations: demote interval=0s timeout=90 (drbd0-demote-interval-0s)
               monitor interval=20 role=Slave timeout=20
               monitor interval=10 role=Master timeout=20
               notify interval=0s timeout=90 (drbd0-notify-interval-0s)
               promote interval=0s timeout=90 (drbd0-promote-interval-0s)
               reload interval=0s timeout=30 (drbd0-reload-interval-0s)
               start interval=0s timeout=240 (drbd0-start-interval-0s)
               stop interval=0s timeout=100 (drbd0-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote drbd0-clone then start cluster_group (kind:Mandatory)
Colocation Constraints:
  cluster_group with drbd0-clone (score:INFINITY) (with-rsc-role:Master)
Ticket Constraints:

 No alerts defined

Resources Defaults:
  No defaults set
Operations Defaults:
  No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: nfs
 dc-version: 2.0.4-6.el8_3.1-2deceaa3ae
 have-watchdog: false
 last-lrm-refresh: 1610570527
 no-quorum-policy: ignore
 stonith-enabled: false

 No tags defined

Quorum:
  Options:
    wait_for_all: 0

Error Logs

pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
kernel: drbd r0 nfs6: peer( Primary -> Secondary )
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
kernel: drbd r0 nfs6: Preparing remote state change 3411954157
kernel: drbd r0 nfs6: Committing remote state change 3411954157
kernel: drbd r0 nfs6: conn( Connected -> TearDown ) peer( Secondary -> Unknown )
kernel: drbd r0/0 drbd0 nfs6: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
kernel: drbd r0 nfs6: ack_receiver terminated
kernel: drbd r0 nfs6: Terminating ack_recv thread
kernel: drbd r0 nfs6: Restarting sender thread
drbdadm[89851]: drbdadm: Unknown command 'disconnected'
kernel: drbd r0 nfs6: Connection closed
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm disconnected
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm disconnected exit code 1 (0x100)
kernel: drbd r0 nfs6: conn( TearDown -> Unconnected )
kernel: drbd r0 nfs6: Restarting receiver thread
kernel: drbd r0 nfs6: conn( Unconnected -> Connecting )
pacemaker-attrd[7671]: notice: Setting master-drbd0[nfs6]: 10000 -> (unset)
pacemaker-attrd[7671]: notice: Setting master-drbd0[nfs5]: 10000 -> 1000
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
pacemaker-controld[7673]: notice: Result of notify operation for drbd0 on nfs5: ok
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer
UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh
crm-fence-peer.9.sh[89928]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[89928]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[89928]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[89928]: (connect_with_main_loop) #011debug: Connected to controller IPC (attached to main loop)
crm-fence-peer.9.sh[89928]: (post_connect) #011debug: Sent IPC hello to
crm-fence-peer.9.sh[89928]: (qb_ipcc_disconnect) #011debug:
crm-fence-peer.9.sh[89928]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-7673-89963-13-RTpTPN/qb-request-crmd-header
crm-fence-peer.9.sh[89928]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-7673-89963-13-RTpTPN/qb-response-crmd-header
crm-fence-peer.9.sh[89928]: (qb_rb_close_helper) #011debug: Closing ringbuffer: /dev/shm/qb-7673-89963-13-RTpTPN/qb-event-crmd-header
crm-fence-peer.9.sh[89928]: (ipc_post_disconnect) #011info: Disconnected from controller IPC API
crm-fence-peer.9.sh[89928]: (pcmk_free_ipc_api) #011debug: Releasing controller IPC API
crm-fence-peer.9.sh[89928]: (crm_xml_cleanup) #011info: Cleaning up memory from libxml2
crm-fence-peer.9.sh[89928]: (crm_exit) #011info: Exiting crm_node | with status 0
crm-fence-peer.9.sh[89928]: /
crm-fence-peer.9.sh[89928]: Could not connect to the CIB: No such device or
crm-fence-peer.9.sh[89928]: Init failed, could not perform requested
crm-fence-peer.9.sh[89928]: WARNING DATA INTEGRITY at RISK: could not place the fencing constraint!
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer exit code 1
kernel: drbd r0 nfs6: fence-peer helper broken, returned 1
kernel: drbd r0: State change failed: Refusing to be Primary while peer is not outdated
kernel: drbd r0: Failed: role( Secondary -> Primary )
kernel: drbd r0 nfs6: helper command: /sbin/drbdadm fence-peer
DRBD_BACKING_DEV_0=/dev/sdb1 DRBD_CONF=/etc/drbd.conf
UP_TO_DATE_NODES=0x00000001 /usr/lib/drbd/crm-fence-peer.9.sh
crm-fence-peer.9.sh[24197]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
crm-fence-peer.9.sh[24197]: (qb_rb_open_2) #011debug: shm size:131085; real_size:135168; rb->word_size:33792
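
To see at a glance what the fence-peer handler returned on each attempt,
the kernel lines above can be filtered for the helper's exit code. The
following is only a rough sketch: the function names are made up, and
the meanings attached to codes 3 to 7 are an assumption based on the
fence-peer handler description in the drbd.conf man page, so they should
be checked against the installed DRBD version.

```shell
#!/bin/sh
# Hypothetical helpers for summarizing fence-peer results from syslog text.

fence_peer_exit_codes() {
    # Read log lines on stdin and print the exit code of each finished
    # "drbdadm fence-peer" helper call, one per line.
    sed -n 's/.*drbdadm fence-peer exit code \([0-9][0-9]*\).*/\1/p'
}

describe_fence_code() {
    # Assumed meanings, taken from the drbd.conf fence-peer documentation.
    case "$1" in
        3) echo "peer disk was already Inconsistent" ;;
        4) echo "peer disk is Outdated" ;;
        5) echo "peer could not be reached" ;;
        6) echo "peer refused: it is Primary" ;;
        7) echo "peer was fenced off the cluster" ;;
        *) echo "handler failed or returned an unexpected code" ;;
    esac
}
```

For example, "journalctl -k | fence_peer_exit_codes | sort | uniq -c"
counts the codes seen so far. The attempt shown in the log above returns
1, which falls into the last branch and matches the "fence-peer helper
broken, returned 1" message.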