[DRBD-user] Dual-Primary DRBD node fenced after other node reboots UP

Raman Gupta <ramangupta16@gmail.com>
Tue May 23 13:04:32 CEST 2017


> *why*

> DRBD would not do that by itself,
> so likely pacemaker decided to do that,
> and you have to figure out *why*.
> Pacemaker will have logged the reasons somewhere.

The crm-fence-peer.sh script could not determine the status of the peer node
(which had gone down), assumed its status was "unknown", and therefore placed
a location constraint with a -INFINITY score on the DRBD resource, which
essentially demotes and stops DRBD. The demotion failed because the GFS2
filesystem was still mounted. Pacemaker treated this failure as an error and
scheduled stonith for the surviving node once the down node came back.
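
For reference, the DRBD side of this fencing integration looks roughly like
the following (DRBD 8.4 syntax). This is a sketch, not copied from my config;
for dual-primary with stonith-enabled Pacemaker the DRBD documentation
recommends the stricter `resource-and-stonith` policy:

```
# /etc/drbd.d/global_common.conf (sketch; handler paths per a stock
# CentOS 7 / DRBD 8.4 install)
common {
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  disk {
    # resource-and-stonith freezes I/O until the peer is fenced
    fencing resource-and-stonith;
  }
}
```

The fence-peer handler is what inserts the -INFINITY constraint into the CIB,
and crm-unfence-peer.sh removes it again after resync completes.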

> "crm-fence-peer.sh" assumes that the result of "uname -n"
> is the local nodes "pacemaker node name".
Yes.

> If "uname -n" and "crm_node -n" do not return the same thing for you,
> the defaults will not work for you.

For my setup the replication network (and its hostname) is different from the
client-facing network (and its hostname):
[root@server7]# uname -n
server7
[root@server7]# crm_node -n
server7ha

However, things seem to be working with these settings.


>Then in addition to all your other trouble,
> you have missing dependency constraints.

Properly integrating the DRBD, GFS2, DLM and CLVM resources into Pacemaker
was the real issue. Getting the resource definitions and the ordering
constraints right was tricky and took time. In the end I made DLM, CLVM and
GFS2 cloned resources and DRBD a master/slave resource (with master-max=2)
for my dual-Primary setup. With that, I arrived at the correct ordering of
these resources:
Start & promote DRBD, then start DLM, then start CLVM, then start GFS2.

Now things work fine.
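
For completeness, the pcs commands to build that stack look roughly like the
following. This is a sketch only: the DRBD resource name (r0), the device and
mount paths, and the operation timeouts are assumptions, not values from my
actual setup.

```shell
# Sketch (pcs 0.9 syntax, as shipped with CentOS 7 / pacemaker 1.1.15).
# drbd_resource=r0, device and directory paths are assumed, not actual values.

# DRBD as master/slave, promotable on both nodes (dual-Primary)
pcs resource create drbd_data ocf:linbit:drbd drbd_resource=r0 \
    op monitor interval=30s
pcs resource master drbd_data_clone drbd_data \
    master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

# DLM, CLVM and GFS2 as clones running on both nodes
pcs resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm \
    op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create Gfs2FS ocf:heartbeat:Filesystem \
    device=/dev/mapper/vg-lv directory=/data fstype=gfs2 clone interleave=true

# Ordering: promote DRBD, then DLM, then CLVM, then GFS2
pcs constraint order promote drbd_data_clone then start dlm-clone
pcs constraint order start dlm-clone then start clvmd-clone
pcs constraint order start clvmd-clone then start Gfs2FS-clone

# Keep each layer on nodes where the layer below it is running
pcs constraint colocation add dlm-clone with drbd_data_clone INFINITY
pcs constraint colocation add clvmd-clone with dlm-clone INFINITY
pcs constraint colocation add Gfs2FS-clone with clvmd-clone INFINITY
```

These commands reproduce the constraints shown in the `pcs constraint show`
output below.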


To help anyone in a similar situation, here is my cluster status:
---------------------------------------------------------------------------------------------

[root@server4 ~]# pcs status
Cluster name: vCluster
Stack: corosync
Current DC: server4ha (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Tue May 23 15:53:20 2017
Last change: Mon May 22 22:13:08 2017 by root via cibadmin on server4ha

2 nodes and 11 resources configured

Online: [ server4ha server7ha ]

Full list of resources:

 vCluster-VirtualIP-10.168.10.199       (ocf::heartbeat:IPaddr2):       Started server4ha
 vCluster-Stonith-server4ha     (stonith:fence_ipmilan):        Started server7ha
 vCluster-Stonith-server7ha     (stonith:fence_ipmilan):        Started server4ha
 Clone Set: dlm-clone [dlm]
     Started: [ server4ha server7ha ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ server4ha server7ha ]
 Master/Slave Set: drbd_data_clone [drbd_data]
     Masters: [ server4ha server7ha ]
 Clone Set: Gfs2FS-clone [Gfs2FS]
     Started: [ server4ha server7ha ]

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@server4 ~]#

My cluster constraints (note the three ordering constraints):
-----------------------------------------------------------------------------------
[root@server4 ~]# pcs constraint show
Location Constraints:
  Resource: vCluster-Stonith-server4ha
    Disabled on: server4ha (score:-INFINITY)
  Resource: vCluster-Stonith-server7ha
    Disabled on: server7ha (score:-INFINITY)
Ordering Constraints:
  promote drbd_data_clone then start dlm-clone (kind:Mandatory)
  start dlm-clone then start clvmd-clone (kind:Mandatory)
  start clvmd-clone then start Gfs2FS-clone (kind:Mandatory)
Colocation Constraints:
  dlm-clone with drbd_data_clone (score:INFINITY)
  clvmd-clone with dlm-clone (score:INFINITY)
  Gfs2FS-clone with clvmd-clone (score:INFINITY)
Ticket Constraints:
[root@server4 ~]#


Thanks for all your help.

-- Raman


On Fri, May 12, 2017 at 8:30 PM, Lars Ellenberg <lars.ellenberg@linbit.com>
wrote:

> On Fri, May 12, 2017 at 02:04:57AM +0530, Raman Gupta wrote:
> > > I don't think this has anything to do with DRBD, because:
> > OK.
> >
> > > Apparently, something downed the NICs for corosync communication.
> > > Which then leads to fencing.
> > No problem with NICs.
> >
> > > Maybe you should double check your network configuration,
> > > and any automagic reconfiguration of the network,
> > > and only start corosync once your network is "stable"?
> > As another manifestation of similar problem of dual-Primary DRBD
> integrated
> > with stonith enabled Pacemaker: When server7 goes down, the DRBD resource
> > on surviving node server4 is attempted to be demoted as secondary.
>
> *why*
>
> DRBD would not do that by itself,
> so likely pacemaker decided to do that,
> and you have to figure out *why*.
> Pacemaker will have logged the reasons somewhere.
>
> Seeing that you have different "uname -n" and "pacemaker node names",
> that may well be the source of all your troubles.
>
> "crm-fence-peer.sh" assumes that the result of "uname -n"
> is the local nodes "pacemaker node name".
>
> If "uname -n" and "crm_node -n" do not return the same thing for you,
> the defaults will not work for you.
>
> > The
> > demotion fails because DRBD is hosting a GFS2 volume and Pacemaker
> complains
> > of this failure as an error.
>
> Then in addition to all your other trouble,
> you have missing dependency constraints.
> IF pacemaker decides it needs to "demote" DRBD,
> it should know that it has a file system mounted,
> and should know that it needs to first unmount,
> and that it needs to first stop services accessing that mount,
> and so on.
>
> If it did not attempt to do that, your pacemaker config is broken.
> If it did attempt to do that and failed,
> you will have to look into why, which, again, should be in the logs.
>
> Double check constraints, and also double check if GFS2/DLM fencing is
> properly integrated with pacemaker.
>
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
>
> DRBD® and LINBIT® are registered trademarks of LINBIT
> __
> please don't Cc me, but send to list -- I'm subscribed
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>