[DRBD-user] crm-fence-peer.sh and multiple communication channels

CART Andreas andreas.cart at sonorys.at
Thu Jan 19 18:23:46 CET 2017

Hi again

Since there has been no response so far, I went ahead and tried to get the script working myself.
My current quick-and-dirty solution is to replace all occurrences of HOSTNAME and DRBD_PEER in the script with new variables that I fill with the correct names.
At least for the first basic scenario this seems to work correctly.
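
To illustrate the kind of substitution described above, here is a minimal sketch. The node names, the function name map_to_cluster_name and the variables CLUSTER_SELF and CLUSTER_PEER are all hypothetical placeholders, not names from the real script or from this setup.

```shell
# Hypothetical name map: translate the name the handler sees (e.g. $HOSTNAME
# or $DRBD_PEER) into the node name that appears in the CIB.
# All names here are placeholders and must be adapted to the real cluster.
map_to_cluster_name() {
    case "$1" in
        node1) echo "node1-cluster" ;;
        node2) echo "node2-cluster" ;;
        *)     echo "$1" ;;   # unknown name: pass through unchanged
    esac
}

# Variables a modified script could use instead of HOSTNAME and DRBD_PEER.
CLUSTER_SELF=$(map_to_cluster_name "${HOSTNAME:-$(uname -n)}")
CLUSTER_PEER=$(map_to_cluster_name "${DRBD_PEER:-}")
```

Keeping the translation in one function means the rest of the script only needs the variable names swapped; whether this covers every place where the script compares names is an open question.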

Any comments on this?

And there is a related question:
Do you have any recommendations on which tests to perform to verify that the (modified) script operates correctly?
My understanding is that the fencing part is called when no DRBD connection is possible and a peer changes to role=primary.
The unfencing part is called when the DRBD connection is re-established.
Is that correct, or are there further conditions to consider?
Testing should also cover situations where the Pacemaker node is reachable as well as situations where it is unreachable, shouldn't it?
Anything else?
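
For such test runs, a small wrapper that records the exit code and the elapsed time of each handler call might help to tell a timeout apart from a quick result. This is only a sketch; run_handler is a hypothetical helper and not part of DRBD or Pacemaker.

```shell
# Hypothetical test helper: run a command, print its exit code and runtime,
# and hand the exit code back to the caller.
run_handler() {
    start=$(date +%s)
    rc=0
    "$@" || rc=$?
    end=$(date +%s)
    echo "rc=$rc elapsed=$((end - start))s"
    return "$rc"
}

# Stand-in for a real handler call; replace 'sh -c "exit 5"' with the
# actual script invocation when testing on a cluster node.
run_handler sh -c 'exit 5' || true
```

An exit code of 5 together with an elapsed time of about 60 seconds would correspond to the timeout case described in the original message below.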

Kind regards
-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of CART Andreas
Sent: Tuesday, 17 January 2017 16:39
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] crm-fence-peer.sh and multiple communication channels


I have a basic 2-node active/passive cluster with Pacemaker (1.1.14, pcs 0.9.148) / CMAN / Corosync (1.4.7) on RHEL 6.8.
This cluster runs HA-services on top of DRBD (8.4.4).

Basically the system works on both nodes and I can switch the resources from one node to the other.
But all automatic calls of crm-fence-peer.sh (triggered by the appropriate test cases) end with exit code 5, i.e. a timeout after 60 seconds.

After some debugging sessions I am pretty sure the problem is due to the script comparing information read from the CIB against $HOSTNAME.
In our case no matching entries are found and thus the script tries over and over again until the timeout expires.

The reason why no matching entries are found is our use of multiple communication channels:
-- bond0: standard interface matching $HOSTNAME;
          used to offer the clustered HA services via clustered IP addresses and to access the individual nodes for maintenance
-- bond1: interface reserved for cluster communication (Pacemaker, Corosync), matching the information found in the CIB
-- bond2: interface reserved for DRBD replication, matching the information found in the DRBD configuration
I.e. each node actually has 3 different names, one on each interface.
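
The mismatch described above can be made visible by comparing the OS hostname with the name the cluster uses for the local node. The following is a minimal sketch; check_node_name is a hypothetical helper, and the commented call assumes that crm_node is installed on the node.

```shell
# Hypothetical helper: report whether the OS hostname equals the name the
# cluster uses for this node. Returns 0 on match, 1 on mismatch.
check_node_name() {
    os_name=$1
    cluster_name=$2
    if [ "$os_name" = "$cluster_name" ]; then
        echo "match: $os_name"
    else
        echo "mismatch: hostname=$os_name cluster=$cluster_name"
        return 1
    fi
}

# On a live cluster node (assumes crm_node is available):
#   check_node_name "$(uname -n)" "$(crm_node -n)"
```

With three interface names per node as listed above, a mismatch would be the expected result whenever the bond0 name and the bond1 name differ.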

I think this setup with dedicated communication channels is common.
So I guess there must be some best practice for dealing with this situation, and I am just missing some configuration item that tells the system which names to use.

Please advise ...

Kind regards