[DRBD-user] HA DRBD setup - graceful failover/active node detection

Pascal BERTON pascal.berton3 at free.fr
Thu Jan 5 12:14:51 CET 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hi Elias !

 

“crm status” will tell you on which node a given resource is active. You can
also use “crm_mon” (underscore!), which will present the same thing in real
time (“crm status” is a one-shot run).
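
For example, a quick look could go like this (just a sketch; on my systems
“crm_mon -1” also gives a one-shot display, but the exact options may vary
with your Pacemaker/crmsh version):

  crm status      # one-shot overview: nodes, resources and where they run
  crm_mon -1      # same information via crm_mon, printed once
  crm_mon         # continuously refreshing view; Ctrl-C to leave it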

Basically, crm is the command to use to do everything you intend to do.

Regarding the iSCSI target daemon: you have declared an IP resource in your
cluster, the one that your remote iSCSI initiators point to. Since that IP
is a cluster resource, you will see it in the crm status report, and you
will know which node owns it.
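
If you only want to locate that single resource, something like the
following should do (p_iscsi_ip is just a placeholder for whatever you named
your IP resource):

  crm_resource --resource p_iscsi_ip --locate   # prints the node running it
  crm status | grep p_iscsi_ip                  # or simply grep the report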

 

In order to fail over your resources, guess what, you may use crm too! :-) As
far as I remember, it’s something like “crm resource migrate <res_name>
<target_host>”. Have a look at the crm man pages for more details.
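
For example, with made-up names r_iscsi and node2, the move and the cleanup
afterwards would look roughly like this (migrate works by adding a location
constraint, so you normally remove it again once the move is done):

  crm resource migrate r_iscsi node2   # push the resource over to node2
  crm resource unmigrate r_iscsi       # drop the temporary constraint later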

You may also manually modify the cib config within the cluster to change the
scores of your resources. This is what I used to do, although I’m not sure it
is actually a best practice… To make it short, the “score” is a sort of
“weight” that you give to your resource on a given host. The host on which
the weight/score is highest is the host the resource is tied to. Change the
scores, and the resource moves.
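
In crm shell syntax such a score usually lives in a location constraint, for
instance (loc_prefer_node1, g_iscsi and node1 are made-up names, 200 is an
arbitrary score, and INFINITY would pin the resource to that node):

  crm configure location loc_prefer_node1 g_iscsi 200: node1
  crm configure show   # check the constraint (and the score) you just added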

 

You must read at least 2 docs to better understand this complex stuff:

1)      “Pacemaker 1.0 Configuration Explained”, by Andrew Beekhof. There
might be a more recent release, but I don’t know of it… I had to read it
twice, but it gives valuable information regarding the way a Pacemaker
cluster is structured and works. This manual is worth its weight in gold!

2)      And then the “CRM CLI guide” (not sure which version is the latest;
I have the 0.94) by Dejan Muhamedagic and Yan Gao, to understand everything
crm is able to achieve, and that’s quite a lot!

Also, the “Clusters from Scratch” manual is a good introduction, and it
contains DRBD examples. Maybe you should start with it, to pick up the first
concepts… It is easier to read than the “Pacemaker 1.0 Configuration
Explained” I mentioned above.

 

You’ll find all this on the web of course!

 

HTH!

 

Best regards,

 

Pascal.

 

From: drbd-user-bounces at lists.linbit.com
[mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Elias
Chatzigeorgiou
Sent: Thursday, January 5, 2012 03:14
To: drbd-user at lists.linbit.com
Subject: [DRBD-user] HA DRBD setup - graceful failover/active node detection

 

 

I have a two-node active/passive cluster, with DRBD controlled by
corosync/pacemaker. 

All storage is based on LVM.

 

 

------------------------------------------------------------------------------

a) How do I know which node of the cluster is currently active?

   How can I check if a node is currently in use by the iSCSI-target daemon?

 

   I can try to deactivate a volume group using:

 

[root@node1 ~]# vgchange -an data
  Can't deactivate volume group "data" with 3 open logical volume(s)


 

If I get a message like the above, then I know that node1 is the active
node, but is there a better (non-intrusive) way to check?

 

A better option seems to be 'pvs -v'. If the node is active then it shows
the volume names:

[root@node1 ~]# pvs -v
    Scanning for physical volume names
  PV         VG      Fmt  Attr PSize   PFree DevSize PV UUID
  /dev/drbd1 data    lvm2 a-   109.99g    0  110.00g c40m9K-tNk8-vTVz-tKix-UGyu-gYXa-gnKYoJ
  /dev/drbd2 tempdb  lvm2 a-    58.00g    0   58.00g 4CTq7I-yxAy-TZbY-TFxa-3alW-f97X-UDlGNP
  /dev/drbd3 distrib lvm2 a-    99.99g    0  100.00g l0DqWG-dR7s-XD2M-3Oek-bAft-d981-UuLReC

 

whereas on the inactive node it gives errors:

[root@node2 ~]# pvs -v
    Scanning for physical volume names
  /dev/drbd0: open failed: Wrong medium type
  /dev/drbd1: open failed: Wrong medium type

 

Any further ideas/comments/suggestions?

 

------------------------------------------------------------------------------

 

b) How can I gracefully fail over to the other node? Up to now, the only way I
   know is forcing the active node to reboot (by entering two subsequent
   'reboot' commands). This however breaks the DRBD synchronization, and I
   need to use a fix-split-brain procedure to bring the DRBD back in sync.

 

   On the other hand, if I try to stop the corosync service on the active
   node, the command takes forever! I understand that the suggested procedure
   should be to disconnect all clients from the active node and then stop
   services. Is it a better approach to shut down the public network interface
   before stopping the corosync service (in order to forcibly close client
   connections)?

 

Thanks

 

 

 


