[DRBD-user] DRBD + RHCS - Failover not working

Thu Dec 10 02:06:44 CET 2009

Hi Gianluca,

I'm pleased to announce I now have it working!!!!  Yay!!!
Turns out the DRBD configuration was just fine.
It was a combination of using the right resource type (postgres-8) and getting the order/dependencies of the resources right.
Thanks for your help!!!

Here's my cluster.conf and drbd.conf for anyone who might be interested.

drbd.conf

global {
  usage-count yes;
}
common {
  protocol C;
}
resource drbd_disk {
  on rhcsnode1 {
    device    /dev/drbd0;
    disk      /dev/hdc1;
    address   10.10.10.100:7789;
    meta-disk internal;
  }
  on rhcsnode2 {
    device    /dev/drbd0;
    disk      /dev/hdc1;
    address   10.10.10.101:7789;
    meta-disk internal;
  }
}

cluster.conf

<?xml version="1.0"?>
<cluster alias="pgsql_cluster" config_version="72" name="pgsql_cluster">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="rhcsnode1.localdomain" nodeid="2" votes="1">
      <fence/>
    </clusternode>
    <clusternode name="rhcsnode2.localdomain" nodeid="3" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices/>
  <rm>
    <failoverdomains>
      <failoverdomain name="fo_domain" nofailback="0" ordered="0" restricted="0"/>
    </failoverdomains>
    <resources>
      <ip address="10.10.10.150" monitor_link="1"/>
      <postgres-8 config_file="/etc/cluster/postgres-8/postgres-8:pgsql_db/postgresql.conf" name="pgsql_db" postmaster_user="postgres" shutdown_wait="3"/>
      <fs device="/dev/drbd/by-res/drbd_disk" fstype="ext3" mountpoint="/var/lib/pgsql/data" name="fs_pgsql" options="noatime"/>
      <drbd name="res_drbd" resource="drbd_disk"/>
    </resources>
    <service autostart="1" exclusive="1" name="mezeo_ha_db" recovery="relocate">
      <ip ref="10.10.10.150"/>
      <drbd ref="res_drbd">
        <fs ref="fs_pgsql">
          <postgres-8 ref="pgsql_db"/>
        </fs>
      </drbd>
    </service>
  </rm>
</cluster>
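
For anyone wondering what "order/dependencies" means here: the part that mattered was the nesting inside the <service> block, where postgres-8 sits inside fs, which sits inside drbd. Below is a minimal Python sketch that prints those parent-to-child chains from a trimmed copy of the service block above. The helper name nesting_chains is made up for illustration, and the script only shows the nesting as written in the file; it does not model how rgmanager orders sibling resources such as the ip and drbd entries.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <service> block from the cluster.conf above.
SERVICE_XML = """
<service autostart="1" exclusive="1" name="mezeo_ha_db" recovery="relocate">
  <ip ref="10.10.10.150"/>
  <drbd ref="res_drbd">
    <fs ref="fs_pgsql">
      <postgres-8 ref="pgsql_db"/>
    </fs>
  </drbd>
</service>
"""


def nesting_chains(element, prefix=()):
    """Yield one chain of (tag, ref) pairs per leaf resource, parent first.

    Hypothetical helper: it only reflects the XML nesting, not rgmanager's
    own rules for ordering sibling resources.
    """
    children = list(element)
    if not children and prefix:
        yield prefix
    for child in children:
        yield from nesting_chains(child, prefix + ((child.tag, child.get("ref")),))


service = ET.fromstring(SERVICE_XML)
chains = list(nesting_chains(service))
for chain in chains:
    print(" -> ".join(f"{tag}:{ref}" for tag, ref in chain))
# Prints:
# ip:10.10.10.150
# drbd:res_drbd -> fs:fs_pgsql -> postgres-8:pgsql_db
```

The second chain is the dependency that counts: the DRBD resource has to be up before the filesystem can be mounted, and the filesystem has to be mounted before PostgreSQL can start.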

On Dec 9, 2009, at 4:11 PM, Gianluca Cecchi wrote:

> On Wed, Dec 9, 2009 at 9:03 PM, James Perry <jperry at mezeo.com> wrote:
>> Hi Gianluca,
>> 
>> You have described exactly what I'm doing to test this.  I kill the postmaster process.  I've tried both a script resource and a postgres-8 resource, but neither seems to work.  What's odd is that I can relocate the service to either node without issue.  It's only when I "break" a node that the recovery fails.
>> 
>> Could this be a CentOS version issue?  I'm using CentOS 5.3.
>> 
>> Let me try to clean up all my resources and try again.  I'll also be installing CentOS 5.4 to see if that matters.
>> 
>> Would you by chance have an example cluster.conf file that has Postgres and DRBD?
>> 
>> Thanks!
>> 
> 
> probably we are going off topic; sorry to the list...
> I have not managed PostgreSQL in a HA environment yet.
> But I think I was partially wrong in my previous post.
> 
> Important: Always return "0" if the status is non-fatal.
> 
> So in both cases where you get 1 or 3, they are fatal.... and my
> answer was not correct
> 
> I think you should read
> http://sources.redhat.com/cluster/wiki/FAQ/RGManager
> 
> and in particular the sections regarding:
> The rgmanager keeps stopping and restarting
> mysql/named/ypserv/httpd/other script. Why?
> and possibly
> Can I have rgmanager behave differently based on the return code of my
> init script?
> 
> One aim of RHEL 5 was to have all the init scripts LSB compliant (in
> particular this means that a stop action against a service that is not
> running should always return 0)
> See https://bugzilla.redhat.com/show_bug.cgi?id=151104
> 
> As they provide a custom OCF script for PostgreSQL, they probably
> didn't correct the standard init script.
> Here @home, where I'm sitting now, I don't have CentOS/RHEL; but I
> have an F11 system and still I get on it:
> [root@tekkaman ~]# service postgresql stop
> Stopping postgresql service:                               [  OK  ]
> [root@tekkaman ~]# echo $?
> 0
> [root@tekkaman ~]# service postgresql stop
> Stopping postgresql service:                               [FAILED]
> [root@tekkaman ~]# echo $?
> 1
> 
> When you manually relocate, you start from a situation where the
> service is running, so the stop action succeeds and the same is
> true for the start action on the other node.
> When you kill postmaster, the status action fails at the first check,
> and rgmanager is designed to
> 1) stop the service (I think to clean things up in case of
> improper termination of the service, to release locks, and more
> importantly to protect against data corruption)
> 2) start the service (and then its dependencies, if successful) on
> the other node
> 
> As the stop fails too, rgmanager gives up (probably you get your
> service in a FAILED state, I presume).
> It thinks this way: if I'm not able to cleanly stop the service,
> probably it is better not to try to start it on the other node.....
> DATA protection is first priority...
> and I agree with this!
> 
> So probably your choice is between:
> 1) manually modify the standard postgresql init script
> 2) use the provided OCF resource postgres-8
> 
> I warmly suggest 2).... did you try it?
> HIH,
> Gianluca
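
The recovery sequence Gianluca describes above (status check fails, rgmanager stops the service, and only a clean stop lets it start on the other node) can be written down as a small toy model. This is only a Python sketch of the behavior as described in this thread, not rgmanager's actual code; the function name recovery_action and the state names are made up for illustration.

```python
def recovery_action(status_rc: int, stop_rc: int) -> str:
    """Toy model of the recovery flow described in the quoted message.

    status_rc: exit code of the resource's status check
    stop_rc:   exit code of the stop action run after a failed status check
    The returned strings are illustrative labels, not rgmanager states.
    """
    if status_rc == 0:
        return "running"   # status check passed, nothing to recover
    if stop_rc != 0:
        return "failed"    # unclean stop: do not start on the other node
    return "relocate"      # clean stop: safe to start on the other node


# Init script whose stop returns 1 when the service is already down,
# as in the "service postgresql stop" transcript above:
print(recovery_action(status_rc=1, stop_rc=1))  # prints "failed"

# LSB-style stop that returns 0 for a service that is not running:
print(recovery_action(status_rc=1, stop_rc=0))  # prints "relocate"
```

That matches what I saw: with the script resource wrapping the stock init script, killing postmaster made both the status and the stop action fail, so the service never moved to the other node.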

James Perry
Principal Consultant
Mezeo Software
t: 713.244.0859
f: 713.244.0851
m: 713.444.0251