Hi Gianluca,
I'm pleased to announce I now have it working!!!! Yay!!!
Turns out the DRBD configuration was just fine.
It was a combination of using the right resource type (postgres-8) and getting the order/dependencies of the resources right.
Thanks for your help!!!
Here's my cluster.conf and drbd.conf for anyone who might be interested.
drbd.conf
global {
    usage-count yes;
}

common {
    protocol C;
}

resource drbd_disk {
    on rhcsnode1 {
        device    /dev/drbd0;
        disk      /dev/hdc1;
        address   10.10.10.100:7789;
        meta-disk internal;
    }
    on rhcsnode2 {
        device    /dev/drbd0;
        disk      /dev/hdc1;
        address   10.10.10.101:7789;
        meta-disk internal;
    }
}
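For completeness, here's roughly the sequence I used to bring the DRBD resource online for the first time. These are the standard drbdadm commands for DRBD 8.x of that era; treat the exact primary-promotion flag as an assumption and check your installed version's man page:

```shell
# Run on BOTH nodes: write metadata and bring the resource up
drbdadm create-md drbd_disk
drbdadm up drbd_disk

# Run on ONE node only: force it to Primary for the initial full sync
# (newer 8.3+ releases use "drbdadm primary --force drbd_disk" instead)
drbdadm -- --overwrite-data-of-peer primary drbd_disk

# Watch the sync progress
cat /proc/drbd
```

Only after the initial sync completes should the filesystem be created on /dev/drbd0 and the cluster service started.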
cluster.conf

<?xml version="1.0"?>
<cluster alias="pgsql_cluster" config_version="72" name="pgsql_cluster">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="rhcsnode1.localdomain" nodeid="2" votes="1">
            <fence/>
        </clusternode>
        <clusternode name="rhcsnode2.localdomain" nodeid="3" votes="1">
            <fence/>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices/>
    <rm>
        <failoverdomains>
            <failoverdomain name="fo_domain" nofailback="0" ordered="0" restricted="0"/>
        </failoverdomains>
        <resources>
            <ip address="10.10.10.150" monitor_link="1"/>
            <postgres-8 config_file="/etc/cluster/postgres-8/postgres-8:pgsql_db/postgresql.conf" name="pgsql_db" postmaster_user="postgres" shutdown_wait="3"/>
            <fs device="/dev/drbd/by-res/drbd_disk" fstype="ext3" mountpoint="/var/lib/pgsql/data" name="fs_pgsql" options="noatime"/>
            <drbd name="res_drbd" resource="drbd_disk"/>
        </resources>
        <service autostart="1" exclusive="1" name="mezeo_ha_db" recovery="relocate">
            <ip ref="10.10.10.150"/>
            <drbd ref="res_drbd">
                <fs ref="fs_pgsql">
                    <postgres-8 ref="pgsql_db"/>
                </fs>
            </drbd>
        </service>
    </rm>
</cluster>
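One tip that helped me debug the ordering: rg_test (shipped with rgmanager) can parse cluster.conf offline and show the resource tree and start/stop ordering without touching the live cluster. A quick sketch, using the service name from the config above:

```shell
# Check the XML is well-formed
xmllint --noout /etc/cluster/cluster.conf

# Show the resource tree rgmanager would build from this config
rg_test test /etc/cluster/cluster.conf

# Dry-run the start ordering for the service, without executing anything
rg_test noop /etc/cluster/cluster.conf start service mezeo_ha_db
```

This makes nesting mistakes (e.g. fs outside drbd) visible immediately.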
On Dec 9, 2009, at 4:11 PM, Gianluca Cecchi wrote:
> On Wed, Dec 9, 2009 at 9:03 PM, James Perry <jperry at mezeo.com> wrote:
>> Hi Gianluca,
>>
>> You have described exactly what I'm doing to test this. I kill the postmaster process. I've tried both a script resource as well as a postgres-8 resource but neither seem to work. What's odd is that I can relocate the service to either node without issue. It's only when I "Break" a node does the recovery fail.
>>
>> Could this be a CentOS version issue? I'm using CentOS 5.3.
>>
>> Let me try to clean up all my resources and try again. I'll also be installing a CentOS 5.4 to see if that matters.
>>
>> Would you by chance have an example cluster.conf file that has Postgres and DRBD?
>>
>> Thanks!
>>
>
> probably we are going off topic; sorry to the list...
> I have not managed PostgreSQL in a HA environment yet.
> But I think I was partially wrong inside my previous post.
>
> Important: Always return "0" if the status is non-fatal.
>
> So in both cases where you get 1 or 3, they are fatal.... and my
> answer was not correct
>
> I think you should read
> http://sources.redhat.com/cluster/wiki/FAQ/RGManager
>
> and in particular the sections regarding:
> The rgmanager keeps stopping and restarting
> mysql/named/ypserv/httpd/other script. Why?
> and eventually
> Can I have rgmanager behave differently based on the return code of my
> init script?
>
> One aim of RHEL 5 was to have all the init scripts LSB compliant (in
> particular, this means that a stop action against a non-running
> service should always return 0).
> See https://bugzilla.redhat.com/show_bug.cgi?id=151104
>
> As they provide a custom ocf script for PostgreSQL, probably they
> didn't correct the standard init script
> Here @home, where I'm sitting now, I don't have CentOS/RHEL; but I
> have an F11 system and still I get on it:
> [root at tekkaman ~]# service postgresql stop
> Stopping postgresql service: [ OK ]
> [root at tekkaman ~]# echo $?
> 0
> [root at tekkaman ~]# service postgresql stop
> Stopping postgresql service: [FAILED]
> [root at tekkaman ~]# echo $?
> 1
>
> When you manually relocate, you start in a situation where you have
> the service running, so that the stop action succeeds and the same is
> true for the start action on the other node.
> When you kill postmaster, at the first check the status action fails,
> so that rgmanager is designed to
> 1) stop the service (I think to eventually clean things in case of
> improper termination of service, to release locks, and more
> importantly to protect from data corruption)
> 2) start the service (and its dependencies, if successful) on
> the other node
>
> As the stop fails too, rgmanager gives up (probably you get your
> service in a FAILED state, I presume).
> It thinks this way: if I'm not able to cleanly stop the service,
> probably it is better not to try to start it on the other node.....
> DATA protection is first priority...
> and I agree with this!
>
> So probably your choice is between:
> 1) manually modify postgresql standard init script
> 2) use the provided ocf resource postgres-8
>
> I warmly suggest 2).... did you try it?
> HIH,
> Gianluca
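For anyone who goes with option 1 (patching the init script) instead of the postgres-8 resource: the LSB fix Gianluca describes boils down to making the stop action succeed when the service is already down. A minimal sketch of that behavior; the function name and pidfile path are mine, for illustration only:

```shell
# Hypothetical LSB-style stop helper: stopping an already-stopped
# service must exit 0, which is what rgmanager needs after a crash.
stop_postgres() {
    pidfile="${1:-/var/run/postmaster.pid}"
    if [ ! -f "$pidfile" ]; then
        echo "postgresql is already stopped"
        return 0   # LSB: not running + stop = success
    fi
    kill "$(head -n1 "$pidfile")" && rm -f "$pidfile"
}
```

With this semantic, the failed status check is followed by a clean stop, and rgmanager goes on to relocate the service instead of marking it FAILED.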
James Perry
Principal Consultant
Mezeo Software
t: 713.244.0859
f: 713.244.0851
m: 713.444.0251