Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Gianluca,

I'm pleased to announce I now have it working!!!! Yay!!! Turns out the DRBD configuration was just fine. It was a combination of using the right resource type (postgres-8) and the order/dependencies of the resources. Thanks for your help!!!

Here are my cluster.conf and drbd.conf for anyone who might be interested.

drbd.conf:

global { usage-count yes; }
common { protocol C; }
resource drbd_disk {
  on rhcsnode1 {
    device    /dev/drbd0;
    disk      /dev/hdc1;
    address   10.10.10.100:7789;
    meta-disk internal;
  }
  on rhcsnode2 {
    device    /dev/drbd0;
    disk      /dev/hdc1;
    address   10.10.10.101:7789;
    meta-disk internal;
  }
}

cluster.conf:

<?xml version="1.0"?>
<cluster alias="pgsql_cluster" config_version="72" name="pgsql_cluster">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="rhcsnode1.localdomain" nodeid="2" votes="1">
      <fence/>
    </clusternode>
    <clusternode name="rhcsnode2.localdomain" nodeid="3" votes="1">
      <fence/>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices/>
  <rm>
    <failoverdomains>
      <failoverdomain name="fo_domain" nofailback="0" ordered="0" restricted="0"/>
    </failoverdomains>
    <resources>
      <ip address="10.10.10.150" monitor_link="1"/>
      <postgres-8 config_file="/etc/cluster/postgres-8/postgres-8:pgsql_db/postgresql.conf" name="pgsql_db" postmaster_user="postgres" shutdown_wait="3"/>
      <fs device="/dev/drbd/by-res/drbd_disk" fstype="ext3" mountpoint="/var/lib/pgsql/data" name="fs_pgsql" options="noatime"/>
      <drbd name="res_drbd" resource="drbd_disk"/>
    </resources>
    <service autostart="1" exclusive="1" name="mezeo_ha_db" recovery="relocate">
      <ip ref="10.10.10.150"/>
      <drbd ref="res_drbd">
        <fs ref="fs_pgsql">
          <postgres-8 ref="pgsql_db"/>
        </fs>
      </drbd>
    </service>
  </rm>
</cluster>

On Dec 9, 2009, at 4:11 PM, Gianluca Cecchi wrote:

> On Wed, Dec 9, 2009 at 9:03 PM, James Perry <jperry at mezeo.com> wrote:
>> Hi Gianluca,
>>
>> You have described exactly what I'm doing to test this.
>> I kill the postmaster process. I've tried both a script resource and a
>> postgres-8 resource, but neither seems to work. What's odd is that I can
>> relocate the service to either node without issue. It's only when I
>> "break" a node that the recovery fails.
>>
>> Could this be a CentOS version issue? I'm using CentOS 5.3.
>>
>> Let me try to clean up all my resources and try again. I'll also be
>> installing a CentOS 5.4 to see if that matters.
>>
>> Would you by chance have an example cluster.conf file that has Postgres
>> and DRBD?
>>
>> Thanks!
>>
>
> Probably we are going off topic; sorry to the list...
> I have not managed PostgreSQL in an HA environment yet.
> But I think I was partially wrong in my previous post.
>
> Important: Always return "0" if the status is non-fatal.
>
> So in both cases, where you get 1 or 3, they are fatal... and my
> answer was not correct.
>
> I think you should read
> http://sources.redhat.com/cluster/wiki/FAQ/RGManager
>
> and in particular the sections regarding:
> "The rgmanager keeps stopping and restarting
> mysql/named/ypserv/httpd/other script. Why?"
> and eventually
> "Can I have rgmanager behave differently based on the return code of my
> init script?"
>
> One aim of RHEL 5 was to have all the init scripts LSB compliant (in
> particular, this means that a stop action against a service that is not
> running should always return 0).
> See https://bugzilla.redhat.com/show_bug.cgi?id=151104
>
> As they provide a custom OCF script for PostgreSQL, probably they
> didn't correct the standard init script.
> Here @home, where I'm sitting now, I don't have CentOS/RHEL, but I
> have an F11 system and on it I still get:
>
> [root at tekkaman ~]# service postgresql stop
> Stopping postgresql service:                               [  OK  ]
> [root at tekkaman ~]# echo $?
> 0
> [root at tekkaman ~]# service postgresql stop
> Stopping postgresql service:                               [FAILED]
> [root at tekkaman ~]# echo $?
> 1
>
> When you manually relocate, you start in a situation where the service
> is running, so the stop action succeeds, and the same is true for the
> start action on the other node.
> When you kill postmaster, the status action fails at the first check,
> and rgmanager is designed to:
> 1) stop the service (I think to eventually clean things up in case of
> improper termination of the service, to release locks, and, more
> importantly, to protect from data corruption)
> 2) start the service (and then its dependencies, if successful) on
> the other node
>
> As the stop fails too, rgmanager gives up (probably you get your
> service in a FAILED state, I presume).
> It thinks this way: if I'm not able to cleanly stop the service,
> probably it is better not to try to start it on the other node...
> DATA protection is first priority...
> and I agree with this!
>
> So probably your choice is between:
> 1) manually modifying the standard postgresql init script
> 2) using the provided OCF resource postgres-8
>
> I warmly suggest 2)... did you try it?
> HIH,
> Gianluca

James Perry
Principal Consultant
Mezeo Software
t: 713.244.0859
f: 713.244.0851
m: 713.444.0251
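[Editor's aside: for readers curious about option 1 above (patching the init script rather than using the postgres-8 OCF resource), here is a minimal sketch of the LSB stop semantics rgmanager relies on. The function name, pidfile path, and wrapper approach are all assumptions for illustration, not from the thread; a real fix would go in the distribution's init script.]

```shell
#!/bin/sh
# Hypothetical LSB-compliant "stop" helper. LSB requires that stopping a
# service that is already stopped exit 0; rgmanager depends on this
# during recovery (status fails -> stop -> start on the other node).

PIDFILE=/var/run/postmaster.pid   # assumption: adjust to your layout

lsb_stop() {
    if [ ! -f "$PIDFILE" ]; then
        # Already stopped: return success, per LSB. Returning non-zero
        # here is what leaves the rgmanager service in a FAILED state.
        echo "postgresql already stopped"
        return 0
    fi
    # Normal case: signal the postmaster and clean up the pidfile.
    kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
}
```

With a stop action shaped like this, the second `service postgresql stop` in the F11 transcript above would print a message and exit 0 instead of returning 1.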