[DRBD-user] DRBD + RHCS - Failover not working

Wed Dec 9 23:11:54 CET 2009

On Wed, Dec 9, 2009 at 9:03 PM, James Perry <jperry at mezeo.com> wrote:
> Hi Gianluca,
>
> You have described exactly what I'm doing to test this.  I kill the postmaster process.  I've tried both a script resource as well as a postgres-8 resource but neither seem to work.  What's odd is that I can relocate the service to either node without issue.  It's only when I "Break" a node does the recovery fail.
>
> Could this be a CentOS version issue?  I'm using CentOS 5.3.
>
> Let me try to clean up all my resources and try again.  I'll also be installing a CentOS 5.4 to see if that matters.
>
> Would you by chance have an example cluster.conf file that has Postgres and DRBD?
>
> Thanks!
>

We are probably going off topic; apologies to the list...
I have not managed PostgreSQL in an HA environment yet,
but I think I was partially wrong in my previous post.

Important: Always return "0" if the status is non-fatal.

So both cases, where you get 1 or 3, are fatal... and my previous
answer was not correct.

I think you should read
http://sources.redhat.com/cluster/wiki/FAQ/RGManager

and in particular the sections regarding:
The rgmanager keeps stopping and restarting
mysql/named/ypserv/httpd/other script. Why?
and possibly
Can I have rgmanager behave differently based on the return code of my
init script?

One aim of RHEL 5 was to make all the init scripts LSB compliant (in
particular, this means that a stop action against a service that is
not running should always return 0).
See https://bugzilla.redhat.com/show_bug.cgi?id=151104

As they provide a custom OCF script for PostgreSQL, they probably
didn't fix the standard init script.
Here at home, where I'm sitting now, I don't have CentOS/RHEL, but I
have an F11 system and I still get this on it:
[root@tekkaman ~]# service postgresql stop
Stopping postgresql service:                               [  OK  ]
[root@tekkaman ~]# echo $?
0
[root@tekkaman ~]# service postgresql stop
Stopping postgresql service:                               [FAILED]
[root@tekkaman ~]# echo $?
1

When you manually relocate, you start in a situation where you have
the service running, so that the stop action succeeds and the same is
true for the start action on the other node.
When you kill postmaster, the next status check fails, and in that
case rgmanager is designed to:
1) stop the service (I think to clean things up after an improper
termination of the service, to release locks and, more importantly,
to protect against data corruption)
2) start the service (and then its dependencies, if successful) on
the other node

As the stop fails too, rgmanager gives up (I presume you end up with
your service in a FAILED state).
Its reasoning is: if I'm not able to cleanly stop the service, it is
probably better not to try to start it on the other node.
Data protection is the first priority...
and I agree with this!
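
To make the sequence above concrete, here is a toy shell model of it. This is only a sketch of the behaviour as I understand it, not rgmanager code: the function name recover_service and the messages are made up for illustration, and the stop/start commands are passed in so it can be run without a cluster.

```shell
#!/bin/sh
# Toy model of the recovery sequence described above.
# NOT rgmanager code; the function name and messages are hypothetical.
#   $1 = command that stops the service on the current node
#   $2 = command that starts the service on the other node
recover_service() {
    stop_cmd=$1
    start_cmd=$2
    # Step 1: the stop must succeed before anything else is attempted.
    if ! $stop_cmd; then
        echo "stop failed: service marked failed, no relocation"
        return 1
    fi
    # Step 2: only after a clean stop, start on the other node.
    if $start_cmd; then
        echo "service started on other node"
    else
        echo "start on other node failed"
        return 1
    fi
}

# Killed postmaster + init script whose stop returns 1: recovery gives up.
recover_service false true || true
# Stop that returns 0 even though nothing is running: relocation goes ahead.
recover_service true true
```

The first call mimics the case from this thread (stop returns 1 on a service that is already dead), the second what would happen with an LSB-compliant stop.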

So your choice is probably between:
1) manually modifying the standard postgresql init script
2) using the provided OCF resource postgres-8
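
For 1), a minimal sketch of the idea, written as a wrapper function rather than a patch to the real init script. It assumes that the init script's status action returns LSB-style codes (0 = running, 1/2/3 = dead or not running); I have not verified this for the CentOS 5.3 postgresql script. The names lsb_stop and SERVICE_CMD are hypothetical, introduced only for this example.

```shell
#!/bin/sh
# Sketch of an LSB-style "stop": stopping a service that is not running
# counts as success. SERVICE_CMD is a hypothetical override so the function
# can be tried without PostgreSQL; by default it goes through service(8).
: "${SERVICE_CMD:=service postgresql}"

lsb_stop() {
    rc=0
    # Ask the init script for the current state, keeping its exit code.
    $SERVICE_CMD status >/dev/null 2>&1 || rc=$?
    case $rc in
        0)     $SERVICE_CMD stop ;;  # running: pass the real stop result through
        1|2|3) return 0 ;;           # dead or not running: nothing to stop
        *)     return 1 ;;           # unknown state: report failure, do not hide it
    esac
}
```

A wrapper script used as the cluster script resource would call lsb_stop for the stop action and hand every other action straight to the original init script. A stop that really fails on a running service still returns non-zero, so rgmanager still refuses to relocate in that case.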

I warmly suggest 2).... did you try it?
HIH,
Gianluca