[DRBD-user] DRBD + RHCS - Failover not working

Wed Dec 9 21:03:31 CET 2009

Hi Gianluca,

You have described exactly what I'm doing to test this: I kill the postmaster process.  I've tried both a script resource and a postgres-8 resource, but neither seems to work.  What's odd is that I can relocate the service to either node without issue.  It's only when I "break" a node that the recovery fails.

Could this be a CentOS version issue?  I'm using CentOS 5.3.

Let me try to clean up all my resources and try again.  I'll also install CentOS 5.4 to see if that matters.

Would you by chance have an example cluster.conf file that has Postgres and DRBD?

Thanks!

On Dec 9, 2009, at 4:46 AM, Gianluca Cecchi wrote:

> On Wed, Dec 9, 2009 at 2:02 AM, James Perry <jperry at mezeo.com> wrote:
> [snip]
>> Here's the error I'm getting... It appears that DRBD is failing but I can't tell why.
>> 
>> Dec  7 14:34:49 rhcsnode1 clurgmgrd[8024]: <notice> Service service:mezeo_ha_db started
>> Dec  7 14:36:36 rhcsnode1 clurgmgrd: [8024]: <err> script:pgsql_svc: status of /etc/rc.d/init.d/postgresql failed (returned 1)
>> Dec  7 14:36:36 rhcsnode1 clurgmgrd[8024]: <notice> status on script "pgsql_svc" returned 1 (generic error)
>> Dec  7 14:36:36 rhcsnode1 clurgmgrd[8024]: <notice> Stopping service service:mezeo_ha_db
>> Dec  7 14:36:37 rhcsnode1 clurgmgrd: [8024]: <err> script:pgsql_svc: stop of /etc/rc.d/init.d/postgresql failed (returned 1)
>> Dec  7 14:36:37 rhcsnode1 clurgmgrd[8024]: <notice> stop on script "pgsql_svc" returned 1 (generic error)
> 
> From what you posted, one can only deduce that your
> /etc/rc.d/init.d/postgresql script is perhaps not conforming to what is
> expected.
> In fact clurgmgrd is not able to evaluate the result of postgresql status:
> script:pgsql_svc: status of /etc/rc.d/init.d/postgresql failed (returned 1)
> 
> Does this depend on your killing the postmaster process or something
> similar? I don't think so...
> On a test server with CentOS 5.4 and a clean postgresql-server
> installed, even if I do a kill -9 of the postmaster pid, so that I
> have the file /var/run/postmaster.5432.pid without the process itself,
> running service postgresql status gives:
> [root@c54vm1 ~]# service postgresql status
> postmaster is stopped
> [root@c54vm1 ~]# echo $?
> 3
> 
> (see also /etc/rc.d/init.d/functions)
> 
> This should be returned to rhcs when a service is not running, AFAIK.
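
For reference, a minimal shell sketch of how the exit code of an init script's status action can be inspected, in the spirit of the `echo $?` check above. The code meanings follow the LSB init-script convention (0 running, 3 not running); the helper names describe_status and probe_status are made up for illustration and are not part of RHCS.

```shell
#!/bin/sh
# Map an LSB init-script "status" exit code to a short description.
# LSB convention: 0 = running, 1 = dead but pid file exists,
# 2 = dead but lock file exists, 3 = not running, 4 = status unknown.
describe_status() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, pid file exists" ;;
        2) echo "dead, lock file exists" ;;
        3) echo "stopped" ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Run "<script> status" quietly and print the exit code a caller would see.
# $1 is the init script (or any command that accepts "status").
probe_status() {
    "$1" status >/dev/null 2>&1
    rc=$?
    echo "status rc=$rc: $(describe_status "$rc")"
    return "$rc"
}
```

For example, `probe_status /etc/rc.d/init.d/postgresql` on the node in question would show which code the script hands back to clurgmgrd.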
> 
> So, coming back to your system, clurgmgrd decides to stop the service,
> because it is not able to evaluate it (again giving an error ...):
> script:pgsql_svc: stop of /etc/rc.d/init.d/postgresql failed (returned 1)
> 
> Note also these:
> The following rules apply to parent/child relationships in a resource tree:
> • Parents are started before children.
> • Children must all stop cleanly before a parent may be stopped.
> • For a resource to be considered in good health, all its children
> must be in good health.
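
The second rule is worth spelling out, since the log shows the stop of pgsql_svc failing. A rough shell sketch of the ordering, assuming each resource is a command that accepts a stop argument; stop_chain is a hypothetical helper and not how rgmanager is actually implemented:

```shell
#!/bin/sh
# Stop a chain of resources, children first.
# Arguments are given parent-first (e.g. drbd, then fs, then postgresql);
# each argument is a command name that accepts "stop".
# If any child fails to stop, abort and leave its parents untouched.
stop_chain() {
    reversed=""
    for res in "$@"; do
        reversed="$res $reversed"      # build the child-first order
    done
    for res in $reversed; do
        if ! "$res" stop; then
            echo "stop of $res failed; parents left running" >&2
            return 1
        fi
    done
    return 0
}
```

Under this model, a database stop that returns 1 means the filesystem and the DRBD resource above it are never stopped, so the service cannot be handed over cleanly to the other node.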
> 
> HIH,
> Gianluca
> 
> PS: you have the default resource provided by rhcs for postgresql in
> the resource section, but you are using the standard postgresql init
> script in the service section as an external script... any reason?

James Perry
Principal Consultant
Mezeo Software
t: 713.244.0859
f: 713.244.0851
m: 713.444.0251