[DRBD-user] pacemaker + corosync and postgresql

Lars Ellenberg lars.ellenberg at linbit.com
Mon Oct 14 17:43:00 CEST 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Oct 14, 2013 at 11:15:31AM -0300, Thomaz Luiz Santos wrote:
> > You should *not* start DRBD from the init script.
> >  # chkconfig drbd off
> 
> *** OK, removed it from boot startup.
> 
> 
> > You should *NOT* configure "no-disk-drain".
> > It is likely to corrupt your data.
> ** OK, removed "no-disk-drain" from postgresql.res:
> 
> # cat postgresql.res
> resource postgresql {
>   startup {
>     wfc-timeout 15;
>     degr-wfc-timeout 60;
>   }
> 
>   syncer {
>     rate 150M;
>     verify-alg md5;
>   }
> 
>   on ha-master {
>      device /dev/drbd0;
>      disk /dev/sdb1;
>      address 172.70.65.210:7788;
>      meta-disk internal;
>   }
> 
>   on ha-slave {
>      device /dev/drbd0;
>      disk /dev/sdb1;
>      address 172.70.65.220:7788;
>      meta-disk internal;
>  }
> 
> }
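
(Aside: the fencing constraints mentioned further down come from the
DRBD fence-peer handler. With resource level fencing in a Pacemaker
cluster, the resource typically also carries something like the
following; a sketch only, the script paths may differ on your
distribution:

   disk {
     fencing resource-only;
   }
   handlers {
     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
   }
)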
> 
> 
> > You should configure monitoring ops for DRBD.
> > One each for Master and Slave role, with different intervals.
> 
> ** How can I do that?
> 
> from:
> http://www.drbd.org/users-guide-9.0/s-pacemaker-crm-drbd-backed-service.html
> 
> crm(live)configure# primitive drbd_mysql ocf:linbit:drbd \
>                     params drbd_resource="mysql" \
>                     op monitor interval="29s" role="Master" \
>                     op monitor interval="31s" role="Slave"
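
Exactly like in that snippet: add the two "op monitor" lines (one per
role, with different intervals) to your drbd primitive. A minimal
sketch using the resource ids from your configuration below; note that
the guide uses ocf:linbit:drbd, not the deprecated ocf:heartbeat:drbd:

   crm configure edit drbd_postgresql
   # then make the primitive read:
   primitive drbd_postgresql ocf:linbit:drbd \
           params drbd_resource="postgresql" \
           op monitor interval="29s" role="Master" \
           op monitor interval="31s" role="Slave"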
> 
> 
> > You probably need to "crm resource cleanup ..." a bit.
> ** Redid the crm configuration:
> 
> crm(live)# configure
> crm(live)configure# show
> node ha-master
> node ha-slave
> primitive drbd_postgresql ocf:heartbeat:drbd \
>         params drbd_resource="postgresql" \
>         op monitor interval="29s" role="Master" \
>         op monitor interval="31s" role="Slave"
> primitive fs_postgresql ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/mnt" fstype="ext4"
> primitive postgresqld lsb:postgresql
> primitive vip_cluster ocf:heartbeat:IPaddr2 \
>         params ip="172.70.65.200" nic="eth0:1"
> group postgresql fs_postgresql vip_cluster postgresqld
> ms ms_drbd_postgresql drbd_postgresql \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> colocation postgresql_on_drbd inf: postgresql ms_drbd_postgresql:Master
> order postgresql_after_drbd inf: ms_drbd_postgresql:promote postgresql:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
> 
> 
> > You may need to manually remove fencing constraints (if DRBD finished
> > the resync when no pacemaker was running yet, it would not have been
> > able to remove it from its handler).
> 
> ** How do I do that?
> ** Only on ha-master?
> # drbdadm create-md postgresql
> # drbdadm up postgresql
> # drbdadm -- --overwrite-data-of-peer primary postgresql

Nope, not like that. Check the current state first:

   cat /proc/drbd
   crm configure show

Is DRBD UpToDate/UpToDate on both nodes, but are fencing constraints
still present? Then delete those fencing constraints:

   crm configure
       delete id-of-those-constraints...
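
For example: the crm-fence-peer.sh handler adds a location constraint
with an id starting with "drbd-fence-by-handler-". The exact id will
differ on your cluster, so take it from "crm configure show"; it looks
roughly like this:

   location drbd-fence-by-handler-... ms_drbd_postgresql \
           rule $role="Master" -inf: #uname ne ha-master

and is removed with:

   crm configure delete drbd-fence-by-handler-...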

> > You may need to *read the logs*.
> > The answer will be in there.
> 
> > You have:
> > Failed actions:
> >     drbd_postgresql:0_start_0 (node=ha-slave, call=14, rc=1,
> > status=complete): unknown error
> 
> > So start looking for that, and see what it complains about.
> 
> # cat /var/log/syslog | grep drbd_postgresql
> *** Syslog
> 
> Oct 14 11:10:08 ha-master pengine: [786]: debug: unpack_rsc_op: drbd_postgresql:1_last_failure_0 on ha-slave returned 1 (unknown error) instead of the expected value: 0 (ok)
> Oct 14 11:10:08 ha-master pengine: [786]: WARN: unpack_rsc_op: Processing failed op drbd_postgresql:1_last_failure_0 on ha-slave: unknown error (1)
> Oct 14 11:10:08 ha-master pengine: [786]: debug: unpack_rsc_op: drbd_postgresql:0_last_failure_0 on ha-master returned 1 (unknown error) instead of the expected value: 0 (ok)
> Oct 14 11:10:08 ha-master pengine: [786]: WARN: unpack_rsc_op: Processing failed op drbd_postgresql:0_last_failure_0 on ha-master: unknown error (1)
> Oct 14 11:10:08 ha-master pengine: [786]: info: clone_print:  Master/Slave Set: ms_drbd_postgresql [drbd_postgresql]
> Oct 14 11:10:08 ha-master pengine: [786]: info: short_print:      Stopped: [ drbd_postgresql:0 drbd_postgresql:1 ]
> Oct 14 11:10:08 ha-master pengine: [786]: info: get_failcount: ms_drbd_postgresql has failed INFINITY times on ha-slave
> Oct 14 11:10:08 ha-master pengine: [786]: WARN: common_apply_stickiness: Forcing ms_drbd_postgresql away from ha-slave after 1000000 failures (max=1000000)

Boring :-(
These logs just say that "there was a fatal error before,
and that's why I won't even try again".
And they will tell you this repeatedly, every time
the policy engine runs.

You need to "crm resource cleanup ...",
then watch the actual start attempt fail.

(Or look in the logs for the last real start attempt,
 and its failure)
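
Something like this, using the ms id from your configuration (a
sketch; adjust the resource id and log file to your setup):

   crm resource cleanup ms_drbd_postgresql

then, while pacemaker retries the start, on ha-slave:

   tail -f /var/log/syslog

and see what the drbd resource agent actually complains about.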


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
