[DRBD-user] oracle on drbd failed

Lars Ellenberg lars.ellenberg at linbit.com
Tue Sep 4 10:48:42 CEST 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Sep 03, 2012 at 10:59:12PM +0800, Mia Lueng wrote:
> resource drbd0{
>         protocol A;

Does it reproduce with different protocols as well?

>         disk
>         {
>                 on-io-error pass_on;

Certainly not.
	on-io-error detach;
please.
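
For reference, a minimal sketch of that disk section (same other
options as yours, only the error policy changed):

	disk
	{
		on-io-error detach;
		no-disk-barrier;
		no-disk-flushes;
	}

With detach, DRBD drops a failing backing device and transparently
serves all IO from the peer, instead of passing the error up the stack
to Oracle.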

>                 no-disk-barrier;
>                 no-disk-flushes;
>         }

Did you verify that the config file (drbdadm dump),
and the kernel config (drbdsetup show) match?
On both nodes?
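
Something like this, on each node (resource name and minor number taken
from your config):

	drbdadm dump drbd0     # what the config file says
	drbdsetup 0 show       # what the kernel module is actually using

then compare the two outputs, and compare across nodes.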

Can you confirm that you are using 8.3.13 on both nodes?

>         syncer
>         {
>                 rate 100M;
>                 csums-alg md5;

In theory, you could have md5 collisions.
Does it reproduce without csums-alg?
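
To test that, comment out the csums-alg line on both nodes and tell
the kernel about the change (a sketch; resource name as above):

	drbdadm adjust drbd0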

>                 verify-alg md5;

Use a verify-alg != csums-alg (sha1, maybe),
change your testing method to stop services,
and run a verify after resync, while idle.
Then low-level compare the blocks that differ
(dd iflag=direct bs=.. skip=... count=... ... | xxd),
so we get an idea of what is actually different.
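
A concrete sketch of that compare (START and COUNT are placeholders;
take the real sector numbers from the "Out of sync: start=..., size=..."
lines in the kernel log, and run it against each node's backing device):

	dd if=/dev/vg_kvm3/drbd0 iflag=direct bs=512 skip=START count=COUNT 2>/dev/null | xxd > /tmp/a.hex
	# same on the peer, against /dev/vg_kvm4/drbd0, into /tmp/b.hex, then:
	diff /tmp/a.hex /tmp/b.hex

With internal meta-data the data area starts at offset 0 of the backing
device, so the sector numbers from the verify log can be used directly.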

> #                c-plan-ahead 20;
>                 c-fill-target 0;
> #               c-delay-target 30;
> #                c-max-rate 200M;
>  #               c-min-rate 4M;
>         }
> 
>         net
>         {
> #               on-congestion pull-ahead;
> #               congestion-fill 128M;

Sorry for the "Did you try turning it off and on again" question, but,
did you at any point force-primary something?
If so, the data was simply not consistent,
and thus higher level data consistency errors would be expected.
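
(By force-primary I mean something along the lines of

	drbdadm -- --overwrite-data-of-peer primary drbd0

which declares the local data authoritative, no matter what the peer has.)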

>                 ping-timeout 30;
>                 ping-int 30;
>                 data-integrity-alg crc32c;
>         }
> 
>         on "kvm3.hgccp" {
>                 device    /dev/drbd0;
>                 disk      /dev/vg_kvm3/drbd0;
>                 address   192.168.10.6:7700;
>                 meta-disk  internal;
> 
>         }
> 
>         on "kvm4.hgccp" {
>                 device    /dev/drbd0;
>                 disk      /dev/vg_kvm4/drbd0;
>                 address   192.168.10.7:7700;
>                 meta-disk  internal;
> 
>         }
> }
> 
> os: rhel 6.3   x86_64, kernel version is 2.6.32-220.el6.x86_64
> 
> We have not used the proxy yet. I am just testing this in a local LAN
> environment.  If the tests pass, we will deploy it in a WAN environment.

> 2012/9/3 Lars Ellenberg <lars.ellenberg at linbit.com>:
> > On Sun, Aug 26, 2012 at 11:10:44AM +0800, Mia Lueng wrote:
> >> Hi All:
> >>
> >> I built a cluster to protect an Oracle database. The Oracle DB files
> >> are stored on the DRBD (8.3.13) device using protocol A.  But sometimes
> >> Oracle cannot fail over when the primary node is down. Here are the
> >> testing steps


Once you get the failure, is that persistent,
or does it work again after the next resync?

Does a DRBD online verify, while idle, find differing blocks?
If so, low-level compare them as described above.
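
Roughly (resource name assumed, as above):

	drbdadm verify drbd0
	# wait until the verify finishes; progress shows in /proc/drbd
	dmesg | grep "Out of sync"   # start sector and size of each differing block

A disconnect/connect cycle will then resync whatever verify marked.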

> > Please show the drbd configuration
> > (drbdadm dump, or even better, drbdsetup 0 show)
> > and cat /proc/drbd.
> >
> > Also, what is the kernel version, distribution/platform?
> > What does the rest of the IO stack look like?
> >
> >> How can I avoid these errors and let Oracle fail over any time
> >> the primary node crashes?  Thanks.

"Works here".

> >> BTW: protocol A is needed because the cluster runs over a WAN and uses a proxy.
> >
> > Proxy, as in "drbd-proxy"?
> > Why not contact your LINBIT support channel, then?

^^ That would still be an option...

 ;)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


