[DRBD-user] oracle stop timeout while drbd resync

Thu Sep 1 11:34:53 CEST 2016

On Thu, Sep 01, 2016 at 10:32:25AM +0800, Mia Lueng wrote:
> The environment has been recovered. I modified the pacemaker stop fail
> action to "echo c >/proc/sysrq-trigger" so that the system will
> reboot and generate a vmcore when a resource stop fails.

Uhm. "Stop fail" was simply a timeout, I assume?
Why would a crash dump help,
if stuff takes longer than expected?

Database shutdown will cause "flushing" of whatever the database thinks
still needs to go to the logs, journals, table spaces... That may cause
a lot of random IO, and may take some time.

If DRBD is resynchronizing, that is basically "heavy streaming IO",
both on the IO backends and on the replication link.

Depending on the configuration of DRBD, and of the system overall,
that may only slightly delay application IO,
but it may well delay it seriously.

If all your DRBD resources are backed by the same spindle,
you should not have them resync concurrently, that would
cause thrashing, and might overtax your IO subsystem.
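
To check whether several devices really are resyncing at the same time,
something like the following rough Python sketch could be run against
/proc/drbd. The function name and the regex are made up for
illustration; they assume the 8.3-style status line quoted further down
("N: cs:SyncSource ro:..."). Serializing the resyncs themselves is a
DRBD configuration matter (as far as I recall, "after" in the syncer
section on 8.3, "resync-after" on 8.4), not something this script does.

```python
import re

# Per-device status line as printed in /proc/drbd, for example:
#  0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
# (pattern is an assumption based on the dump quoted in this thread)
DEVICE_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)")

# Connection states that mean a resync is actively running.
RESYNC_STATES = ("SyncSource", "SyncTarget")


def resyncing_minors(proc_drbd_text):
    """Return the minor numbers whose connection state is a resync state."""
    minors = []
    for line in proc_drbd_text.splitlines():
        match = DEVICE_LINE.match(line)
        if match and match.group(2) in RESYNC_STATES:
            minors.append(int(match.group(1)))
    return minors


# Usage on a DRBD node (hypothetical):
#   with open("/proc/drbd") as f:
#       print(resyncing_minors(f.read()))
```

If that returns more than one minor and the backing devices share
the same disks, the resyncs are competing with each other as well as
with the application.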

With DRBD 8.4 (currently we are at 8.4.8), you get more knobs for
tuning "priority" of application vs resync IO.  But even with your
(8.3.something, iirc?) version, you could tune your IO subsystem
and DRBD, and of course your pacemaker stop timeouts.
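
For a rough feel of how long application IO will have to compete with a
resync, you can divide the "oos" counter by the current speed. This is
only back-of-the-envelope arithmetic, assuming "oos" is in KiB and the
speed in K/sec as shown in /proc/drbd; the helper names are made up,
and the result will only approximately match the "finish:" value DRBD
prints, since DRBD computes its own estimate. It does not tell you what
the stop timeout should be, only how long the contention may last.

```python
def resync_eta_seconds(oos_kib, speed_kib_per_sec):
    """Estimate remaining resync time in seconds; None if the resync is stalled."""
    if speed_kib_per_sec <= 0:
        return None
    return oos_kib / speed_kib_per_sec


def format_hms(seconds):
    """Format a duration in seconds as h:mm:ss, truncating fractions."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Example with the numbers of device 0 from the dump quoted below:
#   format_hms(resync_eta_seconds(4144796, 6680))
```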

You may also want to mount "noatime".

My educated best guess:
problem with too short timeouts, and not enough oomph in
network and IO backend, as well as suboptimal system
and DRBD configuration.

    Lars

> 2016-09-01 7:48 GMT+08:00 Igor Cicimov <icicimov at gmail.com>:
> >
> >
> > On Thu, Sep 1, 2016 at 9:02 AM, Igor Cicimov
> > <igorc at encompasscorporation.com> wrote:
> >>
> >> On 1 Sep 2016 1:16 am, "Mia Lueng" <xiaozunvlg at gmail.com> wrote:
> >> >
> >> > Yes, Oracle & drbd is running under pacemaker just in
> >> > primary/secondary mode. I stopped the oracle resource during DRBD is
> >> > resyncing and the oracle hangup
> >> >
> >> > 2016-08-31 14:38 GMT+08:00 Igor Cicimov
> >> > <igorc at encompasscorporation.com>:
> >> > >
> >> > >
> >> > > On Wed, Aug 31, 2016 at 3:49 PM, Mia Lueng <xiaozunvlg at gmail.com>
> >> > > wrote:
> >> > >>
> >> > >> Hi:
> >> > >> I have a cluster with four drbd devices. I found oracle stopped
> >> > >> timeout while drbd is in resync state.
> >> > >> oracle is blocked like following:
> >> > >>
> >> > >> oracle    6869  6844  0.0  0.0 71424 12616 ?        S    16:28
> >> > >> 00:00:00 pipe_wait
> >> > >> /oracle/app/oracle/dbhome_1/bin/sqlplus
> >> > >> @/tmp/ora_ommbb_shutdown.sql
> >> > >> oracle    6870  6869  0.0  0.1 4431856 26096 ?       Ds   16:28
> >> > >> 00:00:00 get_write_access                 oracleommbb
> >> > >> (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
> >> > >>
> >> > >>
> >> > >> drbd state
> >> > >>
> >> > >> 2016-08-30 16:33:32 Dump [/proc/drbd] ...
> >> > >> =========================================
> >> > >> version: 8.3.16 (api:88/proto:86-97)
> >> > >> GIT-hash: bbf851ee755a878a495cfd93e1a76bf90dc79442 Makefile.in build
> >> > >> by drbd at build 2012-06-07 16:03:04
> >> > >> 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B
> >> > >> r-----
> >> > >>   ns:2777568 nr:0 dw:492604 dr:3305833 al:4761 bm:439 lo:31 pe:613
> >> > >> ua:0 ap:31 ep:1 wo:d oos:4144796
> >> > >>                [======>.............] sync'ed: 35.7% (4044/6280)M
> >> > >>                finish: 0:10:19 speed: 6,680 (3,664) K/sec
> >> > >> 1: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent B
> >> > >> r-----
> >> > >>   ns:3709600 nr:0 dw:854764 dr:7632085 al:7689 bm:3401 lo:38 pe:3299
> >> > >> ua:38 ap:0 ep:1 wo:d oos:6204676
> >> > >>                [=======>............] sync'ed: 41.5% (6056/10340)M
> >> > >>                finish: 0:22:14 speed: 4,640 (10,016) K/sec
> >> > >> 2: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B
> >> > >> r-----
> >> > >>   ns:3968883 nr:0 dw:127937 dr:5179641 al:190 bm:304 lo:1 pe:139 ua:0
> >> > >> ap:7 ep:1 wo:d oos:2124792
> >> > >>                [============>.......] sync'ed: 66.3% (2072/6144)M
> >> > >>                finish: 0:06:12 speed: 5,692 (6,668) K/sec
> >> > >> 3: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B
> >> > >> r-----
> >> > >>   ns:89737 nr:0 dw:439073 dr:2235186 al:724 bm:35 lo:0 pe:45 ua:0
> >> > >> ap:7
> >> > >> ep:1 wo:d oos:8131104
> >> > >>                [>....................] sync'ed:  1.6% (7940/8064)M
> >> > >>                finish: 10:44:09 speed: 208 (204) K/sec (stalled)
> >> > >>
> >> > >> Is this a known bug and fixed in the further version?

> >> > >
> >> > > Maybe provide more details about the term "cluster" you are using. Do
> >> > > you
> >> > > have DRBD under control of crm like Pacemaker? If so are you running
> >> > > DRBD in
> >> > > dual primary mode maybe? And when does this state happen and under
> >> > > what
> >> > > conditions i.e restarted a node etc.
> >>
> >> What os is this on? Can you please paste the output of "crm status" (or
> >> pcs if you are on rhel7) and "crm_mon -Qrf1"
> >>
> >
> > Another thing I forgot .... I find it odd that the sync for only one of the
> > devices is stalled. Are they all using the same replication link? Any
> > networking issues or network card errors you can see?

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed