Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
The environment has been recovered. I changed the pacemaker stop-failure action to "echo c >/proc/sysrq-trigger", so the node reboots and generates a vmcore whenever a resource stop fails. I am now sure that the oracle stop action stalls during DRBD resync. All the devices use the same replication link. Here is the "foreach bt" output for the oracle process from the vmcore analysis:

PID: 6870   TASK: ffff8802c89b84c0   CPU: 14   COMMAND: "oracle"
 #0 [ffff880281bd79c8] schedule at ffffffff8145a489
 #1 [ffff880281bd7b10] do_get_write_access at ffffffffa02ae72d [jbd]
 #2 [ffff880281bd7bd0] journal_get_write_access at ffffffffa02ae899 [jbd]
 #3 [ffff880281bd7bf0] __ext3_journal_get_write_access at ffffffffa0327aec [ext3]
 #4 [ffff880281bd7c20] ext3_reserve_inode_write at ffffffffa0317ef3 [ext3]
 #5 [ffff880281bd7c50] ext3_mark_inode_dirty at ffffffffa0318f71 [ext3]
 #6 [ffff880281bd7c90] ext3_dirty_inode at ffffffffa03190f7 [ext3]
 #7 [ffff880281bd7cb0] __mark_inode_dirty at ffffffff8117e7e0
 #8 [ffff880281bd7cf0] update_time at ffffffff81170c96
 #9 [ffff880281bd7d20] touch_atime at ffffffff81170efb
#10 [ffff880281bd7d60] generic_file_aio_read at ffffffff810f9e22
#11 [ffff880281bd7e20] aio_rw_vect_retry at ffffffff81199bb4
#12 [ffff880281bd7e50] aio_run_iocb at ffffffff8119b6c2
#13 [ffff880281bd7e80] io_submit_one at ffffffff8119c1f0
#14 [ffff880281bd7ec0] do_io_submit at ffffffff8119c3d8
#15 [ffff880281bd7f80] system_call_fastpath at ffffffff81464592
    RIP: 00007f38ad4c36f7  RSP: 00007fffc9ee77f0  RFLAGS: 00010206
    RAX: 00000000000000d1  RBX: ffffffff81464592  RCX: 0000000152012960
    RDX: 00007fffc9ee77c0  RSI: 0000000000000001  RDI: 00007f38af060000
    RBP: 0000000152012960   R8: 00007fffc9ee77b0   R9: 00007fffc9ee7750
    R10: 00007fffc9ee70d0  R11: 0000000000000206  R12: 00000001553e0f80
    R13: 00007f38ac571c60  R14: 00007fffc9ee77c0  R15: 00007fffc9ee77e0
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b

So the process is blocked in jbd's do_get_write_access: a read triggered an atime update, which needs the ext3 journal on the filesystem that sits on the resyncing DRBD device.
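For reference, this is roughly how such a vmcore can be captured and inspected; a minimal sketch, assuming kdump is already configured on the node, and the vmlinux/vmcore paths below are only examples:

# stop-failure hook: force a kernel panic so kdump writes a vmcore
echo c > /proc/sysrq-trigger

# after the node comes back up, open the dump with crash(8)
# (needs the matching kernel-debuginfo; paths are examples)
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore

# inside crash, print the stack of every task named "oracle"
crash> foreach oracle bt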
2016-09-01 7:48 GMT+08:00 Igor Cicimov <icicimov at gmail.com>:
>
> On Thu, Sep 1, 2016 at 9:02 AM, Igor Cicimov
> <igorc at encompasscorporation.com> wrote:
>>
>> On 1 Sep 2016 1:16 am, "Mia Lueng" <xiaozunvlg at gmail.com> wrote:
>> >
>> > Yes, Oracle and DRBD are running under Pacemaker, just in
>> > primary/secondary mode. I stopped the oracle resource while DRBD was
>> > resyncing, and oracle hung.
>> >
>> > 2016-08-31 14:38 GMT+08:00 Igor Cicimov
>> > <igorc at encompasscorporation.com>:
>> > >
>> > > On Wed, Aug 31, 2016 at 3:49 PM, Mia Lueng <xiaozunvlg at gmail.com>
>> > > wrote:
>> > >>
>> > >> Hi:
>> > >> I have a cluster with four DRBD devices. I found that the oracle
>> > >> stop timed out while DRBD was in resync state.
>> > >> oracle is blocked like the following:
>> > >>
>> > >> oracle    6869  6844  0.0  0.0   71424 12616 ?  S   16:28  00:00:00
>> > >>   pipe_wait  /oracle/app/oracle/dbhome_1/bin/sqlplus @/tmp/ora_ommbb_shutdown.sql
>> > >> oracle    6870  6869  0.0  0.1 4431856 26096 ?  Ds  16:28  00:00:00
>> > >>   get_write_access  oracleommbb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
>> > >>
>> > >> drbd state:
>> > >>
>> > >> 2016-08-30 16:33:32 Dump [/proc/drbd] ...
>> > >> =========================================
>> > >> version: 8.3.16 (api:88/proto:86-97)
>> > >> GIT-hash: bbf851ee755a878a495cfd93e1a76bf90dc79442 Makefile.in
>> > >>   build by drbd at build 2012-06-07 16:03:04
>> > >>  0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:2777568 nr:0 dw:492604 dr:3305833 al:4761 bm:439 lo:31 pe:613
>> > >>     ua:0 ap:31 ep:1 wo:d oos:4144796
>> > >>     [======>.............] sync'ed: 35.7% (4044/6280)M
>> > >>     finish: 0:10:19 speed: 6,680 (3,664) K/sec
>> > >>  1: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:3709600 nr:0 dw:854764 dr:7632085 al:7689 bm:3401 lo:38 pe:3299
>> > >>     ua:38 ap:0 ep:1 wo:d oos:6204676
>> > >>     [=======>............] sync'ed: 41.5% (6056/10340)M
>> > >>     finish: 0:22:14 speed: 4,640 (10,016) K/sec
>> > >>  2: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:3968883 nr:0 dw:127937 dr:5179641 al:190 bm:304 lo:1 pe:139
>> > >>     ua:0 ap:7 ep:1 wo:d oos:2124792
>> > >>     [============>.......] sync'ed: 66.3% (2072/6144)M
>> > >>     finish: 0:06:12 speed: 5,692 (6,668) K/sec
>> > >>  3: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:89737 nr:0 dw:439073 dr:2235186 al:724 bm:35 lo:0 pe:45
>> > >>     ua:0 ap:7 ep:1 wo:d oos:8131104
>> > >>     [>....................] sync'ed:  1.6% (7940/8064)M
>> > >>     finish: 10:44:09 speed: 208 (204) K/sec (stalled)
>> > >>
>> > >> Is this a known bug, and is it fixed in a later version?
>> > >
>> > > Maybe provide more details about the term "cluster" you are using.
>> > > Do you have DRBD under the control of a CRM like Pacemaker? If so,
>> > > are you running DRBD in dual-primary mode? And when does this state
>> > > happen, and under what conditions, i.e. restarted a node etc.?
>>
>> What OS is this on? Can you please paste the output of "crm status" (or
>> pcs if you are on RHEL 7) and "crm_mon -Qrf1"?
>
> Another thing I forgot ... I find it odd that the sync for only one of the
> devices is stalled. Are they all using the same replication link? Any
> networking issues or network card errors you can see?
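For what it is worth, a minimal sketch of how the replication link and the resync could be checked from the SyncSource node; this assumes DRBD 8.3 as shown above, and the interface, device and resource names (eth1, /dev/drbd0, r0) are only examples:

# look for errors or drops on the replication interface
ip -s link show eth1
ethtool -S eth1 | grep -iE 'err|drop'

# watch the resync progress of all four devices
watch -n5 cat /proc/drbd

# temporarily cap the resync rate on a device so resync traffic does not
# starve application/journal I/O on the shared link (DRBD 8.3 syntax)
drbdsetup /dev/drbd0 syncer -r 5M

# revert to the rate configured in drbd.conf once the resync is done
drbdadm adjust r0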