The environment has been recovered. I modified the Pacemaker stop-failure
action to run "echo c > /proc/sysrq-trigger", so that the system reboots
and generates a vmcore whenever the resource stop fails.
I am sure the reason is that the Oracle stop action stalls during the DRBD
resync. All the devices use the same replication link.
Here is the "foreach bt" backtrace of the blocked oracle process (PID 6870)
from the vmcore analysis:
PID: 6870 TASK: ffff8802c89b84c0 CPU: 14 COMMAND: "oracle"
#0 [ffff880281bd79c8] schedule at ffffffff8145a489
#1 [ffff880281bd7b10] do_get_write_access at ffffffffa02ae72d [jbd]
#2 [ffff880281bd7bd0] journal_get_write_access at ffffffffa02ae899 [jbd]
#3 [ffff880281bd7bf0] __ext3_journal_get_write_access at ffffffffa0327aec [ext3]
#4 [ffff880281bd7c20] ext3_reserve_inode_write at ffffffffa0317ef3 [ext3]
#5 [ffff880281bd7c50] ext3_mark_inode_dirty at ffffffffa0318f71 [ext3]
#6 [ffff880281bd7c90] ext3_dirty_inode at ffffffffa03190f7 [ext3]
#7 [ffff880281bd7cb0] __mark_inode_dirty at ffffffff8117e7e0
#8 [ffff880281bd7cf0] update_time at ffffffff81170c96
#9 [ffff880281bd7d20] touch_atime at ffffffff81170efb
#10 [ffff880281bd7d60] generic_file_aio_read at ffffffff810f9e22
#11 [ffff880281bd7e20] aio_rw_vect_retry at ffffffff81199bb4
#12 [ffff880281bd7e50] aio_run_iocb at ffffffff8119b6c2
#13 [ffff880281bd7e80] io_submit_one at ffffffff8119c1f0
#14 [ffff880281bd7ec0] do_io_submit at ffffffff8119c3d8
#15 [ffff880281bd7f80] system_call_fastpath at ffffffff81464592
RIP: 00007f38ad4c36f7 RSP: 00007fffc9ee77f0 RFLAGS: 00010206
RAX: 00000000000000d1 RBX: ffffffff81464592 RCX: 0000000152012960
RDX: 00007fffc9ee77c0 RSI: 0000000000000001 RDI: 00007f38af060000
RBP: 0000000152012960 R8: 00007fffc9ee77b0 R9: 00007fffc9ee7750
R10: 00007fffc9ee70d0 R11: 0000000000000206 R12: 00000001553e0f80
R13: 00007f38ac571c60 R14: 00007fffc9ee77c0 R15: 00007fffc9ee77e0
ORIG_RAX: 00000000000000d1 CS: 0033 SS: 002b
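(For reference, the backtrace above was taken from the vmcore with the crash
utility; the paths below are placeholders, not the exact ones on this system:)

    crash /usr/lib/debug/lib/modules/<kernel-version>/vmlinux /var/crash/<dump-dir>/vmcore
    crash> foreach bt        # stack of every task in the dump
    crash> bt 6870           # only the blocked oracle process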
2016-09-01 7:48 GMT+08:00 Igor Cicimov <icicimov at gmail.com>:
>
>
> On Thu, Sep 1, 2016 at 9:02 AM, Igor Cicimov
> <igorc at encompasscorporation.com> wrote:
>>
>> On 1 Sep 2016 1:16 am, "Mia Lueng" <xiaozunvlg at gmail.com> wrote:
>> >
>> > Yes, Oracle & DRBD are running under Pacemaker in primary/secondary
>> > mode. I stopped the Oracle resource while DRBD was resyncing, and
>> > Oracle hung.
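>> > For context, the stack is the usual DRBD master/slave + filesystem +
>> > Oracle setup in Pacemaker; roughly like the following crm snippet
>> > (resource names and parameters are illustrative, not copied from our CIB):
>> >
>> >     primitive p_drbd_r0 ocf:linbit:drbd \
>> >         params drbd_resource="r0" \
>> >         op monitor interval="29s" role="Master" \
>> >         op monitor interval="31s" role="Slave"
>> >     ms ms_drbd_r0 p_drbd_r0 \
>> >         meta master-max="1" master-node-max="1" clone-max="2" \
>> >         clone-node-max="1" notify="true"
>> >     primitive p_fs_oracle ocf:heartbeat:Filesystem \
>> >         params device="/dev/drbd0" directory="/oracle" fstype="ext3"
>> >     primitive p_oracle ocf:heartbeat:oracle params sid="ommbb"
>> >     group g_oracle p_fs_oracle p_oracle
>> >     colocation col_oracle_on_drbd inf: g_oracle ms_drbd_r0:Master
>> >     order ord_drbd_before_oracle inf: ms_drbd_r0:promote g_oracle:start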
>> >
>> > 2016-08-31 14:38 GMT+08:00 Igor Cicimov
>> > <igorc at encompasscorporation.com>:
>> > >
>> > >
>> > > On Wed, Aug 31, 2016 at 3:49 PM, Mia Lueng <xiaozunvlg at gmail.com>
>> > > wrote:
>> > >>
>> >> Hi:
>> >> I have a cluster with four DRBD devices. I found that stopping Oracle
>> >> times out while DRBD is in the resync state.
>> >> Oracle is blocked as follows:
>> > >>
>> >> oracle 6869 6844 0.0 0.0 71424 12616 ? S 16:28 00:00:00 pipe_wait /oracle/app/oracle/dbhome_1/bin/sqlplus @/tmp/ora_ommbb_shutdown.sql
>> >> oracle 6870 6869 0.0 0.1 4431856 26096 ? Ds 16:28 00:00:00 get_write_access oracleommbb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
>> > >>
>> > >>
>> >> DRBD state:
>> > >>
>> > >> 2016-08-30 16:33:32 Dump [/proc/drbd] ...
>> > >> =========================================
>> > >> version: 8.3.16 (api:88/proto:86-97)
>> > >> GIT-hash: bbf851ee755a878a495cfd93e1a76bf90dc79442 Makefile.in build
>> > >> by drbd at build 2012-06-07 16:03:04
>> >> 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> >>    ns:2777568 nr:0 dw:492604 dr:3305833 al:4761 bm:439 lo:31 pe:613 ua:0 ap:31 ep:1 wo:d oos:4144796
>> >>    [======>.............] sync'ed: 35.7% (4044/6280)M
>> >>    finish: 0:10:19 speed: 6,680 (3,664) K/sec
>> >> 1: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent B r-----
>> >>    ns:3709600 nr:0 dw:854764 dr:7632085 al:7689 bm:3401 lo:38 pe:3299 ua:38 ap:0 ep:1 wo:d oos:6204676
>> >>    [=======>............] sync'ed: 41.5% (6056/10340)M
>> >>    finish: 0:22:14 speed: 4,640 (10,016) K/sec
>> >> 2: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> >>    ns:3968883 nr:0 dw:127937 dr:5179641 al:190 bm:304 lo:1 pe:139 ua:0 ap:7 ep:1 wo:d oos:2124792
>> >>    [============>.......] sync'ed: 66.3% (2072/6144)M
>> >>    finish: 0:06:12 speed: 5,692 (6,668) K/sec
>> >> 3: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> >>    ns:89737 nr:0 dw:439073 dr:2235186 al:724 bm:35 lo:0 pe:45 ua:0 ap:7 ep:1 wo:d oos:8131104
>> >>    [>....................] sync'ed: 1.6% (7940/8064)M
>> >>    finish: 10:44:09 speed: 208 (204) K/sec (stalled)
>> > >>
>> >> Is this a known bug that has been fixed in a later version?
>> > >
>> > >
>> > > Maybe provide more details about the term "cluster" you are using.
>> > > Do you have DRBD under the control of a CRM like Pacemaker? If so,
>> > > are you perhaps running DRBD in dual-primary mode? And when does this
>> > > state happen, and under what conditions, i.e. after restarting a node, etc.?
>>
>> What OS is this on? Can you please paste the output of "crm status" (or
>> pcs if you are on RHEL 7) and "crm_mon -Qrf1"?
>>
>>
>>
>
> Another thing I forgot .... I find it odd that the sync for only one of the
> devices is stalled. Are they all using the same replication link? Any
> networking issues or network card errors you can see?