Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
The environment has been recovered. I changed the pacemaker stop-failure action to "echo c >/proc/sysrq-trigger", so the node reboots and generates a vmcore whenever a resource stop fails. I am now sure that the oracle stop action stalls during DRBD resync. All the devices use the same replication link. Here is the "foreach bt" output for the oracle process from the vmcore analysis:

PID: 6870   TASK: ffff8802c89b84c0   CPU: 14   COMMAND: "oracle"
 #0 [ffff880281bd79c8] schedule at ffffffff8145a489
 #1 [ffff880281bd7b10] do_get_write_access at ffffffffa02ae72d [jbd]
 #2 [ffff880281bd7bd0] journal_get_write_access at ffffffffa02ae899 [jbd]
 #3 [ffff880281bd7bf0] __ext3_journal_get_write_access at ffffffffa0327aec [ext3]
 #4 [ffff880281bd7c20] ext3_reserve_inode_write at ffffffffa0317ef3 [ext3]
 #5 [ffff880281bd7c50] ext3_mark_inode_dirty at ffffffffa0318f71 [ext3]
 #6 [ffff880281bd7c90] ext3_dirty_inode at ffffffffa03190f7 [ext3]
 #7 [ffff880281bd7cb0] __mark_inode_dirty at ffffffff8117e7e0
 #8 [ffff880281bd7cf0] update_time at ffffffff81170c96
 #9 [ffff880281bd7d20] touch_atime at ffffffff81170efb
#10 [ffff880281bd7d60] generic_file_aio_read at ffffffff810f9e22
#11 [ffff880281bd7e20] aio_rw_vect_retry at ffffffff81199bb4
#12 [ffff880281bd7e50] aio_run_iocb at ffffffff8119b6c2
#13 [ffff880281bd7e80] io_submit_one at ffffffff8119c1f0
#14 [ffff880281bd7ec0] do_io_submit at ffffffff8119c3d8
#15 [ffff880281bd7f80] system_call_fastpath at ffffffff81464592
    RIP: 00007f38ad4c36f7  RSP: 00007fffc9ee77f0  RFLAGS: 00010206
    RAX: 00000000000000d1  RBX: ffffffff81464592  RCX: 0000000152012960
    RDX: 00007fffc9ee77c0  RSI: 0000000000000001  RDI: 00007f38af060000
    RBP: 0000000152012960   R8: 00007fffc9ee77b0   R9: 00007fffc9ee7750
    R10: 00007fffc9ee70d0  R11: 0000000000000206  R12: 00000001553e0f80
    R13: 00007f38ac571c60  R14: 00007fffc9ee77c0  R15: 00007fffc9ee77e0
    ORIG_RAX: 00000000000000d1  CS: 0033  SS: 002b

So the process is blocked in jbd's do_get_write_access: a read triggered an atime update, which needs the ext3 journal on the filesystem that sits on the resyncing DRBD device.
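For reference, this is roughly how such a vmcore can be captured and inspected; a minimal sketch, assuming kdump is already configured on the node, and the vmlinux/vmcore paths below are only examples:

# stop-failure hook: force a kernel panic so kdump writes a vmcore
echo c > /proc/sysrq-trigger

# after the node comes back up, open the dump with crash(8)
# (needs the matching kernel-debuginfo; paths are examples)
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore

# inside crash, print the stack of every task named "oracle"
crash> foreach oracle bt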
2016-09-01 7:48 GMT+08:00 Igor Cicimov <icicimov at gmail.com>:
>
> On Thu, Sep 1, 2016 at 9:02 AM, Igor Cicimov
> <igorc at encompasscorporation.com> wrote:
>>
>> On 1 Sep 2016 1:16 am, "Mia Lueng" <xiaozunvlg at gmail.com> wrote:
>> >
>> > Yes, Oracle and DRBD are running under Pacemaker, just in
>> > primary/secondary mode. I stopped the oracle resource while DRBD was
>> > resyncing, and oracle hung.
>> >
>> > 2016-08-31 14:38 GMT+08:00 Igor Cicimov
>> > <igorc at encompasscorporation.com>:
>> > >
>> > > On Wed, Aug 31, 2016 at 3:49 PM, Mia Lueng <xiaozunvlg at gmail.com>
>> > > wrote:
>> > >>
>> > >> Hi:
>> > >> I have a cluster with four DRBD devices. I found that the oracle
>> > >> stop timed out while DRBD was in resync state.
>> > >> oracle is blocked like the following:
>> > >>
>> > >> oracle    6869  6844  0.0  0.0   71424 12616 ?  S   16:28  00:00:00
>> > >>   pipe_wait  /oracle/app/oracle/dbhome_1/bin/sqlplus @/tmp/ora_ommbb_shutdown.sql
>> > >> oracle    6870  6869  0.0  0.1 4431856 26096 ?  Ds  16:28  00:00:00
>> > >>   get_write_access  oracleommbb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
>> > >>
>> > >> drbd state:
>> > >>
>> > >> 2016-08-30 16:33:32 Dump [/proc/drbd] ...
>> > >> =========================================
>> > >> version: 8.3.16 (api:88/proto:86-97)
>> > >> GIT-hash: bbf851ee755a878a495cfd93e1a76bf90dc79442 Makefile.in
>> > >>   build by drbd at build 2012-06-07 16:03:04
>> > >>  0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:2777568 nr:0 dw:492604 dr:3305833 al:4761 bm:439 lo:31 pe:613
>> > >>     ua:0 ap:31 ep:1 wo:d oos:4144796
>> > >>     [======>.............] sync'ed: 35.7% (4044/6280)M
>> > >>     finish: 0:10:19 speed: 6,680 (3,664) K/sec
>> > >>  1: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:3709600 nr:0 dw:854764 dr:7632085 al:7689 bm:3401 lo:38 pe:3299
>> > >>     ua:38 ap:0 ep:1 wo:d oos:6204676
>> > >>     [=======>............] sync'ed: 41.5% (6056/10340)M
>> > >>     finish: 0:22:14 speed: 4,640 (10,016) K/sec
>> > >>  2: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:3968883 nr:0 dw:127937 dr:5179641 al:190 bm:304 lo:1 pe:139
>> > >>     ua:0 ap:7 ep:1 wo:d oos:2124792
>> > >>     [============>.......] sync'ed: 66.3% (2072/6144)M
>> > >>     finish: 0:06:12 speed: 5,692 (6,668) K/sec
>> > >>  3: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent B r-----
>> > >>     ns:89737 nr:0 dw:439073 dr:2235186 al:724 bm:35 lo:0 pe:45
>> > >>     ua:0 ap:7 ep:1 wo:d oos:8131104
>> > >>     [>....................] sync'ed:  1.6% (7940/8064)M
>> > >>     finish: 10:44:09 speed: 208 (204) K/sec (stalled)
>> > >>
>> > >> Is this a known bug, and is it fixed in a later version?
>> > >
>> > > Maybe provide more details about the term "cluster" you are using.
>> > > Do you have DRBD under the control of a CRM like Pacemaker? If so,
>> > > are you running DRBD in dual-primary mode? And when does this state
>> > > happen, and under what conditions, i.e. restarted a node etc.?
>>
>> What OS is this on? Can you please paste the output of "crm status" (or
>> pcs if you are on RHEL 7) and "crm_mon -Qrf1"?
>
> Another thing I forgot ... I find it odd that the sync for only one of the
> devices is stalled. Are they all using the same replication link? Any
> networking issues or network card errors you can see?
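For what it is worth, a minimal sketch of how the replication link and the resync could be checked from the SyncSource node; this assumes DRBD 8.3 as shown above, and the interface, device and resource names (eth1, /dev/drbd0, r0) are only examples:

# look for errors or drops on the replication interface
ip -s link show eth1
ethtool -S eth1 | grep -iE 'err|drop'

# watch the resync progress of all four devices
watch -n5 cat /proc/drbd

# temporarily cap the resync rate on a device so resync traffic does not
# starve application/journal I/O on the shared link (DRBD 8.3 syntax)
drbdsetup /dev/drbd0 syncer -r 5M

# revert to the rate configured in drbd.conf once the resync is done
drbdadm adjust r0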