[DRBD-user] A problem about oracle on drbd

Wed Jul 4 03:20:23 CEST 2012

I should give some additional information.

We built a two-node cluster with DRBD. The configuration is as follows.

resource drbd1 {
    protocol               A;
    on host41 {
        device           /dev/drbd0 minor 0;
        disk             /dev/vgdrbd/oracle;
        address          ipv4 192.168.4.41:7780;
        meta-disk        internal;
    }
    on host42 {
        device           /dev/drbd0 minor 0;
        disk             /dev/vgdrbd/oracle;
        address          ipv4 192.168.4.42:7780;
        meta-disk        internal;
    }
    net {
        ping-timeout      10;
    }
    disk {
        on-io-error      pass_on;
    }
    syncer {
        rate             1000M;
        csums-alg        md5;
        verify-alg       crc32c;
    }
}

/dev/vgdrbd/oracle is built on an external IBM DS3512 SAS storage array,
connected to the host using RDAC multipath software.

We found that Oracle cannot be started after the following sequence of
operations:

1. Start Oracle on host41. Oracle runs fine once started.
2. Confirm that the state of drbd1 is Connected, Primary/Secondary,
UpToDate/UpToDate.
3. On host42, run:
> drbdadm disconnect drbd1; drbdadm primary drbd1
> mount /dev/drbd /oradata
> start_oracle.sh
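
The state check in step 2 can be scripted rather than read by eye. The following is only a minimal sketch in Python, assuming the DRBD 8.3 style device line in /proc/drbd (cs:, ro: and ds: fields) as seen from the primary, host41. The function name matches_step2_state is hypothetical and not part of any DRBD tool.

```python
import re

# Matches the status fields of one device line in /proc/drbd (DRBD 8.3 style),
# e.g. " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----"
STATUS_RE = re.compile(r"cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)")


def matches_step2_state(status_line):
    """Return True only if the line shows the state described in step 2:
    connected, local Primary / peer Secondary, both disks UpToDate."""
    m = STATUS_RE.search(status_line)
    if m is None:
        # Not a device status line at all.
        return False
    return (
        m.group("cs") == "Connected"
        and m.group("ro") == "Primary/Secondary"
        and m.group("ds") == "UpToDate/UpToDate"
    )


if __name__ == "__main__":
    # Illustrative line only, not captured from the hosts in this report.
    sample = " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----"
    print(matches_step2_state(sample))
```

Note that this only confirms what DRBD reports at that moment. It says nothing about writes that are still in flight when the disconnect in step 3 is issued.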

Oracle then fails to start on host42 and reports the following error:

ORA-00600:
internal error code, arguments: [kcratr_nab_less_than_odr], [1],
[162], [678757], [683523], [], [], [], [], [], [], []

There is only ONE DRBD device for the Oracle database files. Enabling
Oracle archivelog does not eliminate this error, but changing the
protocol to B does.

We then enabled drbd_trace and found that all sectors had been synced
completely before the disconnect. Here is the log:

Jul  3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037:
drbd1_worker [13498] data >>> Data (sector 74104s, id
ffff88067728ce48, seq 19051, f 2)
Jul  3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037:
drbd1_worker [13498] data >>> Data (sector 285664s, id
ffff8805c3e75b38, seq 19052, f 2)
Jul  3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128:
drbd1_worker [13498] data >>> UnplugRemote (7)
Jul  3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128:
drbd1_worker [13498] data >>> Barrier (barrier 957522216)
Jul  3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973:
drbd1_asender [13528] meta <<< BarrierAck (barrier 957522214)
Jul  3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973:
drbd1_asender [13528] meta <<< BarrierAck (barrier 957522215)
Jul  3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973:
drbd1_asender [13528] meta <<< BarrierAck (barrier 957522216)
Jul  3 13:55:17 localhost kernel: block drbd1: peer( Secondary ->
Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Jul  3 13:55:17 localhost kernel: block drbd1: drbd_main.c:2128:
drbd1_receiver [13514] meta >>> StateChgReply (ret 1)
Jul  3 13:55:17 localhost kernel: block drbd1: new current UUID
E2418467BFAB2889:827E3E61AA89BBE1:071CA1C9E3FEC914:071BA1C9E3FEC914
Jul  3 13:55:17 localhost kernel: block drbd1: drbd_receiver.c:4031:
drbd1_receiver [13514] data <<< StateChgRequest (m 1f0 v 10 { conn(
Disconnecting )})

Jul  3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031:
drbd1_receiver [18869] data <<< Data (sector 74104s, id
ffff88067728ce48, seq 19051, f 2)
Jul  3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031:
drbd1_receiver [18869] data <<< Data (sector 285664s, id
ffff8805c3e75b38, seq 19052, f 2)
Jul  3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031:
drbd1_receiver [18869] data <<< UnplugRemote (7)
Jul  3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031:
drbd1_receiver [18869] data <<< Barrier (barrier 957522216)
Jul  3 13:55:08 localhost kernel: block drbd1: drbd_main.c:2128:
drbd1_worker [3248] meta >>> BarrierAck (barrier 957522216)
Jul  3 13:55:08 localhost kernel: block drbd1: drbd_main.c:2128:
cqueue [3216] data >>> StateChgRequest (m 1f0 v 10 { conn(
Disconnecting )})
Jul  3 13:55:09 localhost kernel: block drbd1: drbd_receiver.c:4973:
drbd1_asender [19247] meta <<< StateChgReply (ret 1)
Jul  3 13:55:09 localhost kernel: block drbd1: meta connection shut
down by peer.
Jul  3 13:55:09 localhost kernel: block drbd1: peer( Primary ->
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate ->
DUnknown )
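
A trace like the one above can also be checked mechanically for barriers that were sent but never acknowledged. The following is a rough Python sketch that relies only on the "Barrier (barrier N)" and "BarrierAck (barrier N)" packet text visible in the trace. The function name unacked_barriers is hypothetical. It only matches packets in the log text and makes no claim about whether the data reached stable storage on the peer.

```python
import re

# Packet text as it appears in the drbd_trace output quoted above.
SENT_RE = re.compile(r">>> Barrier \(barrier (\d+)\)")
ACKED_RE = re.compile(r"<<< BarrierAck \(barrier (\d+)\)")


def unacked_barriers(log_text):
    """Return the set of barrier numbers sent but not acknowledged in log_text."""
    # The mail client wrapped each kernel message, so collapse all whitespace
    # to let the patterns match across the original line breaks.
    flat = " ".join(log_text.split())
    sent = {int(n) for n in SENT_RE.findall(flat)}
    acked = {int(n) for n in ACKED_RE.findall(flat)}
    return sent - acked


if __name__ == "__main__":
    # Two messages copied from the sending side of the trace above.
    sample = """
Jul  3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128:
drbd1_worker [13498] data >>> Barrier (barrier 957522216)
Jul  3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973:
drbd1_asender [13528] meta <<< BarrierAck (barrier 957522216)
"""
    print(unacked_barriers(sample))
```

An empty result for the sending side's log is consistent with the observation that the last barrier (957522216) was acknowledged before the disconnect.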

Additional information from the Oracle alert.log:

Dump file /oracle/app/oracle/diag/rdbms/ems/ems/incident/incdir_136177/ems_ora_23990_i136177.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/dbhome_1
System name:	Linux
Node name:	host41
Release:	2.6.32-220.el6.x86_64
Version:	#1 SMP Thu May 31 22:47:14 EDT 2012
Machine:	x86_64
Instance name: ems
Redo thread mounted by this instance: 1
Oracle process number: 22
Unix process pid: 23990, image: oracle at host41 (TNS V1-V3)

*** 2012-06-29 09:50:12.161
*** SESSION ID:(727.3) 2012-06-29 09:50:12.161
*** CLIENT ID:() 2012-06-29 09:50:12.161
*** SERVICE NAME:() 2012-06-29 09:50:12.161
*** MODULE NAME:(sqlplus at cgsl-160 (TNS V1-V3)) 2012-06-29 09:50:12.161
*** ACTION NAME:() 2012-06-29 09:50:12.161

Dump continued from file:
/oracle/app/oracle/diag/rdbms/ems/ems/trace/ems_ora_23990.trc
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr],
[1], [162], [678757], [683523], [], [], [], [], [], [], []

========= Dump for incident 136177 (ORA 600 [kcratr_nab_less_than_odr]) ========

2012/7/2 Lars Ellenberg <lars.ellenberg at linbit.com>:
> On Mon, Jul 02, 2012 at 10:10:39PM +0800, Lyre wrote:
>> We are still looking into it. AFAIK, if we change protocol from A to B, or
>> if we enable archivelog, oracle on secondary is able to start after primary
>> reboot.
>
> So you rotate/delete your online redo logs too early,
> even though they may still be needed for an instance recovery.
> Or you "forgot" to replicate those redo logs.
>
> Or you don't replicate redo logs and database in the same stream,
> while you should.
>
>> On 2012-7-2 at 10:03 PM, "Radu Radutiu" <rradutiu at gmail.com> wrote:
>>
>> > Are all your database files on the same filesystem or at least on the same
>> > DRBD resource? I think that this type of error shows that your control file
>> > is not in sync with the other db files. Performing PITR will allow you to
>> > bring the db online with no or very little data loss. Also see
>> > https://forums.oracle.com/forums/thread.jspa?threadID=1088888
>> >
>> > Radu
>> >
>
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list   --   I'm subscribed