I should give some additional information.
We built a two-node cluster with DRBD. The configuration is as follows:
resource drbd1 {
    protocol A;
    on host41 {
        device /dev/drbd0 minor 0;
        disk /dev/vgdrbd/oracle;
        address ipv4 192.168.4.41:7780;
        meta-disk internal;
    }
    on host42 {
        device /dev/drbd0 minor 0;
        disk /dev/vgdrbd/oracle;
        address ipv4 192.168.4.42:7780;
        meta-disk internal;
    }
    net {
        ping-timeout 10;
    }
    disk {
        on-io-error pass_on;
    }
    syncer {
        rate 1000M;
        csums-alg md5;
        verify-alg crc32c;
    }
}
/dev/vgdrbd/oracle is built on an external IBM DS3512 SAS storage array,
connected to the hosts using the RDAC multipath software.
We found that Oracle cannot be started after the following procedure:
1. Start Oracle on host41. Oracle runs fine once started.
2. Confirm that the state of drbd1 is Connected, Primary/Secondary,
UpToDate/UpToDate.
3. On host42, run:
> drbdadm disconnect drbd1; drbdadm primary drbd1
> mount /dev/drbd /oradata
> start_oracle.sh
Oracle then fails to start on host42 and reports the following error:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1],
[162], [678757], [683523], [], [], [], [], [], [], []
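The state check in step 2 could be scripted roughly as follows. This is only a
sketch: the SAMPLE line is a made-up example in the style of a /proc/drbd
status line (it is not output captured from host41 or host42), and the helper
names parse_status and ready_for_test are hypothetical.

```python
import re

# Hypothetical status line in the style of /proc/drbd; not taken from our hosts.
SAMPLE = " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----"


def parse_status(line):
    """Return the cs (connection), ro (roles) and ds (disk states) fields."""
    return dict(re.findall(r"\b(cs|ro|ds):(\S+)", line))


def ready_for_test(line):
    """True only if the device is in the state required by step 2."""
    fields = parse_status(line)
    return (
        fields.get("cs") == "Connected"
        and fields.get("ro") == "Primary/Secondary"
        and fields.get("ds") == "UpToDate/UpToDate"
    )


if __name__ == "__main__":
    print(parse_status(SAMPLE))
    print(ready_for_test(SAMPLE))
```

Running such a check on host41 right before the disconnect in step 3 would
make sure the test starts from the state described above.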
There is only ONE DRBD device for the Oracle database files. Enabling
Oracle archivelog does not eliminate this error, but changing the
protocol to B does.
We then enabled drbd_trace and found that all sectors had been synced
completely before the disconnect. Here is the log:
Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037: drbd1_worker [13498] data >>> Data (sector 74104s, id ffff88067728ce48, seq 19051, f 2)
Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037: drbd1_worker [13498] data >>> Data (sector 285664s, id ffff8805c3e75b38, seq 19052, f 2)
Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_worker [13498] data >>> UnplugRemote (7)
Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_worker [13498] data >>> Barrier (barrier 957522216)
Jul 3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [13528] meta <<< BarrierAck (barrier 957522214)
Jul 3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [13528] meta <<< BarrierAck (barrier 957522215)
Jul 3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [13528] meta <<< BarrierAck (barrier 957522216)
Jul 3 13:55:17 localhost kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Jul 3 13:55:17 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_receiver [13514] meta >>> StateChgReply (ret 1)
Jul 3 13:55:17 localhost kernel: block drbd1: new current UUID E2418467BFAB2889:827E3E61AA89BBE1:071CA1C9E3FEC914:071BA1C9E3FEC914
Jul 3 13:55:17 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [13514] data <<< StateChgRequest (m 1f0 v 10 { conn( Disconnecting )})
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Data (sector 74104s, id ffff88067728ce48, seq 19051, f 2)
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Data (sector 285664s, id ffff8805c3e75b38, seq 19052, f 2)
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< UnplugRemote (7)
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Barrier (barrier 957522216)
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_worker [3248] meta >>> BarrierAck (barrier 957522216)
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_main.c:2128: cqueue [3216] data >>> StateChgRequest (m 1f0 v 10 { conn( Disconnecting )})
Jul 3 13:55:09 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [19247] meta <<< StateChgReply (ret 1)
Jul 3 13:55:09 localhost kernel: block drbd1: meta connection shut down by peer.
Jul 3 13:55:09 localhost kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
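The comparison we did by eye on the two trace excerpts above could be sketched
in Python like this. It is only an illustration: the function names are
hypothetical, the regular expressions assume the trace format shown in this
mail, and it only checks the bookkeeping visible in the trace (Data packets
sent versus received, Barriers sent versus acknowledged). It says nothing
about what actually reached the disk on either node.

```python
import re

# Short excerpts copied from the trace above (sender side, then receiver side).
SENDER_LOG = """
drbd1_worker [13498] data >>> Data (sector 74104s, id
ffff88067728ce48, seq 19051, f 2)
drbd1_worker [13498] data >>> Data (sector 285664s, id
ffff8805c3e75b38, seq 19052, f 2)
drbd1_worker [13498] data >>> Barrier (barrier 957522216)
drbd1_asender [13528] meta <<< BarrierAck (barrier 957522216)
"""
RECEIVER_LOG = """
drbd1_receiver [18869] data <<< Data (sector 74104s, id
ffff88067728ce48, seq 19051, f 2)
drbd1_receiver [18869] data <<< Data (sector 285664s, id
ffff8805c3e75b38, seq 19052, f 2)
"""


def _flatten(text):
    # Entries are hard-wrapped by the mail client, so collapse all whitespace.
    return " ".join(text.split())


def data_packets(log_text, arrow):
    """Return the set of (sector, seq) pairs for Data packets; arrow is '>>>' or '<<<'."""
    pattern = re.escape(arrow) + r" Data \(sector (\d+)s, id [0-9a-f]+, seq (\d+),"
    return set(re.findall(pattern, _flatten(log_text)))


def unacked_barriers(log_text):
    """Return barrier numbers that were sent but have no BarrierAck in the log."""
    flat = _flatten(log_text)
    sent = set(re.findall(r">>> Barrier \(barrier (\d+)\)", flat))
    acked = set(re.findall(r"<<< BarrierAck \(barrier (\d+)\)", flat))
    return sorted(sent - acked)


if __name__ == "__main__":
    missing = data_packets(SENDER_LOG, ">>>") - data_packets(RECEIVER_LOG, "<<<")
    print("Data packets sent but not received:", sorted(missing))
    print("Barriers without ack:", unacked_barriers(SENDER_LOG))
```

On the excerpts in this mail both results are empty, which is what we mean by
"synced completely" above.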
Additional Oracle alert.log output:
Dump file /oracle/app/oracle/diag/rdbms/ems/ems/incident/incdir_136177/ems_ora_23990_i136177.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/dbhome_1
System name: Linux
Node name: host41
Release: 2.6.32-220.el6.x86_64
Version: #1 SMP Thu May 31 22:47:14 EDT 2012
Machine: x86_64
Instance name: ems
Redo thread mounted by this instance: 1
Oracle process number: 22
Unix process pid: 23990, image: oracle@host41 (TNS V1-V3)
*** 2012-06-29 09:50:12.161
*** SESSION ID:(727.3) 2012-06-29 09:50:12.161
*** CLIENT ID:() 2012-06-29 09:50:12.161
*** SERVICE NAME:() 2012-06-29 09:50:12.161
*** MODULE NAME:(sqlplus@cgsl-160 (TNS V1-V3)) 2012-06-29 09:50:12.161
*** ACTION NAME:() 2012-06-29 09:50:12.161
Dump continued from file:
/oracle/app/oracle/diag/rdbms/ems/ems/trace/ems_ora_23990.trc
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr],
[1], [162], [678757], [683523], [], [], [], [], [], [], []
========= Dump for incident 136177 (ORA 600 [kcratr_nab_less_than_odr]) ========
2012/7/2 Lars Ellenberg <lars.ellenberg at linbit.com>:
> On Mon, Jul 02, 2012 at 10:10:39PM +0800, Lyre wrote:
>> We are still looking into it. AFAIK, if we change protocol from A to B, or
>> if we enable archivelog, oracle on secondary is able to start after primary
>> reboot.
>
> So you rotate/delete your online redo logs too early,
> even though they may still be needed for an instance recovery.
> Or you "forgot" to replicate those redo logs.
>
> Or you don't replicate redo logs and database in the same stream,
> while you should.
>
>> On 2012-7-2 at 10:03 PM, "Radu Radutiu" <rradutiu at gmail.com> wrote:
>>
>> > Are all your database files on the same filesystem or at least on the same
>> > DRBD resource? I think that this type of error shows that your control file
>> > is not in sync with the other db files. Performing PITR will allow you to
>> > bring the db online with no or very little data loss. Also see
>> > https://forums.oracle.com/forums/thread.jspa?threadID=1088888
>> >
>> > Radu
>> >
>
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed