I should give some additional information. We built a two-node cluster with DRBD. The configuration is as follows:

    resource drbd1 {
        protocol A;
        on host41 {
            device /dev/drbd0 minor 0;
            disk /dev/vgdrbd/oracle;
            address ipv4 192.168.4.41:7780;
            meta-disk internal;
        }
        on host42 {
            device /dev/drbd0 minor 0;
            disk /dev/vgdrbd/oracle;
            address ipv4 192.168.4.42:7780;
            meta-disk internal;
        }
        net {
            ping-timeout 10;
        }
        disk {
            on-io-error pass_on;
        }
        syncer {
            rate 1000M;
            csums-alg md5;
            verify-alg crc32c;
        }
    }

/dev/vgdrbd/oracle is built on an external IBM DS3512 SAS storage array, connected to the hosts with the RDAC multipath software.

We found that Oracle cannot be started after the following steps:

1. Start Oracle on host41. Oracle runs fine once started.
2. Confirm that the state of drbd1 is Connected, Primary/Secondary, UpToDate/UpToDate.
3. On host42, run:

    > drbdadm disconnect drbd1; drbdadm primary drbd1
    > mount /dev/drbd /oradata
    > start_oracle.sh

Oracle then fails to start on host42 and reports the following error:

    ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [162], [678757], [683523], [], [], [], [], [], [], []

There is only ONE DRBD device for the Oracle database files.

Enabling Oracle archivelog does not eliminate this error, but changing the protocol to B does.

We then added drbd_trace and found that all sectors were synced completely before the disconnect. Here is the log:

    Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037: drbd1_worker [13498] data >>> Data (sector 74104s, id ffff88067728ce48, seq 19051, f 2)
    Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037: drbd1_worker [13498] data >>> Data (sector 285664s, id ffff8805c3e75b38, seq 19052, f 2)
    Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_worker [13498] data >>> UnplugRemote (7)
    Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_worker [13498] data >>> Barrier (barrier 957522216)
    Jul 3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [13528] meta <<< BarrierAck (barrier 957522214)
    Jul 3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [13528] meta <<< BarrierAck (barrier 957522215)
    Jul 3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [13528] meta <<< BarrierAck (barrier 957522216)
    Jul 3 13:55:17 localhost kernel: block drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
    Jul 3 13:55:17 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_receiver [13514] meta >>> StateChgReply (ret 1)
    Jul 3 13:55:17 localhost kernel: block drbd1: new current UUID E2418467BFAB2889:827E3E61AA89BBE1:071CA1C9E3FEC914:071BA1C9E3FEC914
    Jul 3 13:55:17 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [13514] data <<< StateChgRequest (m 1f0 v 10 { conn( Disconnecting )})

    Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Data (sector 74104s, id ffff88067728ce48, seq 19051, f 2)
    Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Data (sector 285664s, id ffff8805c3e75b38, seq 19052, f 2)
    Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< UnplugRemote (7)
    Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Barrier (barrier 957522216)
    Jul 3 13:55:08 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_worker [3248] meta >>> BarrierAck (barrier 957522216)
    Jul 3 13:55:08 localhost kernel: block drbd1: drbd_main.c:2128: cqueue [3216] data >>> StateChgRequest (m 1f0 v 10 { conn( Disconnecting )})
    Jul 3 13:55:09 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [19247] meta <<< StateChgReply (ret 1)
    Jul 3 13:55:09 localhost kernel: block drbd1: meta connection shut down by peer.
    Jul 3 13:55:09 localhost kernel: block drbd1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )

Additional Oracle alert.log:

    Dump file /oracle/app/oracle/diag/rdbms/ems/ems/incident/incdir_136177/ems_ora_23990_i136177.trc
    Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
    With the Partitioning, OLAP, Data Mining and Real Application Testing options
    ORACLE_HOME = /oracle/app/oracle/product/11.2.0/dbhome_1
    System name: Linux
    Node name: host41
    Release: 2.6.32-220.el6.x86_64
    Version: #1 SMP Thu May 31 22:47:14 EDT 2012
    Machine: x86_64
    Instance name: ems
    Redo thread mounted by this instance: 1
    Oracle process number: 22
    Unix process pid: 23990, image: oracle at host41 (TNS V1-V3)

    *** 2012-06-29 09:50:12.161
    *** SESSION ID:(727.3) 2012-06-29 09:50:12.161
    *** CLIENT ID:() 2012-06-29 09:50:12.161
    *** SERVICE NAME:() 2012-06-29 09:50:12.161
    *** MODULE NAME:(sqlplus at cgsl-160 (TNS V1-V3)) 2012-06-29 09:50:12.161
    *** ACTION NAME:() 2012-06-29 09:50:12.161

    Dump continued from file: /oracle/app/oracle/diag/rdbms/ems/ems/trace/ems_ora_23990.trc
    ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [162], [678757], [683523], [], [], [], [], [], [], []

    ========= Dump for incident 136177 (ORA 600 [kcratr_nab_less_than_odr]) ========

2012/7/2 Lars Ellenberg <lars.ellenberg at linbit.com>:
> On Mon, Jul 02, 2012 at 10:10:39PM +0800, Lyre wrote:
>> We are still looking into it. AFAIK, if we change protocol from A to B, or
>> if we enable archivelog, oracle on secondary is able to start after primary
>> reboot.
>
> So you rotate/delete your online redo logs too early,
> even though they may still be needed for an instance recovery.
> Or you "forgot" to replicate those redo logs.
>
> Or you don't replicate redo logs and database in the same stream,
> while you should.
>
>> On 2012-7-2 at 10:03 PM, "Radu Radutiu" <rradutiu at gmail.com> wrote:
>>
>> > Are all your database files on the same filesystem or at least on the same
>> > DRBD resource? I think that this type of error shows that your control file
>> > is not in sync with the other db files. Performing PITR will allow you to
>> > bring the db online with no or very little data loss. Also see
>> > https://forums.oracle.com/forums/thread.jspa?threadID=1088888
>> >
>> > Radu
>> >
>
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria.
> __
> please don't Cc me, but send to list -- I'm subscribed
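
For anyone who wants to repeat the trace comparison on a longer capture, here is a minimal Python sketch of the check described above: every Data packet logged as sent (`>>>`) should also appear as received (`<<<`) with the same sector and seq, and every Barrier sent should have a matching BarrierAck received. It assumes the drbd_trace lines from both nodes are concatenated into one text and look exactly like the excerpt in this message. The function name `summarize` and the regular expressions are made up for illustration and are not part of DRBD. An empty result only says that the packets in the trace pair up; it says nothing about what Oracle finds on disk.

```python
import re

# Patterns for the drbd_trace line format shown in this thread (an assumption:
# other DRBD versions or trace levels may print these lines differently).
DATA_RE = re.compile(r"data (>>>|<<<) Data \(sector (\d+)s, id \w+, seq (\d+), f \d+\)")
BARRIER_RE = re.compile(r"data (>>>|<<<) Barrier \(barrier (\d+)\)")
ACK_RE = re.compile(r"meta (>>>|<<<) BarrierAck \(barrier (\d+)\)")


def summarize(log_text):
    """Return Data packets sent but never received, and Barriers sent but never acked."""
    sent, received = set(), set()
    barriers_sent, acks_received = set(), set()
    for line in log_text.splitlines():
        m = DATA_RE.search(line)
        if m:
            key = (int(m.group(3)), int(m.group(2)))  # (seq, sector)
            (sent if m.group(1) == ">>>" else received).add(key)
            continue
        m = BARRIER_RE.search(line)
        if m:
            if m.group(1) == ">>>":
                barriers_sent.add(int(m.group(2)))
            continue
        m = ACK_RE.search(line)
        if m and m.group(1) == "<<<":
            acks_received.add(int(m.group(2)))
    return {
        "data_not_received": sorted(sent - received),
        "barriers_not_acked": sorted(barriers_sent - acks_received),
    }


# A few lines copied from the trace above, sender side first, then receiver side.
SAMPLE = """\
Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037: drbd1_worker [13498] data >>> Data (sector 74104s, id ffff88067728ce48, seq 19051, f 2)
Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:3037: drbd1_worker [13498] data >>> Data (sector 285664s, id ffff8805c3e75b38, seq 19052, f 2)
Jul 3 13:55:15 localhost kernel: block drbd1: drbd_main.c:2128: drbd1_worker [13498] data >>> Barrier (barrier 957522216)
Jul 3 13:55:16 localhost kernel: block drbd1: drbd_receiver.c:4973: drbd1_asender [13528] meta <<< BarrierAck (barrier 957522216)
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Data (sector 74104s, id ffff88067728ce48, seq 19051, f 2)
Jul 3 13:55:08 localhost kernel: block drbd1: drbd_receiver.c:4031: drbd1_receiver [18869] data <<< Data (sector 285664s, id ffff8805c3e75b38, seq 19052, f 2)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Keying on (seq, sector) rather than on the request id keeps the comparison independent of kernel pointer values, which are only meaningful on the sending node.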