[DRBD-user] oracle on drbd failed

Mia Lueng xiaozunvlg at gmail.com
Sun Sep 2 22:45:32 CEST 2012

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


I used drbd_trace to trace DRBD write operations while running Oracle,
and it shows information like this:

block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:2152:
drbd0_worker [5323] data >>> Barrier (barrier 435610040)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_receiver.c:5005:
drbd0_asender [11122] meta <<< BarrierAck (barrier 435610037)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_receiver.c:5005:
drbd0_asender [11122] meta <<< BarrierAck (barrier 435610038)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_receiver.c:5005:
drbd0_asender [11122] meta <<< BarrierAck (barrier 435610039)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_receiver.c:5005:
drbd0_asender [11122] meta <<< BarrierAck (barrier 435610040)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 1b5000s, offset=36a00000, id
ffff880080cadd68, seq 30957, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 1b5008s, offset=36a01000, id
ffff880080cad438, seq 30958, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 1b5010s, offset=36a02000, id
ffff880080cad4a8, seq 30959, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 1b5018s, offset=36a03000, id
ffff880080cadcf8, seq 30960, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:2152:
drbd0_worker [5323] data >>> UnplugRemote (7)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:2152:
drbd0_worker [5323] data >>> Barrier (barrier 435610041)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 0s, offset=0, id
ffff880080cad0b8, seq 30961, size=0, f 2a)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 0s, offset=0, id
ffff880080cad358, seq 30962, size=0, f 2a)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 0s, offset=0, id
ffff880080cad128, seq 30963, size=0, f 2a)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:2152:
drbd0_worker [5323] data >>> Barrier (barrier 435610042)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 46f80s, offset=8df0000, id
ffff880080cad3c8, seq 30964, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 46f88s, offset=8df1000, id
ffff880080cade48, seq 30965, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 46f90s, offset=8df2000, id
ffff880080cad908, seq 30966, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:3062:
drbd0_worker [5323] data >>> Data (sector 46f98s, offset=8df3000, id
ffff880080cad898, seq 30967, size=1000, f 2)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:2152:
drbd0_worker [5323] data >>> UnplugRemote (7)
block drbd0: /root/rpmbuild/BUILD/drbd-8.3.13/drbd/drbd_main.c:2152:
drbd0_worker [5323] data >>> Barrier (barrier 435610043)

It's obvious that the Oracle instance writes blocks of size
4*0x1000 = 4*4096 bytes (16 KiB). Is it possible that the failure occurred
because the secondary node could not receive and write the full 4*4096-byte
block when the network failed?
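
For readers decoding the trace: the sector and size fields are hexadecimal,
so each Data packet carries 4 KiB and the four packets cover one contiguous
16 KiB range. A quick check with shell arithmetic:

    printf '%d\n' 0x1000                     # -> 4096: payload of one Data packet
    echo $(( (0x1b5008 - 0x1b5000) * 512 ))  # -> 4096: gap between consecutive sectors
    echo $(( 4 * 0x1000 ))                   # -> 16384: the four packets together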

If that is the case, how should this situation be handled?
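
For context, one guard that is often suggested for setups like this (my own
sketch, not something confirmed in this thread) is to replicate synchronously
and checksum every replicated packet. This assumes DRBD 8.3 and the resource
name drbd0 used in the commands quoted below:

    # In drbd.conf (assumed settings, adjust to your configuration):
    #   protocol C;                # a write completes only once the peer has it on disk
    #   data-integrity-alg sha1;   # checksum each data packet on the wire (net section)
    # Then apply the changed configuration and check the connection state:
    drbdadm adjust drbd0
    cat /proc/drbd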

2012/8/30 Felix Frank <ff at mpexnet.de>:
> On 08/30/2012 09:38 AM, Felix Frank wrote:
>>> > I think you just misunderstood me. The key action for this test is:
>>> >
>>> > drbdadm disconnect
>>> > drbdadm primary
>>> >
>>> > which simulates the situation where the primary has crashed, to test whether
>>> > Oracle can fail over to the secondary node.
>>> >
>>> > drbdadm --discard-my-data connect drbd0
>>> >
>>> > this action just keeps the secondary's data in sync with the primary's data
>>> > for the next test.
>> ...assuming the primary had not accumulated some minor corruptions
>> during an earlier loop iteration.
>
> Which reminds me: After failing a protocol A resource, it's important to
> perform a verify.
>
> Oracle *will* clean up any mess on the new primary, but without a full
> sync back, you cannot be entirely sure that the old primary does not
> retain any old writes that hadn't made it to the new primary. The
> activity log is supposed to protect you from this, but I disbelieve it
> can keep you 100% safe.
>
> Cheers,
> Felix
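
As an illustration of the verify step Felix recommends (a sketch on my part,
assuming the resource is named drbd0 and that a verify-alg is configured in
its net section):

    drbdadm verify drbd0    # start an online verify against the peer
    grep oos: /proc/drbd    # out-of-sync sectors found so far show up in the oos: field
    # Verify only marks mismatched blocks; a disconnect/connect cycle
    # resynchronises the blocks that were marked:
    drbdadm disconnect drbd0
    drbdadm connect drbd0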


