[Drbd-dev] Crash in lru_cache.c

Sat Jan 12 16:23:58 CET 2008

> > Because there WAS a disk when the request was issued - in fact, the
> > local write to disk completed successfully, but the request is still
> > sitting in the TL waiting for the next barrier to complete. Subsequent
> > to that but while the request is still in the TL, the local disk is
> > detached.
> 
> AND it is re-attached so fast,
> that we have a new (uhm; well, probably the same?) disk again,
> while still the very same request is sitting there
> waiting for that very barrier ack?
> 

You got it!

> now, how unlikely is THAT to happen in real life.
> 

Fairly rare, I agree, although someone could do a 'drbdadm detach' and
then a 'drbdadm attach' -- that's how we hit this situation (we were
doing it as a way to test errors on meta-data reads).

Given that there is no real bound on the lifetime of a request in the
TL, it's also feasible (although unlikely, I agree) that a disk could
fail and be replaced and reattached whilst an old request is still in
the TL...

> > I think this might work but only as a side effect -- if you look back
> > to the sequence I documented, you will see that there has to be a write
> > request to the same AL area after the disk is reattached - this is
> > because drbd_al_complete_io quietly ignores the case where no active
> > AL extent is found for the request being completed.
> 
> huh?
> I simply disallow re-attaching while there are still requests pending
> from before the detach.
> no more (s & RQ_LOCAL_MASK), no more un-accounted for references.
> 

Yes, but those requests that have unaccounted references from before the
detach are still in the TL -- it so happens that the code does not crash
in this case (completing a request in the TL when there is no matching
AL cache entry), but I don't think that's very safe.
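
To make the sequence concrete, here is a toy model of it. This is only a
sketch and not DRBD code: the structures and the al_* helpers are
hypothetical stand-ins for the AL cache, and the "underflow" return value
stands in for whatever check the real lru_cache.c trips over. It assumes
that attach starts with a fresh cache and that completion looks the extent
up by number.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AL_SLOTS 4

/* Hypothetical, simplified stand-ins for the AL cache and a TL request. */
struct al_extent { unsigned int enr; int refcnt; int in_use; };
struct al_cache  { struct al_extent slot[AL_SLOTS]; };
struct request   { unsigned int enr; };

static struct al_extent *al_find(struct al_cache *al, unsigned int enr)
{
    for (size_t i = 0; i < AL_SLOTS; i++)
        if (al->slot[i].in_use && al->slot[i].enr == enr)
            return &al->slot[i];
    return NULL;
}

/* Detach followed by attach: the new disk starts with an empty cache. */
static void al_reset(struct al_cache *al)
{
    memset(al, 0, sizeof(*al));
}

/* Take a reference on the extent, activating a free slot if needed. */
static int al_begin_io(struct al_cache *al, struct request *rq)
{
    struct al_extent *e = al_find(al, rq->enr);

    for (size_t i = 0; !e && i < AL_SLOTS; i++) {
        if (!al->slot[i].in_use) {
            e = &al->slot[i];
            e->in_use = 1;
            e->enr = rq->enr;
            e->refcnt = 0;
        }
    }
    if (!e)
        return -1;              /* cache full */
    e->refcnt++;
    return 0;
}

/*
 * Drop the reference held by rq.
 * Returns 0 for a balanced put, 1 if no extent was found (quietly
 * ignored), and -1 if the put would take the refcount below zero.
 */
static int al_complete_io(struct al_cache *al, struct request *rq)
{
    struct al_extent *e = al_find(al, rq->enr);

    if (!e)
        return 1;
    assert(e->refcnt >= 0);
    if (e->refcnt == 0)
        return -1;
    e->refcnt--;
    return 0;
}

/* Old request only: its completion finds no extent and is ignored. */
int stale_request_ignored(void)
{
    struct al_cache al;
    struct request old = { .enr = 7 };

    al_reset(&al);
    al_begin_io(&al, &old);             /* write issued, request sits in TL */
    al_reset(&al);                      /* detach + attach */
    return al_complete_io(&al, &old);   /* barrier ack arrives */
}

/* Same, plus a new write to the same area after the attach. */
int stale_request_underflow(void)
{
    struct al_cache al;
    struct request old = { .enr = 7 }, fresh = { .enr = 7 };

    al_reset(&al);
    al_begin_io(&al, &old);             /* write issued, request sits in TL */
    al_reset(&al);                      /* detach + attach */
    al_begin_io(&al, &fresh);           /* new write, refcnt is 1 */
    al_complete_io(&al, &old);          /* old request drops fresh's ref */
    return al_complete_io(&al, &fresh); /* unbalanced put */
}
```

In this model stale_request_ignored() takes the "no extent found" path,
while stale_request_underflow() ends with a put against a refcount that is
already zero, which is why the second write to the same AL area matters.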

You also have to trigger a barrier as part of this -- not only block new
requests during attach until the TL is empty but also trigger a barrier
so that the TL will be emptied...

Both of these are why I like the idea of "reconnecting" the requests in
the TL to the AL cache when doing an attach...
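
As a rough illustration of the "reconnecting" idea, the sketch below
re-takes a reference in the fresh cache for every request still in the TL
at attach time, so that later completions stay balanced. Again this is a
toy model under stated assumptions, not DRBD code: RQ_LOCAL, the flat
refcount array, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_EXTENTS 16
#define RQ_LOCAL  0x1u  /* hypothetical flag: request holds an AL reference */

/* Hypothetical model: one refcount per AL extent, TL as a plain array. */
struct al_cache { int refcnt[N_EXTENTS]; };
struct request  { unsigned int enr; unsigned int flags; };

static void al_get(struct al_cache *al, struct request *rq)
{
    assert(rq->enr < N_EXTENTS);
    al->refcnt[rq->enr]++;
    rq->flags |= RQ_LOCAL;
}

/* Returns 0 for a balanced put, -1 if the refcount is already zero. */
static int al_put(struct al_cache *al, struct request *rq)
{
    assert(rq->enr < N_EXTENTS);
    if (al->refcnt[rq->enr] == 0)
        return -1;
    al->refcnt[rq->enr]--;
    rq->flags &= ~RQ_LOCAL;
    return 0;
}

/*
 * Attach: start from an empty cache, then walk the TL and re-take a
 * reference for every request that held one before the detach.
 */
static void al_attach_and_reconnect(struct al_cache *al,
                                    const struct request *tl, size_t n)
{
    memset(al, 0, sizeof(*al));
    for (size_t i = 0; i < n; i++)
        if (tl[i].flags & RQ_LOCAL)
            al->refcnt[tl[i].enr]++;
}

/* Returns -1 on any unbalanced put, else the leftover reference count. */
int reconnect_scenario(void)
{
    struct al_cache al;
    struct request tl[2] = { { .enr = 3 }, { .enr = 5 } };
    struct request fresh = { .enr = 3 };
    int rc = 0;

    memset(&al, 0, sizeof(al));
    al_get(&al, &tl[0]);                  /* writes issued before detach */
    al_get(&al, &tl[1]);
    al_attach_and_reconnect(&al, tl, 2);  /* detach + attach */
    al_get(&al, &fresh);                  /* new write to extent 3 */
    rc |= al_put(&al, &tl[0]);            /* barrier ack for old requests */
    rc |= al_put(&al, &tl[1]);
    rc |= al_put(&al, &fresh);
    return rc ? -1 : al.refcnt[3] + al.refcnt[5];
}
```

With the reconnect step every put in reconnect_scenario() is balanced and
no references are left over. The design choice here is that attach does not
need to wait for the TL to drain or force a barrier.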

> if I understand correctly,
> you can reproduce this easily.
> to underline my point,
> does it still trigger when you do
>  "dd if=/dev/drbdX of=/dev/null bs=1b count=1 iflag=direct ; sleep 5"
> before the re-attach?

So, the real test is to do this _before_ the DETACH, then see what
happens when the requests are removed from the TL.

> for other reasons, I think we need to rewrite the barrier code anyways
> to send out the barrier as soon as possible, and not wait until the
> next io request comes in.

That's an interesting idea -- it would also allow you to use the Linux
barrier mechanism to implement it. I don't think it would handle this
case, though -- you can have requests in the TL that do not yet require a
barrier when you lose the local disk...

Simon