[Drbd-dev] DRBD-8: BUG when disk write errors occur during heavy I/O

Graham, Simon Simon.Graham at stratus.com
Thu Jan 11 21:39:10 CET 2007


> 
> Simon, this is an excellent description of what is going on. I have
> gone through it as well, and think that moving dec_local() is the
> correct solution.
> 
> I have just committed it:
> http://lists.linbit.com/pipermail/drbd-cvs/2007-January/001421.html
> 

Phil,

It turns out that this fix, whilst necessary I think, is not sufficient
-- specifically, it does not cover the case where the local request
fails and then later on the network request is ACK'd...

. When the local request fails, we run through
  req_mod(write_completed_with_error) and at the end do the dec_local().
. If some other thread was attempting to set the local disk Diskless, it
  will now see local_cnt==0 and run, releasing the act_log and resync
  caches.
. Now the network request is acked and we run
  req_mod(write_acked_by_peer). Since both local and remote are now done,
  req_may_be_done does its thing and ends up calling drbd_al_complete_io,
  which crashes because act_log is now NULL.
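
To make the ordering concrete, here is a minimal single-threaded sketch that replays the three steps above. It is a toy model, not the DRBD code: every toy_* name is hypothetical, act_log is just a pointer to caller-owned storage, and where the real completion path would dereference a NULL act_log, the model only reports that it would have.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or functions are the real DRBD code. */
struct toy_device {
    int   local_cnt;  /* references held on the local disk */
    void *act_log;    /* stands in for the activity log cache */
};

/* Take a reference on the local disk for one in-flight write. */
void toy_inc_local(struct toy_device *d) { d->local_cnt++; }

/* Drop that reference (the role dec_local() plays in the mail). */
void toy_dec_local(struct toy_device *d) { d->local_cnt--; }

/* Detach path: releases the cache only when nobody holds a reference.
 * The real code would free the cache; the model just clears the pointer. */
bool toy_try_detach(struct toy_device *d)
{
    if (d->local_cnt != 0)
        return false;
    d->act_log = NULL;
    return true;
}

/* Completion path: returns false where the real code would have
 * dereferenced a NULL act_log. */
bool toy_al_complete_io(const struct toy_device *d)
{
    return d->act_log != NULL;
}

/* Replays the three steps in order; returns true if completion was safe. */
bool toy_replay_race(void)
{
    int log_storage = 0;
    struct toy_device d = { 0, &log_storage };

    toy_inc_local(&d);   /* write submitted locally and to the peer */
    toy_dec_local(&d);   /* step 1: local write fails, reference dropped */
    toy_try_detach(&d);  /* step 2: detach sees local_cnt == 0 and releases */
    return toy_al_complete_io(&d);  /* step 3: peer ACK completes the request */
}
```

In this model toy_replay_race() returns false: the request is still outstanding on the network side when its reference is dropped, so nothing stops the detach in step 2.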

Now - one fix would be to check for act_log being NULL in
drbd_al_complete_io. However, I wonder if it might be more correct to
delay doing the dec_local() until we are definitely done with the
request?

This would mean moving it out of req_mod() completely and instead doing
it in req_may_be_done(), once the request actually is complete on both
sides... (and if the RQ_LOCAL_COMPLETED flag is set, I think)
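
A rough sketch of what I have in mind, again as a toy model rather than a patch: all toy_* and TOY_* names are hypothetical, the flags only loosely mirror the request flags, and TOY_REF_HELD is an invented bookkeeping bit so the reference is dropped exactly once. The point is just that the reference is released in the "may be done" check, after both sides have finished, instead of at local completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the real DRBD structures. */
enum {
    TOY_LOCAL_PENDING   = 1 << 0,  /* local write still in flight */
    TOY_LOCAL_COMPLETED = 1 << 1,  /* local write finished (ok or error) */
    TOY_NET_PENDING     = 1 << 2,  /* waiting for the peer's ACK */
    TOY_REF_HELD        = 1 << 3   /* request still pins the local disk */
};

struct toy_device {
    int   local_cnt;  /* references held on the local disk */
    void *act_log;    /* stands in for the activity log cache */
};

struct toy_request {
    struct toy_device *dev;
    unsigned flags;
};

/* Start a write on both sides and pin the local disk. */
void toy_submit(struct toy_request *r, struct toy_device *d)
{
    r->dev = d;
    r->flags = TOY_LOCAL_PENDING | TOY_NET_PENDING | TOY_REF_HELD;
    d->local_cnt++;
}

/* Detach path: refuses to release the cache while references remain. */
bool toy_try_detach(struct toy_device *d)
{
    if (d->local_cnt != 0)
        return false;
    d->act_log = NULL;
    return true;
}

/* Returns true once the request is complete on both sides. Only then is
 * the activity log touched and the reference dropped. */
bool toy_may_be_done(struct toy_request *r)
{
    if (r->flags & (TOY_LOCAL_PENDING | TOY_NET_PENDING))
        return false;
    if ((r->flags & TOY_LOCAL_COMPLETED) && (r->flags & TOY_REF_HELD)) {
        assert(r->dev->act_log != NULL);  /* still pinned by our reference */
        r->flags &= ~(unsigned)TOY_REF_HELD;
        r->dev->local_cnt--;              /* the delayed dec_local() */
    }
    return true;
}

/* Local write failed: mark it completed, but keep the reference. */
bool toy_local_write_failed(struct toy_request *r)
{
    r->flags &= ~(unsigned)TOY_LOCAL_PENDING;
    r->flags |= TOY_LOCAL_COMPLETED;
    return toy_may_be_done(r);
}

/* Peer ACK arrived: the request may now be fully done. */
bool toy_acked_by_peer(struct toy_request *r)
{
    r->flags &= ~(unsigned)TOY_NET_PENDING;
    return toy_may_be_done(r);
}
```

With this ordering, a detach attempted between the local failure and the peer ACK is refused because local_cnt is still 1, and it only succeeds after the ACK has been processed. Whether holding the reference that long is acceptable for the real detach path is a separate question.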

Simon

