[Drbd-dev] DRBD8: Sync hangs due to using freed ee

Fri Apr 20 00:03:25 CEST 2007

Hi all,

This is my second time reporting this problem. The first time I did not
have much information other than the logs, but this time I think I
understand a bit more about the problem.
So here it is.

Problem:
We are syncing and we are the sync target.
We submit a write I/O request, and when it completes bio_endio calls
drbd_endio_write_sec().
When we access the sector field of the ee/bio in this routine, it is
poisoned because we have slab debugging turned on. The poison is
6b6b6b6b... This indicates that we are freeing the ee and then
proceeding to use it.
The symptom is that we get:
Apr 12 02:36:22  kernel: drbd1: drbd_rs_complete_io() called, but extent not found
Apr 12 02:36:22  kernel: drbd1: al_complete_io() called on inactive extent 1532713819

and we loop forever with:
Apr 12 02:43:19 ellwood kernel: drbd2: Retrying drbd_rs_del_all() later. refcnt=1

A look at the code shows that we add to the done_ee list in two places:
./src/drbd/drbd_receiver.c:    list_add_tail(&e->w.list,&mdev->done_ee);
./src/drbd/drbd_worker.c  :    list_add_tail(&e->w.list,&mdev->done_ee);

In ./src/drbd/drbd_receiver.c:receive_data() we add to the list, wake up
the asender, and return. We do not attempt to use any of the ee's
afterwards, so all may be fine there.

However, in ./src/drbd/drbd_worker.c:drbd_endio_write_sec() we add to
the list and then call drbd_rs_complete_io(mdev,e->sector) and
drbd_al_complete_io(mdev,e->sector), using the ee after it was placed on
the done list, at which point it could already have been freed. I
suspect it is being freed in process_done_ee(), where drbd_free_ee is
called.

Instrumentation shows that the bio and ee are intact at the beginning of
drbd_endio_write_sec()
as shown in the log below:

Apr 19 17:28:12 godzilla kernel: drbd1: drbd_endio_write_sec: EM-- XX BUG!! e->sector=7740398493674204011s cap_sector=10267584s size=1802201963 bytes_done=32768  bio->bi_sector=774039849367420400

Apr 19 17:28:12 godzilla kernel:  [<c0105a67>] dump_stack+0x17/0x20
Apr 19 17:28:12 godzilla kernel:  [<ee2d0caf>] drbd_endio_write_sec+0x16f/0x3a0 [drbd]
Apr 19 17:28:12 godzilla kernel:  [<c0177319>] bio_endio+0x59/0x90
Apr 19 17:28:12 godzilla kernel:  [<ee0ab2b9>] dec_pending+0x39/0x70 [dm_mod]
Apr 19 17:28:12 godzilla kernel:  [<ee0ab391>] clone_endio+0xa1/0xc0 [dm_mod]
Apr 19 17:28:12 godzilla kernel:  [<c0177319>] bio_endio+0x59/0x90
Apr 19 17:28:12 godzilla kernel:  [<c022638a>] __end_that_request_first+0xba/0x330
Apr 19 17:28:12 godzilla kernel:  [<c0226618>] end_that_request_chunk+0x8/0x10
Apr 19 17:28:12 godzilla kernel:  [<ee0209f5>] scsi_end_request+0x25/0xe0 [scsi_mod]
Apr 19 17:28:12 godzilla kernel:  [<ee020c82>] scsi_io_completion+0xd2/0x3c0 [scsi_mod]
Apr 19 17:28:12 godzilla kernel:  [<ee06a02a>] sd_rw_intr+0x14a/0x2c0 [sd_mod]

Notice the poisoned e->sector, captured just before we call
drbd_rs_complete_io with it.
More importantly, notice the value of cap_sector, which is the value the
kernel handed us when bio_endio called our callback.

What is the best way to fix this? Is it a matter of just moving
list_add_tail(&e->w.list,&mdev->done_ee);
to after we use e->sector?

Thanks,

EM--