Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, Feb 28, 2017 at 06:05:10PM +0530, Sunil Kumar wrote:
> Hi Lars,
>
> Thanks for the suggested changes, it solves the problem.
>
> >So the context that drbd_md_endio() runs in was not scheduled
> >for all the time it takes for drbd_adm_attach() to go all the way from
> >"drbd_md_read()" to ... "device->ldev = nbc;", several lock/unlock,
> >memory barriers and so on... and only then the drbd_md_endio() context
> >managed to evaluate the next line of code?
>
> I don't see an issue with drbd_md_endio() scheduling, its schedules
> properly. The only problem was 'drbd_md_put_buffer' which basically
> wakes up drbd_adm_attach process and makes ldev as valid pointer in
> case of drbd_md_read().
See
http://git.linbit.com/drbd-8.4.git/commitdiff/5f71d962446267f21c47cec59216fcadee2b4e8d
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
Race:
drbd_adm_attach() | async drbd_md_endio()
|
device->ldev is still NULL. |
|
drbd_md_read( |
.endio = drbd_md_endio; |
submit; |
.... |
wait for done == 1; | done = 1;
); | wake_up();
.. lot of other stuff, |
.. includeing taking and |
...giving up locks, |
.. doing further IO, |
.. stuff that takes "some time" |
| while in this context,
| this is the next statement.
0K . | which means this context was scheduled
.. only then, finally, | away for "some time".
device->ldev = nbc; |
| if (device->ldev)
| put_ldev()
Unlikely, but possible. I was able to provoke it "reliably"
by adding an mdelay(500); after the wake_up().
Fixed by moving the if (!NULL) put_ldev() before done = 1;
Impact of the bug was that the resulting refcount imbalance
could lead to premature destruction of the object, potentially
causing a NULL pointer dereference during a subsequent detach.
That's what I meant with "not scheduled for all the time ..."
Thanks,
Lars