[DRBD-user] NULL deref at drbd_submit_peer_request

Lars Ellenberg lars.ellenberg at linbit.com
Fri Feb 24 13:42:15 CET 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Thu, Feb 23, 2017 at 12:31:21PM +0530, Sunil Kumar wrote:
> Hi Everyone,
> 
> Thanks for the reply Robert.


> [ 8835.609449]  [<ffffffffa01e4398>] __put_ldev+0x198/0x1c0 [drbd]
> [ 8835.609458]  [<ffffffffa01f1d4d>] drbd_adm_attach+0xf1d/0x10f0 [drbd]

> The drbd_md_endio() routine release the ldev reference even called from
> drbd_md_read(). Which should not happen as ldev reference is not taken in
> drbd_md_read() routine. The drbd_md_endio() calls drbd_md_put_buffer() which
> release the mdio usage count (i.e. mdio.in_use) and enables the
> drbd_adm_attach() to proceed further.
> --------------------------------------------------------------------
> BIO_ENDIO_TYPE drbd_md_endio BIO_ENDIO_ARGS(struct bio *bio, int error)
> {
>         ...
>         drbd_md_put_buffer(device);
>         device->md_io.done = 1;
>         wake_up(&device->misc_wait);
>         bio_put(bio);
> 
>         if (device->ldev) /* special case: drbd_md_read() during
> drbd_adm_attach() */
>                 put_ldev(device);
> 
>         BIO_ENDIO_FN_RETURN;
> }
> --------------------------------------------------------------------

Interesting.

So the context that drbd_md_endio() runs in was not scheduled 
for all the time it takes for drbd_adm_attach() to go all the way from
"drbd_md_read()" to ... "device->ldev = nbc;", several lock/unlock,
memory barriers and so on... and only then the drbd_md_endio() context
managed to evaluate the next line of code?

Where endio functions must not block, because they may be run in
any context, including IRQ context.

I'm not sure I like that working theory.

But if you have reason to suspect this as root cause,
I'd rather fix it like this:

diff --git a/drbd/drbd_worker.c b/drbd/drbd_worker.c
index bc60c53..033a2bd 100644
--- a/drbd/drbd_worker.c
+++ b/drbd/drbd_worker.c
@@ -61,12 +61,16 @@ static int make_resync_request(struct drbd_device *, int);
 BIO_ENDIO_TYPE drbd_md_endio BIO_ENDIO_ARGS(struct bio *bio, int error)
 {
 	struct drbd_device *device;
+	bool do_put_ldev;
 
 	BIO_ENDIO_FN_START;
 
 	device = bio->bi_private;
 	device->md_io.error = error;
 
+	/* special case: drbd_md_read() during drbd_adm_attach() */
+	do_put_ldev = device->ldev != NULL;
+
 	/* We grabbed an extra reference in _drbd_md_sync_page_io() to be able
 	 * to timeout on the lower level device, and eventually detach from it.
 	 * If this io completion runs after that timeout expired, this
@@ -82,7 +86,7 @@ BIO_ENDIO_TYPE drbd_md_endio BIO_ENDIO_ARGS(struct bio *bio, int error)
 	device->md_io.done = 1;
 	wake_up(&device->misc_wait);
 	bio_put(bio);
-	if (device->ldev) /* special case: drbd_md_read() during drbd_adm_attach() */
+	if (do_put_ldev)
 		put_ldev(device);
 
 	BIO_ENDIO_FN_RETURN;


> After the above changes the ASSERT is not getting triggered.
> Need you help in figure out the possible regression due to above changes.
> Or is there any other way to fix this issue ?

I find it hard to believe that this would actually occur,
but if you can reproduce without above patch,
and no longer reproduce with above patch,
I'd consider it empirically proven.

Thanks,


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed



More information about the drbd-user mailing list