Hi everyone,

Thanks for the reply, Robert.

We do allow ample time for the previous detach operation to complete:
approximately 10-40 seconds pass after a detach before the next attach.
With the attached test script it takes approximately 3-5 hours for the
ASSERT (in put_ldev) to trigger.

I have added debug prints at the exit of get_ldev and put_ldev to keep
track of 'local_cnt', and added a kernel panic for the case where
'local_cnt' goes negative. The issue reproduces during the attach
operation. I observed that the extra reference on the ldev ('local_cnt')
is dropped by the drbd_md_endio() routine.

--------------------------------------------------------------------
[ 8819.467228] block drbd6: disk( Failed -> Diskless )
[ 8819.467738] block drbd6: EXIT: put_ldev is called from after_state_ch, count=0 ds=0 thread=drbd_w_r4[2336]
[ 8825.141885] block drbd6: role( Secondary -> Primary )
[ 8826.838617] EXT3 FS on drbd6, internal journal
[ 8828.132369] block drbd6: role( Primary -> Secondary )

/* drbd_md_endio entry */
[ 8835.585126] block drbd6: drbd_md_endio: meta-data in used by drbd_md_read
[ 8835.587660] block drbd6: disk( Diskless -> Attaching )
[ 8835.587675] block drbd6: EXIT: put_ldev is called from __get_ldev_if_state, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.587701] block drbd6: EXIT: get_ldev is called from drbd_adm_attach, count=1 ds=1 wds=1 thread=drbdsetup-84[11848]
[ 8835.587880] block drbd6: EXIT: put_ldev is called from __get_ldev_if_state, count=1 ds=1 thread=drbdsetup-84[11848]
[ 8835.587884] block drbd6: This kernel is too old, no WRITE_SAME support.
[ 8835.587890] block drbd6: EXIT: get_ldev is called from bm_rw, count=2 ds=1 wds=1 thread=drbdsetup-84[11848]

/* drbd_md_endio exit, problem here */
[ 8835.587969] block drbd6: EXIT: put_ldev is called from drbd_md_endio, count=1 ds=1 thread=swapper[0]
[ 8835.590273] block drbd6: recounting of set bits took additional 0 jiffies
[ 8835.590276] block drbd6: 47 MB (12070 bits) marked out-of-sync by on disk bit-map.
[ 8835.590279] block drbd6: EXIT: put_ldev is called from drbd_bm_aio_ctx_destroy, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.590284] block drbd6: EXIT: put_ldev is called from __get_ldev_if_state, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.590286] block drbd6: EXIT: put_ldev is called from __get_ldev_if_state, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.590289] block drbd6: disk( Attaching -> Negotiating )
[ 8835.590296] block drbd6: EXIT: get_ldev is called from drbd_print_uuids, count=1 ds=3 wds=3 thread=drbdsetup-84[11848]
[ 8835.590298] block drbd6: attached to UUIDs 0000000000000004:0000000000000000:0000000000000000:0000000000000000
[ 8835.590300] block drbd6: EXIT: put_ldev is called from drbd_print_uuids, count=0 ds=3 thread=drbdsetup-84[11848]
[ 8835.590302] block drbd6: EXIT: put_ldev is called from __get_ldev_if_state, count=0 ds=3 thread=drbdsetup-84[11848]
[ 8835.590306] block drbd6: EXIT: get_ldev is called from drbd_md_sync, count=1 ds=3 wds=2 thread=drbdsetup-84[11848]
[ 8835.590310] block drbd6: EXIT: get_ldev is called from _drbd_md_sync_page_io, count=2 ds=3 wds=1 thread=drbdsetup-84[11848]
[ 8835.607888] block drbd6: drbd_md_endio: meta-data in used by drbd_md_sync
[ 8835.607912] block drbd6: EXIT: put_ldev is called from drbd_md_endio, count=1 ds=3 thread=swapper[0]
[ 8835.607927] block drbd6: EXIT: put_ldev is called from drbd_md_sync, count=0 ds=3 thread=drbdsetup-84[11848]
[ 8835.608016] block drbd6: ASSERT( i >= 0 ) in drivers/block/drbd/drbd_int.h:2287
[ 8835.608030] Kernel panic - not syncing: drbd_assert_breakpoint
[ 8835.608036] Pid: 11848, comm: drbdsetup-84 Not tainted 2.6.32.59+ #62
[ 8835.608038] Call Trace:
[ 8835.608127]  [<ffffffff8160bc03>] panic+0xfc/0x1c2
[ 8835.609436]  [<ffffffffa01baf04>] drbd_assert_breakpoint+0xa4/0xb0 [drbd]
[ 8835.609449]  [<ffffffffa01e4398>] __put_ldev+0x198/0x1c0 [drbd]
[ 8835.609458]  [<ffffffffa01f1d4d>] drbd_adm_attach+0xf1d/0x10f0 [drbd]
[ 8835.609476]  [<ffffffff812f451f>] ? nla_parse+0xef/0x110
[ 8835.609489]  [<ffffffff8151a186>] genl_rcv_msg+0x1e6/0x220
[ 8835.609495]  [<ffffffff81519fa0>] ? genl_rcv_msg+0x0/0x220
[ 8835.609498]  [<ffffffff81516fd9>] netlink_rcv_skb+0xa9/0xd0
[ 8835.609502]  [<ffffffff815189bc>] genl_rcv+0x2c/0x40
[ 8835.609506]  [<ffffffff81516d23>] netlink_unicast+0x2b3/0x2c0
[ 8835.609510]  [<ffffffff81517e09>] netlink_sendmsg+0x269/0x380
[ 8835.609520]  [<ffffffff81196b5a>] ? __blkdev_get+0x1da/0x410
[ 8835.609529]  [<ffffffff814e0644>] sock_aio_write+0x124/0x1a0
[ 8835.609536]  [<ffffffff812db6e7>] ? kobject_put+0x27/0x60
[ 8835.609549]  [<ffffffff81165d7a>] do_sync_write+0xfa/0x140
[ 8835.609563]  [<ffffffff810b5ce0>] ? autoremove_wake_function+0x0/0x40
[ 8835.609567]  [<ffffffff81167caf>] ? __fput+0x19f/0x210
[ 8835.609572]  [<ffffffff81166124>] vfs_write+0x184/0x1a0
[ 8835.609575]  [<ffffffff81167525>] ? fget_light+0x15/0xc0
[ 8835.609578]  [<ffffffff811669fc>] sys_write+0x5c/0xf0
[ 8835.609592]  [<ffffffff81073a70>] sysenter_dispatch+0x7/0x2e
--------------------------------------------------------------------

The drbd_md_endio() routine releases the ldev reference even when the
I/O was submitted by drbd_md_read(). This should not happen, because
drbd_md_read() does not take an ldev reference. drbd_md_endio() calls
drbd_md_put_buffer(), which releases the md_io usage count (i.e.
md_io.in_use) and lets drbd_adm_attach() proceed before drbd_md_endio()
has reached its 'device->ldev' check.

--------------------------------------------------------------------
BIO_ENDIO_TYPE drbd_md_endio BIO_ENDIO_ARGS(struct bio *bio, int error)
{
	...
	drbd_md_put_buffer(device);
	device->md_io.done = 1;
	wake_up(&device->misc_wait);
	bio_put(bio);
	if (device->ldev) /* special case: drbd_md_read() during drbd_adm_attach() */
		put_ldev(device);
	BIO_ENDIO_FN_RETURN;
}
--------------------------------------------------------------------

So by the time drbd_md_endio() reaches that check, 'device->ldev' is
already valid, and it decrements 'local_cnt' by calling put_ldev(). This
leaves 'local_cnt' inconsistent, and the later put_ldev() call from
drbd_adm_attach() triggers the ASSERT.

To make sure put_ldev() is not called when the I/O was submitted by
drbd_md_read(), I have added the check below:

--------------------------------------------------------------------
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index fc9d266e9b40..0292bf16f4df 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -82,8 +82,12 @@ BIO_ENDIO_TYPE drbd_md_endio BIO_ENDIO_ARGS(struct bio *bio, int error)
 	device->md_io.done = 1;
 	wake_up(&device->misc_wait);
 	bio_put(bio);
-	if (device->ldev) /* special case: drbd_md_read() during drbd_adm_attach() */
+	/* special case: don't call put_ldev when endio initiated
+	   from drbd_md_read() during drbd_adm_attach() */
+	if (device->ldev && strcmp(device->md_io.current_use,"drbd_md_read")) {
+		drbd_info(device, "%s: meta-data is used by %s\n",__func__,device->md_io.current_use);
 		put_ldev(device);
+	}
 	BIO_ENDIO_FN_RETURN;
 }
--------------------------------------------------------------------

With the above change the ASSERT is no longer triggered.

I need your help in figuring out whether this change could cause any
regression. Or is there a better way to fix this issue?

Regards,
Sunil Kumar

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_drbd.sh
Type: application/x-shellscript
Size: 1117 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20170223/fd3bea57/attachment.bin>
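
The reference-count imbalance described in this message can be replayed in a small single-threaded userspace model. This is only a sketch under stated assumptions, not DRBD code: the struct, the `model_*` functions and the `md_io_holds_ref` flag are all hypothetical names introduced for illustration, and the interleaving is simplified to the few steps that matter (the bm_rw and drbd_md_sync get/put pairs from the log are balanced and left out). It also sketches one possible alternative to the strcmp() check, namely recording at submit time whether a reference was taken and sampling that before the buffer is released. Whether that is suitable for the real driver is not something this model can show.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; all names here are hypothetical, not DRBD's. */
struct model_device {
    int  local_cnt;        /* models the ldev reference count */
    bool ldev_set;         /* models device->ldev != NULL */
    bool md_io_in_use;     /* models the meta-data buffer being in use */
    bool md_io_holds_ref;  /* hypothetical: did the submitter take a ref? */
};

static void model_get_ldev(struct model_device *d) { d->local_cnt++; }
static void model_put_ldev(struct model_device *d) { d->local_cnt--; }

/* Submit meta-data I/O. take_ref is false for the drbd_md_read()-like
 * path, which by the description above holds no ldev reference. */
static void model_submit_md_io(struct model_device *d, bool take_ref)
{
    d->md_io_in_use = true;
    d->md_io_holds_ref = take_ref;
    if (take_ref)
        model_get_ldev(d);
}

/* First half of the completion handler: release the buffer, which lets
 * the waiting attach path continue. Returns whether a reference is owed,
 * sampled while the buffer is still owned by this I/O. */
static bool model_endio_release(struct model_device *d)
{
    assert(d->md_io_in_use);
    bool owed = d->md_io_holds_ref;
    d->md_io_in_use = false;
    return owed;
}

/* Second half of the completion handler: decide whether to drop a ref.
 * use_flag == false models the original "if (device->ldev)" check. */
static void model_endio_put(struct model_device *d, bool owed, bool use_flag)
{
    bool should_put = use_flag ? owed : d->ldev_set;
    if (should_put)
        model_put_ldev(d);
}

/* Replays the interleaving from the log and returns the final count. */
int model_attach_race(bool use_flag)
{
    struct model_device d = {0};

    model_submit_md_io(&d, false);        /* drbd_md_read(): no reference */
    bool owed = model_endio_release(&d);  /* attach may now proceed */
    d.ldev_set = true;                    /* attach assigns the ldev ... */
    model_get_ldev(&d);                   /* ... and takes its reference */
    model_endio_put(&d, owed, use_flag);  /* delayed tail of the handler */
    model_put_ldev(&d);                   /* attach drops its reference */
    return d.local_cnt;
}
```

Calling `model_attach_race(false)` ends with a count of -1, which corresponds to the negative count that trips the ASSERT, while `model_attach_race(true)` ends at 0. The model only demonstrates the bookkeeping; it says nothing about locking or memory ordering in the real completion path.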