Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Everyone,
Thanks for the reply Robert.
We are giving the ample amount of time to complete the previous detach 
operation, approx 10-40 seconds detach followed by the attach 
operations. It takes approx 3-5 hours for ASSERT (in put_ldev) got 
triggered with the attached test script.
I have added debug prints in get_ldev & put_ldev (while exiting) to keep 
track of the 'local_cnt'and added a kernel panic when the 'local_cnt' 
reference goes negative, the issue is reproducing during attach 
operation. Observed that the extra reference of the 'local_ldev' is 
decremented by the drbd_md_endio() routine.
--------------------------------------------------------------------
[ 8819.467228] block drbd6: disk( Failed -> Diskless )
[ 8819.467738] block drbd6: EXIT: put_ldev is called from 
after_state_ch, count=0 ds=0 thread=drbd_w_r4[2336]
[ 8825.141885] block drbd6: role( Secondary -> Primary )
[ 8826.838617] EXT3 FS on drbd6, internal journal
[ 8828.132369] block drbd6: role( Primary -> Secondary )
/* drbd_md_endio entry */
[ 8835.585126] block drbd6: drbd_md_endio: meta-data in used by 
drbd_md_read
[ 8835.587660] block drbd6: disk( Diskless -> Attaching )
[ 8835.587675] block drbd6: EXIT: put_ldev is called from 
__get_ldev_if_state, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.587701] block drbd6: EXIT: get_ldev is called from 
drbd_adm_attach, count=1 ds=1 wds=1 thread=drbdsetup-84[11848]
[ 8835.587880] block drbd6: EXIT: put_ldev is called from 
__get_ldev_if_state, count=1 ds=1 thread=drbdsetup-84[11848]
[ 8835.587884] block drbd6: This kernel is too old, no WRITE_SAME support.
[ 8835.587890] block drbd6: EXIT: get_ldev is called from bm_rw, count=2 
ds=1 wds=1 thread=drbdsetup-84[11848]
/* drbd_md_endio exit, problem here */
[ 8835.587969] block drbd6: EXIT: put_ldev is called from drbd_md_endio, 
count=1 ds=1 thread=swapper[0]
[ 8835.590273] block drbd6: recounting of set bits took additional 0 jiffies
[ 8835.590276] block drbd6: 47 MB (12070 bits) marked out-of-sync by on 
disk bit-map.
[ 8835.590279] block drbd6: EXIT: put_ldev is called from 
drbd_bm_aio_ctx_destroy, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.590284] block drbd6: EXIT: put_ldev is called from 
__get_ldev_if_state, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.590286] block drbd6: EXIT: put_ldev is called from 
__get_ldev_if_state, count=0 ds=1 thread=drbdsetup-84[11848]
[ 8835.590289] block drbd6: disk( Attaching -> Negotiating )
[ 8835.590296] block drbd6: EXIT: get_ldev is called from 
drbd_print_uuids, count=1 ds=3 wds=3 thread=drbdsetup-84[11848]
[ 8835.590298] block drbd6: attached to UUIDs 
0000000000000004:0000000000000000:0000000000000000:0000000000000000
[ 8835.590300] block drbd6: EXIT: put_ldev is called from 
drbd_print_uuids, count=0 ds=3 thread=drbdsetup-84[11848]
[ 8835.590302] block drbd6: EXIT: put_ldev is called from 
__get_ldev_if_state, count=0 ds=3 thread=drbdsetup-84[11848]
[ 8835.590306] block drbd6: EXIT: get_ldev is called from drbd_md_sync, 
count=1 ds=3 wds=2 thread=drbdsetup-84[11848]
[ 8835.590310] block drbd6: EXIT: get_ldev is called from 
_drbd_md_sync_page_io, count=2 ds=3 wds=1 thread=drbdsetup-84[11848]
[ 8835.607888] block drbd6: drbd_md_endio: meta-data in used by drbd_md_sync
[ 8835.607912] block drbd6: EXIT: put_ldev is called from drbd_md_endio, 
count=1 ds=3 thread=swapper[0]
[ 8835.607927] block drbd6: EXIT: put_ldev is called from drbd_md_sync, 
count=0 ds=3 thread=drbdsetup-84[11848]
[ 8835.608016] block drbd6: ASSERT( i >= 0 ) in 
drivers/block/drbd/drbd_int.h:2287
[ 8835.608030] Kernel panic - not syncing: drbd_assert_breakpoint
[ 8835.608036] Pid: 11848, comm: drbdsetup-84 Not tainted 2.6.32.59+ #62
[ 8835.608038] Call Trace:
[ 8835.608127]  [<ffffffff8160bc03>] panic+0xfc/0x1c2
[ 8835.609436]  [<ffffffffa01baf04>] drbd_assert_breakpoint+0xa4/0xb0 [drbd]
[ 8835.609449]  [<ffffffffa01e4398>] __put_ldev+0x198/0x1c0 [drbd]
[ 8835.609458]  [<ffffffffa01f1d4d>] drbd_adm_attach+0xf1d/0x10f0 [drbd]
[ 8835.609476]  [<ffffffff812f451f>] ? nla_parse+0xef/0x110
[ 8835.609489]  [<ffffffff8151a186>] genl_rcv_msg+0x1e6/0x220
[ 8835.609495]  [<ffffffff81519fa0>] ? genl_rcv_msg+0x0/0x220
[ 8835.609498]  [<ffffffff81516fd9>] netlink_rcv_skb+0xa9/0xd0
[ 8835.609502]  [<ffffffff815189bc>] genl_rcv+0x2c/0x40
[ 8835.609506]  [<ffffffff81516d23>] netlink_unicast+0x2b3/0x2c0
[ 8835.609510]  [<ffffffff81517e09>] netlink_sendmsg+0x269/0x380
[ 8835.609520]  [<ffffffff81196b5a>] ? __blkdev_get+0x1da/0x410
[ 8835.609529]  [<ffffffff814e0644>] sock_aio_write+0x124/0x1a0
[ 8835.609536]  [<ffffffff812db6e7>] ? kobject_put+0x27/0x60
[ 8835.609549]  [<ffffffff81165d7a>] do_sync_write+0xfa/0x140
[ 8835.609563]  [<ffffffff810b5ce0>] ? autoremove_wake_function+0x0/0x40
[ 8835.609567]  [<ffffffff81167caf>] ? __fput+0x19f/0x210
[ 8835.609572]  [<ffffffff81166124>] vfs_write+0x184/0x1a0
[ 8835.609575]  [<ffffffff81167525>] ? fget_light+0x15/0xc0
[ 8835.609578]  [<ffffffff811669fc>] sys_write+0x5c/0xf0
[ 8835.609592]  [<ffffffff81073a70>] sysenter_dispatch+0x7/0x2e
--------------------------------------------------------------------
The drbd_md_endio() routine release the ldev reference even called from 
drbd_md_read(). Which should not happen as ldev reference is not taken 
in drbd_md_read() routine. The drbd_md_endio() calls 
drbd_md_put_buffer() which release the mdio usage count (i.e. 
mdio.in_use) and enables the drbd_adm_attach() to proceed further.
--------------------------------------------------------------------
BIO_ENDIO_TYPE drbd_md_endio BIO_ENDIO_ARGS(struct bio *bio, int error)
{
         ...
         drbd_md_put_buffer(device);
         device->md_io.done = 1;
         wake_up(&device->misc_wait);
         bio_put(bio);
         if (device->ldev) /* special case: drbd_md_read() during 
drbd_adm_attach() */
                 put_ldev(device);
         BIO_ENDIO_FN_RETURN;
}
--------------------------------------------------------------------
So when drbd_md_endio() calls the put_ldev(), it has valid 
'device->ldev' reference and it decrement the 'local_cnt' by calling the 
put_ldev(). Which brings the inconsistancy in the 'local_cnt'. And 
latter when put_ldev() called from drbd_adm_attach() triggers the ASSERT.
To make sure the put_ldev() is not getting called when called from the 
drbd_md_read(), I have added the below check
--------------------------------------------------------------------
diff --git a/drivers/block/drbd/drbd_worker.c
b/drivers/block/drbd/drbd_worker.c
index fc9d266e9b40..0292bf16f4df 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -82,8 +82,12 @@ BIO_ENDIO_TYPE drbd_md_endio BIO_ENDIO_ARGS(struct 
bio *bio,
int error)
         device->md_io.done = 1;
         wake_up(&device->misc_wait);
         bio_put(bio);
-       if (device->ldev) /* special case: drbd_md_read() during 
drbd_adm_attach() */
+       /* special case: don't call put_ldev when endio initiated
+          from drbd_md_read() during drbd_adm_attach() */
+       if (device->ldev && 
strcmp(device->md_io.current_use,"drbd_md_read")) {
+               drbd_info(device, "%s: meta-data is used by 
%s\n",__func__,device->md_io.current_use);
                 put_ldev(device);
+       }
         BIO_ENDIO_FN_RETURN;
  }
--------------------------------------------------------------------
After the above changes the ASSERT is not getting triggered.
Need you help in figure out the possible regression due to above changes.
Or is there any other way to fix this issue ?
Regards
Sunil Kumar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_drbd.sh
Type: application/x-shellscript
Size: 1117 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20170223/fd3bea57/attachment.bin>