[DRBD-user] A question about drbd hang in conn_disconnect

Fang Sun sunf2002 at gmail.com
Thu Apr 16 00:54:25 CEST 2015

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Dear drbd developers,

Sorry, it's me again. I am really trying to fix my problem and I hope I can
get some help from you.

Let me describe the problem again.
I have two devices. One is UpToDate on both nodes; the other is
Inconsistent on the standby node and is still syncing. Now I shut down
the network on the standby. Because I configured fencing to
resource-and-stonith, I expect the primary node to call my fencing
handler.
But on the primary, conn_disconnect() hangs forever.
From my debugging, drbd_disconnected() is stuck at "wait_event(device->misc_wait,
!test_bit(BITMAP_IO, &device->flags));", but no matter how long I wait,
w_bitmap_io() is not called and BITMAP_IO is never cleared.
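
For context: wait_event() sleeps uninterruptibly, which is why the
thread shows up in D state. Roughly, the macro has this shape (a
simplified sketch, not the exact include/linux/wait.h text):

#define wait_event(wq, condition)                                    \
do {                                                                 \
        DEFINE_WAIT(__wait);                                         \
        for (;;) {                                                   \
                prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE); \
                if (condition)  /* re-checked on every wakeup */     \
                        break;                                       \
                schedule();     /* uninterruptible sleep: D state */ \
        }                                                            \
        finish_wait(&wq, &__wait);                                   \
} while (0)

So unless something clears BITMAP_IO and wakes device->misc_wait, the
receiver thread sleeps there forever. That matches the prepare_to_wait
frame in the stack trace further down.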

Instead of the change I suggested previously, I tried another change,
which fixed my problem.

drbd_disconnected()
{
        ...
        /* original: */
        if (!drbd_suspended(device))
                tl_clear(peer_device->connection);

        /* I changed these two lines to: */
        if (!drbd_suspended(device) || test_bit(BITMAP_IO, &device->flags))
                tl_clear(peer_device->connection);
        ...
        wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
        ...
}
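
The same change expressed as a patch hunk (against my 8.4 tree, where
drbd_disconnected() lives in drbd_receiver.c; treat the file name and
hunk context as approximate):

--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ drbd_disconnected() @@
-	if (!drbd_suspended(device))
-		tl_clear(peer_device->connection);
+	/* also abort the transfer log while suspended if bitmap I/O
+	 * is still pending, so BITMAP_IO can eventually be cleared */
+	if (!drbd_suspended(device) || test_bit(BITMAP_IO, &device->flags))
+		tl_clear(peer_device->connection);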

Don't you think that if the device is suspended while requests are
still in flight, those requests can hang there and BITMAP_IO will never
be cleared? That is what is happening on my DRBD.
Or do you think there is something wrong with my configuration or my
operations?
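
To illustrate the pattern I mean, here is a minimal userspace analogue
(pthreads, not DRBD code; all names here are made up for the example):
one thread waits on a flag, and the thread that would clear the flag
skips the clearing step while "suspended", so the waiter never wakes.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  flag_cleared = PTHREAD_COND_INITIALIZER;
static bool bitmap_io = true;    /* stands in for the BITMAP_IO bit */
static bool suspended = true;    /* stands in for drbd_suspended()  */

/* disconnect side: blocks until bitmap_io is cleared, like
 * wait_event(device->misc_wait, !test_bit(BITMAP_IO, ...)) */
static void *disconnect_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (bitmap_io)
                pthread_cond_wait(&flag_cleared, &lock);
        pthread_mutex_unlock(&lock);
        puts("disconnect completed");
        return NULL;
}

/* completion side: would clear the flag, but the clearing step is
 * skipped entirely while suspended -- the waiter never wakes */
static void *request_completion(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        if (!suspended) {        /* never true here: deadlock */
                bitmap_io = false;
                pthread_cond_signal(&flag_cleared);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, disconnect_path, NULL);
        pthread_create(&b, NULL, request_completion, NULL);
        pthread_join(&b, NULL);
        pthread_join(&a, NULL);  /* hangs forever, like drbd_r_cic */
        return 0;
}

Compile with "gcc -pthread" and it hangs at the final join, the same
way drbd_r_cic hangs in conn_disconnect.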

Thanks again for any help or suggestions.

Fang




On Thu, Apr 9, 2015 at 12:56 PM, Fang Sun <sunf2002 at gmail.com> wrote:

> Hi drbd team,
>
> I have run into a DRBD problem recently, and I hope I can get some
> help from you.
>
> This problem can be reproduced with 8.4.4, 8.4.5, and 8.4.6.
>
> I have a two-node cluster with two disks. One disk is UpToDate and the
> other is syncing.
> I cut the network on the standby while one disk is syncing.
> I configured fencing=resource-and-stonith (a config sketch follows
> below), and I expect my DRBD fence-peer handler to be called when the
> network goes down. This always works as expected if both disks are
> UpToDate when I shut down the network on the standby.
> But when one disk is syncing, this causes DRBD to suspend both disks,
> and the fencing handler isn't called. I can also see the drbd receiver
> process go into the D state.
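>
> For reference, the fencing setup looks roughly like this in my
> drbd.conf (a sketch with placeholder resource name and handler
> scripts, not my exact configuration):
>
> resource r0 {
>         disk {
>                 fencing resource-and-stonith;
>         }
>         handlers {
>                 fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                 after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>         }
> }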
>
> Some logs on primary:
>
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: PingAck did not arrive in time.
>
> Apr  6 16:53:49 shrvm219 kernel: block drbd1: conn( SyncSource ->
> NetworkFailure )
>
> Apr  6 16:53:49 shrvm219 kernel: block drbd2: conn( Connected ->
> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: peer( Secondary -> Unknown )
> susp( 0 -> 1 )
>
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: susp( 0 -> 1 )
>
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: asender terminated
>
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: Terminating drbd_a_cic
>
> >>> Note: there is no "Connection closed" message here.
>
> Apr  6 16:57:28 shrvm220 kernel: INFO: task xfsalloc/0:805 blocked for
> more than 120 seconds.
>
>
>
> Apr  3 20:57:36 shrvm220 kernel: INFO: task xfsaild/drbd2:16778 blocked
> for more than 120 seconds.
>
>
> [root at shrvm219 ~]# ps -ax | grep drbd
>
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.8/FAQ
>
> 5683 ?        S      0:00 [drbd-reissue]
>
> 11386 pts/0    S+     0:00 grep drbd
>
> 12098 ?        S      0:00 [drbd_submit]
>
> 12103 ?        S      0:00 [drbd_submit]
>
> 12118 ?        S      0:02 [drbd_w_cic]
>
> 12139 ?        D      0:03 [drbd_r_cic]
>
> 12847 ?        S      0:00 [xfsbufd/drbd2]
>
> 12848 ?        S      0:00 [xfs-cil/drbd2]
>
> 12849 ?        D      0:00 [xfssyncd/drbd2]
>
> 12850 ?        S      0:02 [xfsaild/drbd2]
>
> 13359 ?        S      0:00 [xfsbufd/drbd1]
>
> 13360 ?        S      0:00 [xfs-cil/drbd1]
>
> 13361 ?        D      0:00 [xfssyncd/drbd1]
>
> 13362 ?        S      0:02 [xfsaild/drbd1]
>
>
> This is part of the kernel stack; as I suspected, drbd is stuck in
> conn_disconnect().
>
>
>
> Apr  7 19:34:08 shrvm219 kernel: SysRq : Show Blocked State
>
> Apr  7 19:34:08 shrvm219 kernel:  task                        PC stack
> pid father
>
> Apr  7 19:34:08 shrvm219 kernel: drbd_r_cic    D 0000000000000000     0
> 12139      2 0x00000084
>
> Apr  7 19:34:08 shrvm219 kernel: ffff88023886fd90 0000000000000046
> 0000000000000000 ffff88023886fd54
>
> Apr  7 19:34:08 shrvm219 kernel: ffff88023886fd20 ffff88023fc23040
> 000002b2f338c2ce ffff8800283158c0
>
> Apr  7 19:34:08 shrvm219 kernel: 00000000000005ff 000000010028bb73
> ffff88023ab87058 ffff88023886ffd8
>
> Apr  7 19:34:08 shrvm219 kernel: Call Trace:
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109ee2e>] ?
> prepare_to_wait+0x4e/0x80
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0394d4d>]
> conn_disconnect+0x22d/0x4f0 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ?
> autoremove_wake_function+0x0/0x40
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0395120>]
> drbd_receiver+0x110/0x220 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa03a69e0>] ?
> drbd_thread_setup+0x0/0x110 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa03a6a0d>]
> drbd_thread_setup+0x2d/0x110 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa03a69e0>] ?
> drbd_thread_setup+0x0/0x110 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
>
> Apr  7 19:34:08 shrvm219 kernel: xfssyncd/drbd D 0000000000000001     0
> 12849      2 0x00000080
>
> Apr  7 19:34:08 shrvm219 kernel: ffff880203387ad0 0000000000000046
> 0000000000000000 ffff880203387a94
>
> Apr  7 19:34:08 shrvm219 kernel: ffff880203387a30 ffff88023fc23040
> 000002b60ef7decd ffff8800283158c0
>
> Apr  7 19:34:08 shrvm219 kernel: 0000000000000400 000000010028ef9d
> ffff88023921bab8 ffff880203387fd8
>
> Apr  7 19:34:08 shrvm219 kernel: Call Trace:
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039c947>]
> drbd_make_request+0x197/0x330 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ?
> autoremove_wake_function+0x0/0x40
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270810>]
> generic_make_request+0x240/0x5a0
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039cbe9>] ?
> drbd_merge_bvec+0x109/0x2a0 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270be0>] submit_bio+0x70/0x120
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81064b90>] ?
> default_wake_function+0x0/0x20
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0208bba>]
> _xfs_buf_ioapply+0x16a/0x200 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01ef50a>] ?
> xlog_bdstrat+0x2a/0x60 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa020a87f>]
> xfs_buf_iorequest+0x4f/0xe0 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01ef50a>]
> xlog_bdstrat+0x2a/0x60 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f0ce9>]
> xlog_sync+0x269/0x3e0 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f0f13>]
> xlog_state_release_iclog+0xb3/0xf0 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f13a2>]
> _xfs_log_force+0x122/0x240 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f1688>]
> xfs_log_force+0x38/0x90 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0214a02>]
> xfs_sync_worker+0x52/0xa0 [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa021491e>] xfssyncd+0x17e/0x210
> [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa02147a0>] ? xfssyncd+0x0/0x210
> [xfs]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
>
> Apr  7 19:34:08 shrvm219 kernel: xfssyncd/drbd D 0000000000000001     0
> 13361      2 0x00000080
>
> Apr  7 19:34:08 shrvm219 kernel: ffff8802033edad0 0000000000000046
> 0000000000000000 ffff8802033eda94
>
> Apr  7 19:34:08 shrvm219 kernel: 0000000000000000 ffff88023fc23040
> 000002b60ef68ea3 ffff8800283158c0
>
> Apr  7 19:34:08 shrvm219 kernel: 00000000000007fe 000000010028ef9d
> ffff880239c41058 ffff8802033edfd8
>
> Apr  7 19:34:08 shrvm219 kernel: Call Trace:
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039c947>]
> drbd_make_request+0x197/0x330 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ?
> autoremove_wake_function+0x0/0x40
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270810>]
> generic_make_request+0x240/0x5a0
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039cbe9>] ?
> drbd_merge_bvec+0x109/0x2a0 [drbd]
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270be0>] submit_bio+0x70/0x120
>
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81064b90>] ?
> default_wake_function+0x0/0x20
>
>
> Thanks a lot in advance.
>
>
> Fang
>