Dear drbd developers,

Sorry, it's me again. I am still trying to fix my problem and I hope I can get some help from you. Let me describe the problem once more.

I have two devices. One of them is UpToDate on both nodes; the other is Inconsistent on the standby node and is syncing. Now I shut down the network on the standby node. Because I have configured fencing as resource-and-stonith, I expect the primary node to call my fencing handler. But on the primary, conn_disconnect() hangs forever. From my investigation, drbd_disconnected() is stuck at

    wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));

and no matter how long I wait, w_bitmap_io is never called and BITMAP_IO is never cleared.

Instead of the change I suggested previously, I tried another change, which fixed my problem. In drbd_disconnected() there is:

    drbd_disconnected()
    {
        ...
        if (!drbd_suspended(device))
            tl_clear(peer_device->connection);
        ...
        wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
        ...
    }

I changed those two lines to:

    if (!drbd_suspended(device) || test_bit(BITMAP_IO, &device->flags))
        tl_clear(peer_device->connection);

Don't you think that if the device is suspended while there are requests in flight, those requests can hang there and BITMAP_IO will never be cleared? That is what is happening on my drbd. Or do you think there is something wrong with my configuration or operations?
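To spell out the sequence I think I am seeing, here is the same region of drbd_disconnected() once more, heavily trimmed and from memory (so the exact placement of the lines may be off); the comments are only my interpretation, not anything taken from the drbd sources:

    /* inside drbd_disconnected(), 8.4.x, heavily trimmed */

    /* susp went 0 -> 1 when the peer was lost (fencing resource-and-stonith),
     * so drbd_suspended(device) is true here and tl_clear() is skipped.
     * My guess: the pending requests therefore stay in the transfer log. */
    if (!drbd_suspended(device))
            tl_clear(peer_device->connection);

    ...

    /* A bitmap IO is still pending for the device that was SyncSource.
     * With the requests above never cleaned up, w_bitmap_io never runs,
     * BITMAP_IO is never cleared, and drbd_r_cic sits here in D state,
     * so conn_disconnect() never finishes and, as far as I can tell,
     * my fence-peer handler is never called either. */
    wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));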
Thanks again for any help or suggestions.

Fang

On Thu, Apr 9, 2015 at 12:56 PM, Fang Sun <sunf2002 at gmail.com> wrote:
> Hi drbd team,
>
> I am running into a drbd problem recently and I hope I can get some help
> from you.
>
> This problem can be reproduced in 8.4.4, 8.4.5 and 8.4.6.
>
> I have a 2 nodes cluster. There are 2 disks. One disk is upToDate and the
> other is syncing.
> I cut the network on standby when one disk is syncing.
> I configured fencing=resource-and-stonith, and I expect my drbd fencing is
> called when network is shutdown. This always works as expected if both disk
> are UpToDate when I shutdown network on standby.
> But when one disk is syncing this caused the drbd to suspend both disks
> and drbd fencing isn't called. And I can see drbd read process is put into
> D state.
>
> Some logs on primary:
>
> Apr 6 16:53:49 shrvm219 kernel: drbd cic: PingAck did not arrive in time.
> Apr 6 16:53:49 shrvm219 kernel: block drbd1: conn( SyncSource -> NetworkFailure )
> Apr 6 16:53:49 shrvm219 kernel: block drbd2: conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr 6 16:53:49 shrvm219 kernel: drbd cic: peer( Secondary -> Unknown ) susp( 0 -> 1 )
> Apr 6 16:53:49 shrvm219 kernel: drbd cic: susp( 0 -> 1 )
> Apr 6 16:53:49 shrvm219 kernel: drbd cic: asender terminated
> Apr 6 16:53:49 shrvm219 kernel: drbd cic: Terminating drbd_a_cic
>
> >>> There isn't Connection closed
>
> Apr 6 16:57:28 shrvm220 kernel: INFO: task xfsalloc/0:805 blocked for more than 120 seconds.
>
> Apr 3 20:57:36 shrvm220 kernel: INFO: task xfsaild/drbd2:16778 blocked for more than 120 seconds.
>
> [root at shrvm219 ~]# ps -ax | grep drbd
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
>  5683 ?        S      0:00 [drbd-reissue]
> 11386 pts/0    S+     0:00 grep drbd
> 12098 ?        S      0:00 [drbd_submit]
> 12103 ?        S      0:00 [drbd_submit]
> 12118 ?        S      0:02 [drbd_w_cic]
> 12139 ?        D      0:03 [drbd_r_cic]
> 12847 ?        S      0:00 [xfsbufd/drbd2]
> 12848 ?        S      0:00 [xfs-cil/drbd2]
> 12849 ?        D      0:00 [xfssyncd/drbd2]
> 12850 ?        S      0:02 [xfsaild/drbd2]
> 13359 ?        S      0:00 [xfsbufd/drbd1]
> 13360 ?        S      0:00 [xfs-cil/drbd1]
> 13361 ?        D      0:00 [xfssyncd/drbd1]
> 13362 ?        S      0:02 [xfsaild/drbd1]
>
> This is part of kernel stack, as I thought drbd is stuck at conn_disconnect.
>
> Apr 7 19:34:08 shrvm219 kernel: SysRq : Show Blocked State
> Apr 7 19:34:08 shrvm219 kernel:   task                        PC stack   pid father
> Apr 7 19:34:08 shrvm219 kernel: drbd_r_cic    D 0000000000000000     0 12139      2 0x00000084
> Apr 7 19:34:08 shrvm219 kernel: ffff88023886fd90 0000000000000046 0000000000000000 ffff88023886fd54
> Apr 7 19:34:08 shrvm219 kernel: ffff88023886fd20 ffff88023fc23040 000002b2f338c2ce ffff8800283158c0
> Apr 7 19:34:08 shrvm219 kernel: 00000000000005ff 000000010028bb73 ffff88023ab87058 ffff88023886ffd8
> Apr 7 19:34:08 shrvm219 kernel: Call Trace:
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109ee2e>] ? prepare_to_wait+0x4e/0x80
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa0394d4d>] conn_disconnect+0x22d/0x4f0 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa0395120>] drbd_receiver+0x110/0x220 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa03a69e0>] ? drbd_thread_setup+0x0/0x110 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa03a6a0d>] drbd_thread_setup+0x2d/0x110 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa03a69e0>] ? drbd_thread_setup+0x0/0x110 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
> Apr 7 19:34:08 shrvm219 kernel: xfssyncd/drbd D 0000000000000001     0 12849      2 0x00000080
> Apr 7 19:34:08 shrvm219 kernel: ffff880203387ad0 0000000000000046 0000000000000000 ffff880203387a94
> Apr 7 19:34:08 shrvm219 kernel: ffff880203387a30 ffff88023fc23040 000002b60ef7decd ffff8800283158c0
> Apr 7 19:34:08 shrvm219 kernel: 0000000000000400 000000010028ef9d ffff88023921bab8 ffff880203387fd8
> Apr 7 19:34:08 shrvm219 kernel: Call Trace:
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa039c947>] drbd_make_request+0x197/0x330 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff81270810>] generic_make_request+0x240/0x5a0
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa039cbe9>] ? drbd_merge_bvec+0x109/0x2a0 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff81270be0>] submit_bio+0x70/0x120
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa0208bba>] _xfs_buf_ioapply+0x16a/0x200 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa01ef50a>] ? xlog_bdstrat+0x2a/0x60 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa020a87f>] xfs_buf_iorequest+0x4f/0xe0 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa01ef50a>] xlog_bdstrat+0x2a/0x60 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa01f0ce9>] xlog_sync+0x269/0x3e0 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa01f0f13>] xlog_state_release_iclog+0xb3/0xf0 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa01f13a2>] _xfs_log_force+0x122/0x240 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa01f1688>] xfs_log_force+0x38/0x90 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa0214a02>] xfs_sync_worker+0x52/0xa0 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa021491e>] xfssyncd+0x17e/0x210 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa02147a0>] ? xfssyncd+0x0/0x210 [xfs]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
> Apr 7 19:34:08 shrvm219 kernel: xfssyncd/drbd D 0000000000000001     0 13361      2 0x00000080
> Apr 7 19:34:08 shrvm219 kernel: ffff8802033edad0 0000000000000046 0000000000000000 ffff8802033eda94
> Apr 7 19:34:08 shrvm219 kernel: 0000000000000000 ffff88023fc23040 000002b60ef68ea3 ffff8800283158c0
> Apr 7 19:34:08 shrvm219 kernel: 00000000000007fe 000000010028ef9d ffff880239c41058 ffff8802033edfd8
> Apr 7 19:34:08 shrvm219 kernel: Call Trace:
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa039c947>] drbd_make_request+0x197/0x330 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff81270810>] generic_make_request+0x240/0x5a0
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffffa039cbe9>] ? drbd_merge_bvec+0x109/0x2a0 [drbd]
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff81270be0>] submit_bio+0x70/0x120
> Apr 7 19:34:08 shrvm219 kernel: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
>
> Thanks a lot in advance.
>
> Fang