Hi drbd developers,

After some research and testing I believe I have found the cause of this
problem and a possible fix in drbd. Would you please check whether my
theory is correct?

Let me use 8.4.6 as the code base for this explanation.

When conn_disconnect() hangs, it is hanging at drbd_receiver.c:5178:

    static int drbd_disconnected(struct drbd_peer_device *peer_device)
    {
        ...
        wait_event(device->misc_wait, !test_bit(BITMAP_IO, &device->flags));
    }

It hangs because the device still has the BITMAP_IO flag set.

BITMAP_IO is set but never cleared for the following reason: the disk
state changes when the network is disconnected, and after_state_ch() is
called. At drbd_state.c line 1949, after_state_ch() calls
drbd_queue_bitmap_io().

I think the real problem is in drbd_queue_bitmap_io(), drbd_main.c line
3641:

    void drbd_queue_bitmap_io(struct drbd_device *device,
                              int (*io_fn)(struct drbd_device *),
                              void (*done)(struct drbd_device *, int),
                              char *why, enum bm_flag flags)
    {
        ...
        set_bit(BITMAP_IO, &device->flags);
        if (atomic_read(&device->ap_bio_cnt) == 0) {
            if (!test_and_set_bit(BITMAP_IO_QUEUED, &device->flags))
                drbd_queue_work(&first_peer_device(device)->connection->sender_work,
                                &device->bm_io_work.w);
        }
        ...
    }

The only code that clears BITMAP_IO is the queued work item,
device->bm_io_work.w (w_bitmap_io). But when
atomic_read(&device->ap_bio_cnt) != 0, BITMAP_IO is set while
bm_io_work.w is never queued, so drbd_disconnected() blocks forever.

Should we move set_bit(BITMAP_IO, &device->flags) down so that it is set
only immediately before the drbd_queue_work() call?

Thanks a lot!

Fang

On Thu, Apr 9, 2015 at 12:56 PM, Fang Sun <sunf2002 at gmail.com> wrote:
> Hi drbd team,
>
> I am running into a drbd problem recently and I hope I can get some help
> from you.
>
> This problem can be reproduced in 8.4.4, 8.4.5 and 8.4.6.
>
> I have a 2-node cluster with 2 disks. One disk is UpToDate and the other
> is syncing.
> I cut the network on the standby node while one disk is syncing.
> I configured fencing=resource-and-stonith, and I expect my drbd fence
> handler to be called when the network is shut down. This always works as
> expected if both disks are UpToDate when I shut down the network on the
> standby node.
> But when one disk is syncing, drbd suspends both disks and the fence
> handler is not called. I can see the drbd receiver process is put into
> D state.
>
> Some logs on the primary:
>
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: PingAck did not arrive in time.
> Apr  6 16:53:49 shrvm219 kernel: block drbd1: conn( SyncSource -> NetworkFailure )
> Apr  6 16:53:49 shrvm219 kernel: block drbd2: conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: peer( Secondary -> Unknown ) susp( 0 -> 1 )
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: susp( 0 -> 1 )
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: asender terminated
> Apr  6 16:53:49 shrvm219 kernel: drbd cic: Terminating drbd_a_cic
>
> >>> There is no "Connection closed" message.
>
> Apr  6 16:57:28 shrvm220 kernel: INFO: task xfsalloc/0:805 blocked for more than 120 seconds.
>
> Apr  3 20:57:36 shrvm220 kernel: INFO: task xfsaild/drbd2:16778 blocked for more than 120 seconds.
>
> [root at shrvm219 ~]# ps -ax | grep drbd
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
>  5683 ?        S      0:00 [drbd-reissue]
> 11386 pts/0    S+     0:00 grep drbd
> 12098 ?        S      0:00 [drbd_submit]
> 12103 ?        S      0:00 [drbd_submit]
> 12118 ?        S      0:02 [drbd_w_cic]
> 12139 ?        D      0:03 [drbd_r_cic]
> 12847 ?        S      0:00 [xfsbufd/drbd2]
> 12848 ?        S      0:00 [xfs-cil/drbd2]
> 12849 ?        D      0:00 [xfssyncd/drbd2]
> 12850 ?        S      0:02 [xfsaild/drbd2]
> 13359 ?        S      0:00 [xfsbufd/drbd1]
> 13360 ?        S      0:00 [xfs-cil/drbd1]
> 13361 ?        D      0:00 [xfssyncd/drbd1]
> 13362 ?        S      0:02 [xfsaild/drbd1]
>
> This is part of the kernel stack; as I thought, drbd is stuck at
> conn_disconnect():
> Apr  7 19:34:08 shrvm219 kernel: SysRq : Show Blocked State
> Apr  7 19:34:08 shrvm219 kernel:  task                        PC stack   pid father
> Apr  7 19:34:08 shrvm219 kernel: drbd_r_cic    D 0000000000000000     0 12139      2 0x00000084
> Apr  7 19:34:08 shrvm219 kernel: ffff88023886fd90 0000000000000046 0000000000000000 ffff88023886fd54
> Apr  7 19:34:08 shrvm219 kernel: ffff88023886fd20 ffff88023fc23040 000002b2f338c2ce ffff8800283158c0
> Apr  7 19:34:08 shrvm219 kernel: 00000000000005ff 000000010028bb73 ffff88023ab87058 ffff88023886ffd8
> Apr  7 19:34:08 shrvm219 kernel: Call Trace:
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109ee2e>] ? prepare_to_wait+0x4e/0x80
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0394d4d>] conn_disconnect+0x22d/0x4f0 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0395120>] drbd_receiver+0x110/0x220 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa03a69e0>] ? drbd_thread_setup+0x0/0x110 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa03a6a0d>] drbd_thread_setup+0x2d/0x110 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa03a69e0>] ? drbd_thread_setup+0x0/0x110 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
> Apr  7 19:34:08 shrvm219 kernel: xfssyncd/drbd D 0000000000000001     0 12849      2 0x00000080
> Apr  7 19:34:08 shrvm219 kernel: ffff880203387ad0 0000000000000046 0000000000000000 ffff880203387a94
> Apr  7 19:34:08 shrvm219 kernel: ffff880203387a30 ffff88023fc23040 000002b60ef7decd ffff8800283158c0
> Apr  7 19:34:08 shrvm219 kernel: 0000000000000400 000000010028ef9d ffff88023921bab8 ffff880203387fd8
> Apr  7 19:34:08 shrvm219 kernel: Call Trace:
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039c947>] drbd_make_request+0x197/0x330 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270810>] generic_make_request+0x240/0x5a0
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039cbe9>] ? drbd_merge_bvec+0x109/0x2a0 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270be0>] submit_bio+0x70/0x120
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0208bba>] _xfs_buf_ioapply+0x16a/0x200 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01ef50a>] ? xlog_bdstrat+0x2a/0x60 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa020a87f>] xfs_buf_iorequest+0x4f/0xe0 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01ef50a>] xlog_bdstrat+0x2a/0x60 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f0ce9>] xlog_sync+0x269/0x3e0 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f0f13>] xlog_state_release_iclog+0xb3/0xf0 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f13a2>] _xfs_log_force+0x122/0x240 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa01f1688>] xfs_log_force+0x38/0x90 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa0214a02>] xfs_sync_worker+0x52/0xa0 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa021491e>] xfssyncd+0x17e/0x210 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa02147a0>] ? xfssyncd+0x0/0x210 [xfs]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
> Apr  7 19:34:08 shrvm219 kernel: xfssyncd/drbd D 0000000000000001     0 13361      2 0x00000080
> Apr  7 19:34:08 shrvm219 kernel: ffff8802033edad0 0000000000000046 0000000000000000 ffff8802033eda94
> Apr  7 19:34:08 shrvm219 kernel: 0000000000000000 ffff88023fc23040 000002b60ef68ea3 ffff8800283158c0
> Apr  7 19:34:08 shrvm219 kernel: 00000000000007fe 000000010028ef9d ffff880239c41058 ffff8802033edfd8
> Apr  7 19:34:08 shrvm219 kernel: Call Trace:
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039c947>] drbd_make_request+0x197/0x330 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270810>] generic_make_request+0x240/0x5a0
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffffa039cbe9>] ? drbd_merge_bvec+0x109/0x2a0 [drbd]
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81270be0>] submit_bio+0x70/0x120
> Apr  7 19:34:08 shrvm219 kernel: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
>
> Thanks a lot in advance.
>
> Fang