[Drbd-dev] [CASE-29] After re-connect, WFBitMapS-WFBitMapT status has sustained continuously and copy command hangs

Thu Apr 21 15:57:36 CEST 2016

On Thu, Apr 21, 2016 at 10:48:11PM +0900, Jaeheon Kim wrote:
> Hi,
> 
> We wrote a temporary workaround to avoid the file copy hang.
> We inserted a wake_up call on the sender_work queue after
> _req_mod(req, QUEUE_FOR_SEND_OOS, peer_device).
> Please check the following code in the drbd_process_write_request function.

Please try to send unified diffs,
preferably generated with git diff.

> 
> drbd_process_write_request()
> {
>     ........
>
>     } else if (drbd_set_out_of_sync(peer_device, req->i.sector, req->i.size))
> #ifdef _WIN32_V9 // Windows DRBD
>     {
>         _req_mod(req, QUEUE_FOR_SEND_OOS, peer_device);
>         if (peer_device->repl_state[NOW] == L_WF_BITMAP_S)
>         {
>             wake_up(&peer_device->connection->sender_work.q_wait);
>         }
>     }
> #else
>         _req_mod(req, QUEUE_FOR_SEND_OOS, peer_device); // Linux Org
> #endif
> }
> 
> What do you think about this idea?

You are correct:
if all established replication links are "ahead",
and not a single link actually gets the data,
we may miss the wake-up of the sender.

A better fix is probably:

diff --git a/drbd/drbd_req.c b/drbd/drbd_req.c
index 3159de8..ae4bbd6 100644
--- a/drbd/drbd_req.c
+++ b/drbd/drbd_req.c
@@ -1666,8 +1666,7 @@ static void drbd_send_and_submit(struct drbd_device *device, struct drbd_request
 		}
 		if (!drbd_process_write_request(req))
 			no_remote = true;
-		else
-			wake_all_senders(resource);
+		wake_all_senders(resource);
 	} else {
 		if (peer_device) {
 			_req_mod(req, TO_BE_SENT, peer_device);

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support
