[DRBD-user] High iowait on primary DRBD node with large sustained writes and replication enabled to secondary

Thu Jan 10 21:28:57 CET 2013

Sebastian,
Thank you for your analysis.

I will use tcpdump to follow the network traffic during replication between the nodes and see if I can work out whether it is the network causing the latency.
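As a rough sketch of how the capture could be post-processed: the snippet below assumes the capture is taken with `tcpdump -tt -n` on the replication interface, so each line starts with an epoch timestamp. The IP addresses, ports and sample lines are placeholders, not values from these servers. It also assumes that incoming packets carrying payload are DRBD's acknowledgements, while zero-length packets are bare TCP ACKs and are ignored.

```python
import re

# Matches lines such as:
# 1357849737.000100 IP 10.0.0.1.45000 > 10.0.0.2.7788: Flags [P.], ..., length 1448
LINE_RE = re.compile(
    r"^(?P<ts>\d+\.\d+) IP (?P<src>\d+\.\d+\.\d+\.\d+)\.\d+ > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.\d+: .*length (?P<len>\d+)"
)


def reply_gaps(lines, primary_ip):
    """Return the gaps in seconds between the last data packet sent by
    primary_ip and the next payload-carrying packet from the peer."""
    gaps = []
    last_sent = None  # timestamp of the most recent outgoing data packet
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that is not an IPv4 packet line
        if int(m.group("len")) == 0:
            continue  # bare TCP ACKs carry no DRBD data or acknowledgement
        ts = float(m.group("ts"))
        if m.group("src") == primary_ip:
            last_sent = ts
        elif last_sent is not None:
            gaps.append(ts - last_sent)
            last_sent = None  # wait for the next outgoing burst
    return gaps


# Placeholder capture lines, for illustration only.
sample = [
    "1357849737.000100 IP 10.0.0.1.45000 > 10.0.0.2.7788: Flags [P.], seq 1:1449, ack 1, win 229, length 1448",
    "1357849737.000150 IP 10.0.0.1.45000 > 10.0.0.2.7788: Flags [P.], seq 1449:2897, ack 1, win 229, length 1448",
    "1357849737.000300 IP 10.0.0.2.7788 > 10.0.0.1.45000: Flags [.], ack 2897, win 240, length 0",
    "1357849737.007650 IP 10.0.0.2.45001 > 10.0.0.1.7788: Flags [P.], seq 1:17, ack 1, win 229, length 16",
]
gaps = reply_gaps(sample, "10.0.0.1")
```

If the gaps measured this way are close to the delay blktrace shows on the primary, the time is being spent on the wire or on the secondary; if they are much smaller, the delay sits before the network layer.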

At this stage, given that these servers are in production, I am limited in what I can change.

At least I have a workaround.  During a restore I can disconnect the resource the restore is writing to, reconnect it once the restore is complete, and then let the syncer do its work.

This will keep the iowait low and will not interrupt the virtual machines on that resource.
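A minimal sketch of that workaround as a wrapper, assuming the standard `drbdadm disconnect` and `drbdadm connect` subcommands. The resource name `r0` and the `run_restore()` call in the usage comment are hypothetical. The `run` parameter exists only so the command sequence can be checked without touching a real resource.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def replication_paused(resource, run=subprocess.check_call):
    """Disconnect a DRBD resource for the duration of the block, then reconnect."""
    run(["drbdadm", "disconnect", resource])
    try:
        yield
    finally:
        # Reconnect even if the restore fails, so the resync can start.
        run(["drbdadm", "connect", resource])


# Usage (hypothetical names):
#     with replication_paused("r0"):
#         run_restore()
```

Reconnecting in a `finally` block means a failed restore does not leave the resource disconnected from the secondary.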

I would like to know the root cause(s) of the latency so I will keep chipping away at it.
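To put a number on it, here is a small calculation using the two gaps from the blktrace analysis quoted below. It assumes the "1MB" between IO completion batches means 1 MiB and that the gap is pure stall time, so the result is only an upper bound, not a measurement.

```python
gap_connected = 0.007686071     # seconds, from the blktrace analysis quoted below
gap_disconnected = 0.000184373  # seconds, same source

# Extra stall per batch that appears only while the peer is connected.
extra_stall = gap_connected - gap_disconnected

# Upper bound on sequential write throughput if every 1 MiB batch is
# followed by the connected-case gap (ignores the time of the writes themselves).
ceiling_mib_per_s = 1 / gap_connected

print(f"extra stall per batch: {extra_stall * 1000:.3f} ms")
print(f"throughput ceiling:    {ceiling_mib_per_s:.1f} MiB/s")
```

Under those assumptions the connected case adds roughly 7.5 ms per batch, which caps sustained writes at about 130 MiB/s before any time for the writes themselves is counted.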

Regards

Paul

> -----Original Message-----
> From: Sebastian Riemer [mailto:sebastian.riemer at profitbricks.com]
> Sent: Wednesday, 9 January 2013 9:35 PM
> To: Paul Freeman
> Cc: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> sustained writes and replication enabled to secondary
> 
> On 09.01.2013 04:51, Paul Freeman wrote:
> ...
> > Analysis:
> > 1. In both connected and disconnected cases, the write sizes used by
> > DRBD are 128KiB so I think this is OK.  At least they are not 4KiB which
> > would indicate the IO problem you mention.
> >
> > 2. In the connected case there is a significant delay after the last IO
> > completion action (C) before the next dd IO occurs, i.e. 0.007686071
> > seconds for connected vs 0.000184373 seconds for disconnected.
> > This behaviour is reproducible and is seen after every sequence of IO
> > completion actions for the entire run of dd.
> >
> > From my understanding of the blktrace data, the IO completion action
> > occurs every 1MB.
> >
> > My hunch is this delay is caused by network latency, i.e. transferring the
> > data to the network layer and then across to the secondary DRBD node via
> > the bonded interface.  Also, given Protocol C is being used, I understand
> > IO blocking occurs until the data is actually written to disc (or
> > controller BBU cache) so perhaps the secondary is responsible in this
> > case?
> 
> Yes, that's all correct. Now, you can be sure that there is a latency
> issue.
> 
> > If my analysis is correct then DRBD is not responsible for the delays.
> 
> Perhaps it isn't, perhaps it is. It depends on what DRBD is doing in the
> make_request kernel function (which DRBD features are activated, etc.).
> This is the latency-critical path.
> 
> > Does this sound sensible and reasonable?  If so, how can I confirm this
> > is the case?  Is there a tool like blktrace for use on network IO?
> 
> Hmm, tcpdump could help. You'll get a time stamp for the outgoing
> packets and the incoming completions. The time between them includes the
> latency of DRBD and the storage on the receiver side, but it does not
> include the latency of DRBD on the sender side. So with that you could
> see if there is a bigger issue before the network layer.
> 
> You could also try ping and iSCSI in order to collect some latency
> information for the connection to the other node. Modern NICs all use
> offloading features, and these can affect performance. You can see
> the offload settings with "ethtool -k ethX".
> 
> The TCP/IP stack also has a sizeable latency impact. But remember: you
> pay the transfer latency twice, as data travels to the other node and
> completions come back. Until the completions are received, the
> network is idle. This time can be used for further parallel connections.
> 
> The iSCSI connection would have the same behavior but with a low-latency
> make_request function.
> But as you use iSCSI anyway to connect to the primary, remember that you
> have TWO CHAINED network paths for storage writes here and only a single
> one for reads.
> 
> I guess this is the biggest latency issue here, and it tells you NOT TO USE
> DRBD AT ALL!
> 
> Try this one: Use iSCSI for both storage servers and put MD RAID-1 above.
> Activate the write-intent bitmap so that connection loss doesn't result
> in a full resync. MD RAID-1 has intelligent read balancing. So this is
> like Primary-Primary here and you have TWO PARALLEL network paths here
> with the same latency each!
> 
> Cheers,
> Sebastian