Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 09.01.2013 04:51, Paul Freeman wrote:
> Analysis:
> 1. In both connected and disconnected cases, the write sizes used by DRBD are 128KiB, so I think this is OK. At least they are not 4KiB, which would indicate the IO problem you mention.
>
> 2. In the connected case there is a significant delay after the last IO completion action (C) before the next dd IO occurs, i.e. 0.007686071 seconds for connected vs 0.000184373 seconds for disconnected.
> This behaviour is reproducible and is seen after every sequence of IO completion actions for the entire run of dd.
>
> From my understanding of the blktrace data, the IO completion action occurs every 1MB.
>
> My hunch is that this delay is caused by network latency, i.e. transferring the data to the network layer and then across to the secondary DRBD node via the bonded interface. Also, given Protocol C is being used, I understand IO blocking occurs until the data is actually written to disc (or controller BBU cache), so perhaps the secondary is responsible in this case?

Yes, that's all correct. Now you can be sure that there is a latency issue.

> If my analysis is correct then DRBD is not responsible for the delays.

Perhaps, perhaps it is. It depends on what DRBD is doing in the make_request kernel function (which DRBD features are activated, etc.). This is the time/latency-critical path.

> Does this sound sensible and reasonable? If so, how can I confirm this is the case? Is there a tool like blktrace for use on network IO?

Hmm, tcpdump could help. You get a timestamp for the outgoing packets and for the incoming completions. That includes the latency of DRBD and of the storage on the receiver side, but you don't see the latency of DRBD on the sender side. So with that you could at least tell whether there is a bigger issue before the network layer.

You could also try ping and iSCSI in order to collect some latency information on the connection to the other node.

Modern NICs all use offloading features, and these can be performance-relevant. You can see the offloading settings with "ethtool -k ethX". The TCP/IP stack also has a bigger latency impact.

But remember: you've got the transfer latency twice, as data travels to the other node and the completions come back. Until the completions are received, the network is idle. This time can be used for further parallel connections.

The iSCSI connection would have the same behavior, but with a low-latency make_request function. And since you use iSCSI anyway to connect to the primary, remember that you have TWO CHAINED network paths for storage writes here and only a single one for reads. I guess this is the biggest latency issue here, and it tells you NOT TO USE DRBD AT ALL!

Try this one: use iSCSI for both storage servers and put MD RAID-1 on top. Activate the write-intent bitmap so that a connection loss doesn't result in a full resync. MD RAID-1 has intelligent read balancing. So this is like Primary/Primary, and you have TWO PARALLEL network paths with the same latency each!

Cheers,
Sebastian
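
For illustration, the MD RAID-1 over iSCSI setup described above might be set up roughly as follows. The target IQNs, portal addresses and device names are placeholders, not taken from this thread:

    # Log in to the iSCSI targets exported by the two storage servers
    # (IQNs and portal IPs are examples only)
    iscsiadm -m node -T iqn.2013-01.example:storage1 -p 192.168.1.11 --login
    iscsiadm -m node -T iqn.2013-01.example:storage2 -p 192.168.1.12 --login

    # Mirror the two iSCSI block devices with an internal write-intent
    # bitmap, so a lost connection only requires a partial resync
    mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/sdb /dev/sdc

With this layout, writes go out over both network paths in parallel, while MD's read balancing can serve reads from either leg, which matches the behaviour Sebastian describes.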