Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On 09.01.2013 04:51, Paul Freeman wrote:
> Analysis:
> 1. In both connected and disconnected cases, the write sizes used by DRBD are 128KiB, so I think this is OK. At least they are not 4KiB, which would indicate the IO problem you mention.
>
> 2. In the connected case there is a significant delay after the last IO completion action (C) before the next dd IO occurs, i.e. 0.007686071 seconds for connected vs 0.000184373 seconds for disconnected.
> This behaviour is reproducible and is seen after every sequence of IO completion actions for the entire run of dd.
>
> From my understanding of the blktrace data, the IO completion action occurs every 1MB.
>
> My hunch is that this delay is caused by network latency, i.e. transferring the data to the network layer and then across to the secondary DRBD node via the bonded interface. Also, given Protocol C is being used, I understand IO blocking occurs until the data is actually written to disc (or controller BBU cache), so perhaps the secondary is responsible in this case?

Yes, that's all correct. Now you can be sure that there is a latency issue.

> If my analysis is correct then DRBD is not responsible for the delays.

Perhaps, perhaps it is. It depends on what DRBD is doing in the make_request kernel function (which DRBD features are activated, etc.). This is the time/latency-critical path.

> Does this sound sensible and reasonable? If so, how can I confirm this is the case? Is there a tool like blktrace for use on network IO?

Hmm, tcpdump could help. You get a timestamp for the outgoing packets and for the incoming completions. That includes the latency of DRBD and of the storage on the receiver side, but you don't see the latency of DRBD on the sender side. So with that you could at least tell whether there is a bigger issue before the network layer.

You could also try ping and iSCSI in order to collect some latency information on the connection to the other node.

Modern NICs all use offloading features, and these can be performance-relevant. You can see the offloading settings with "ethtool -k ethX". The TCP/IP stack also has a bigger latency impact.

But remember: you've got the transfer latency twice, as data travels to the other node and the completions come back. Until the completions are received, the network is idle. This time can be used for further parallel connections.

The iSCSI connection would have the same behavior, but with a low-latency make_request function. And since you use iSCSI anyway to connect to the primary, remember that you have TWO CHAINED network paths for storage writes here and only a single one for reads. I guess this is the biggest latency issue here, and it tells you NOT TO USE DRBD AT ALL!

Try this one: use iSCSI for both storage servers and put MD RAID-1 on top. Activate the write-intent bitmap so that a connection loss doesn't result in a full resync. MD RAID-1 has intelligent read balancing. So this is like Primary/Primary, and you have TWO PARALLEL network paths with the same latency each!

Cheers,
Sebastian
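
For illustration, the MD RAID-1 over iSCSI setup described above might be set up roughly as follows. The target IQNs, portal addresses and device names are placeholders, not taken from this thread:

    # Log in to the iSCSI targets exported by the two storage servers
    # (IQNs and portal IPs are examples only)
    iscsiadm -m node -T iqn.2013-01.example:storage1 -p 192.168.1.11 --login
    iscsiadm -m node -T iqn.2013-01.example:storage2 -p 192.168.1.12 --login

    # Mirror the two iSCSI block devices with an internal write-intent
    # bitmap, so a lost connection only requires a partial resync
    mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/sdb /dev/sdc

With this layout, writes go out over both network paths in parallel, while MD's read balancing can serve reads from either leg, which matches the behaviour Sebastian describes.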