[DRBD-user] High iowait on primary DRBD node with large sustained writes and replication enabled to secondary

Paul Freeman paul.freeman at emlair.com.au
Tue Jan 15 02:53:52 CET 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Sebastian,
Thanks for your analyses - I really appreciate it.  I have taken a little while to reply as I wanted to complete a dd from the iscsi peer to the primary iscsi node and run blktrace on the iscsi peer.

My comments are inline.

Regards

Paul
> -----Original Message-----
> From: Sebastian Riemer [mailto:sebastian.riemer at profitbricks.com]
> Sent: Friday, 11 January 2013 10:08 PM
> To: Paul Freeman
> Cc: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> sustained writes and replication enabled to secondary
> 
> On 11.01.2013 04:36, Paul Freeman wrote:
> > Sebastian,
> > I have collected some tcpdump logs on the internode link used for drbd
> while writing 100M to the primary drbd node using dd (as before) and with
> the relevant drbd resource connected.
> >
> > I am not very proficient at interpreting the tcpdump traffic so please
> bear with me:-)
> >
> > I ran tcpdump with the options "-vvv -s0 tcp port 7798".  Port 7798 is
> the port used by this drbd resource.  I subsequently used wireshark to
> identify the tcp stream associated with the replication traffic and set an
> appropriate filter to extract those packets and save them to a separate
> packet capture file.
> >
> > My first guess is the segment ACK RTT may be relevant to our
> investigation.  When I extract the segment ACKs from the secondary node
> using tshark and tcp.analysis.ack_rtt then process them in MS Excel, the
> results are as follows:
> >
> > Number of segment ACKs:	5255
> > Average RTT: 1.87ms
> > Minimum RTT: 0.078ms
> > Maximum RTT: 41.45ms (actually the last ACK)
> >
> > The time for dd to complete was 950ms.
> >
> > Does this provide any information on the latency problem?
> 
> Better would be writing locally on the primary. So that we only see the
> network latency of the inter-node connection and not of the iSCSI
> connection as well.
> 

The numbers I provided were for writing to the iscsitarget from the iscsitarget server itself, so they exclude the iscsi network latency.
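For reference, the RTT figures above were extracted along these lines (a sketch; the capture file name is an assumption):

# pull the per-segment ACK RTTs out of the filtered replication-stream capture
tshark -r drbd-stream.pcap -R "tcp.analysis.ack_rtt" -T fields -e tcp.analysis.ack_rtt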

> I had a deeper look at your blktrace output.
> 
> 1 MB from queuing the first 128 KiB chunk to queuing the 9th 128 KiB
> chunk on the DRBD device lasts:
> 
> discon  connec
> 1.33 ms 8.86 ms
> 
> Difference: 7.53 ms.
> 
> Also the latency between queuing (Q) the first IO on the DRBD device and
> dispatching (D) it on the SCSI disk below changed. This is the change of
> DRBD latency.
> 
> discon  connec
> 8.7 us  12.8 us
> 
> 4.1 us * 8 = 32.8 us. This is negligible.
> 
> So the time the packets need in the network layer should be something <
> 7.5 ms until 1 MB is transferred. This is the additional time which
> hurts you when you are connected with DRBD.
> 
> Please do a blktrace on the iSCSI client for disconnected and connected
> DRBD primary. Then, we'll see which of the two chained network
> connections is faster.
> 
> This is what we know for 1 MB, now:
> iSCSI path: ? ms
> SCSI/disk: 1.33 ms
> between DRBD nodes: 7.53 ms
> 
> Overall latency = iSCSI path + SCSI/disk + between DRBD nodes.
> 
> Cheers,
> Sebastian

I have performed some more testing using dd and blktrace from the iscsi peer, which is connected to the iscsi target (drbd primary node, the same target as in previous tests) via the LAN.  The link is a bonded network connection of two 1Gbps adaptors in an active-passive bond, so I would expect a maximum throughput of approximately 800-900Mbps.

The two iscsi/drbd nodes are at present connected by a similar network bonding arrangement, so again TCP throughput will be approximately 800-900Mbps.
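For what it is worth, the bonding mode and the currently active slave can be confirmed on each node with (bond interface name is an assumption):

# show bonding mode, active slave and the link state of each slave
cat /proc/net/bonding/bond0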

I tested with the drbd resource connected and disconnected.  The raw dd results were as follows.
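For reference, the traces below were captured roughly as follows (device names are assumptions; 253,12 is the LVM logical volume and 8,64 the SCSI disk beneath it):

# trace both the LVM device and the underlying SCSI disk, merging events on stdout
blktrace -d /dev/backup_01/vm-100-disk-1 -d /dev/sde -o - | blkparse -i -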

Resource connected:
root at sur-vm-02:~# dd if=/dev/zero of=/dev/backup_01/vm-100-disk-1 oflag=direct bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.983008 s, 107 MB/s

Resource disconnected:
root at sur-vm-02:~# dd if=/dev/zero of=/dev/backup_01/vm-100-disk-1 oflag=direct bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.980041 s, 107 MB/s

The blktrace data for the connected and disconnected states is presented in the same manner as in my earlier email.

dd and blktrace with resource connected:

253,12   6        1     0.000000000 949945  Q  WS 0 + 1024 [dd]
  8,64   6        1     0.000010264 949945  A  WS 2048 + 1024 <- (253,12) 0
  8,64   6        2     0.000012718 949945  Q  WS 2048 + 1024 [dd]
  8,64   6        3     0.000017482 949945  G  WS 2048 + 1024 [dd]
  8,64   6        4     0.000020597 949945  P   N [dd]
  8,64   6        5     0.000021889 949945  I  WS 2048 + 1024 [dd]
253,12   6        2     0.000053322 949945  Q  WS 1024 + 1024 [dd]
  8,64   6        6     0.000054656 949945  A  WS 3072 + 1024 <- (253,12) 1024
  8,64   6        7     0.000055258 949945  Q  WS 3072 + 1024 [dd]
  8,64   6        8     0.000056680 949945  G  WS 3072 + 1024 [dd]
  8,64   6        9     0.000057596 949945  I  WS 3072 + 1024 [dd]
253,12   6        3     0.000059072 949945  U   N [dd] 0
  8,64   6       10     0.000059508 949945  U   N [dd] 2
  8,64   6       11     0.000062659 949945  D  WS 2048 + 1024 [dd]
  8,64   6       12     0.000083516 949945  D  WS 3072 + 1024 [dd]
  8,64   4        1     0.008451596     0  C  WS 2048 + 1024 [0]
253,12   4        1     0.008455216     0  C  WS 0 + 1024 [0]
  8,64   4        2     0.010099424     0  C  WS 3072 + 1024 [0]
253,12   4        2     0.010101816     0  C  WS 1024 + 1024 [0]
253,12   4        3     0.010343363 949945  Q  WS 2048 + 1024 [dd]
  8,64   4        3     0.010345313 949945  A  WS 4096 + 1024 <- (253,12) 2048
  8,64   4        4     0.010346379 949945  Q  WS 4096 + 1024 [dd]
  8,64   4        5     0.010347427 949945  G  WS 4096 + 1024 [dd]
  8,64   4        6     0.010349276 949945  P   N [dd]
  8,64   4        7     0.010349774 949945  I  WS 4096 + 1024 [dd]
253,12   4        4     0.010379249 949945  Q  WS 3072 + 1024 [dd]
  8,64   4        8     0.010380159 949945  A  WS 5120 + 1024 <- (253,12) 3072
  8,64   4        9     0.010380787 949945  Q  WS 5120 + 1024 [dd]
  8,64   4       10     0.010381797 949945  G  WS 5120 + 1024 [dd]
  8,64   4       11     0.010382719 949945  I  WS 5120 + 1024 [dd]
253,12   4        5     0.010383437 949945  U   N [dd] 0
  8,64   4       12     0.010383737 949945  U   N [dd] 2
  8,64   4       13     0.010384935 949945  D  WS 4096 + 1024 [dd]
  8,64   4       14     0.010394358 949945  D  WS 5120 + 1024 [dd]
  8,64   4       15     0.018110846     0  C  WS 4096 + 1024 [0]
253,12   4        6     0.018112921     0  C  WS 2048 + 1024 [0]
  8,64   4       16     0.019763115     0  C  WS 5120 + 1024 [0]
253,12   4        7     0.019765293     0  C  WS 3072 + 1024 [0]
253,12   4        8     0.020006270 949945  Q  WS 4096 + 1024 [dd]
  8,64   4       17     0.020007928 949945  A  WS 6144 + 1024 <- (253,12) 4096
  8,64   4       18     0.020008906 949945  Q  WS 6144 + 1024 [dd]
  8,64   4       19     0.020010114 949945  G  WS 6144 + 1024 [dd]
  8,64   4       20     0.020011824 949945  P   N [dd]
  8,64   4       21     0.020012472 949945  I  WS 6144 + 1024 [dd]
253,12   4        9     0.020042173 949945  Q  WS 5120 + 1024 [dd]

dd and blktrace with resource disconnected:

253,12   6        1     0.000000000 935502  Q  WS 0 + 1024 [dd]
  8,64   6        1     0.000005862 935502  A  WS 2048 + 1024 <- (253,12) 0
  8,64   6        2     0.000008418 935502  Q  WS 2048 + 1024 [dd]
  8,64   6        3     0.000012204 935502  G  WS 2048 + 1024 [dd]
  8,64   6        4     0.000015305 935502  P   N [dd]
  8,64   6        5     0.000017047 935502  I  WS 2048 + 1024 [dd]
253,12   6        2     0.000049506 935502  Q  WS 1024 + 1024 [dd]
  8,64   6        6     0.000052540 935502  A  WS 3072 + 1024 <- (253,12) 1024
  8,64   6        7     0.000053122 935502  Q  WS 3072 + 1024 [dd]
  8,64   6        8     0.000055384 935502  G  WS 3072 + 1024 [dd]
  8,64   6        9     0.000056396 935502  I  WS 3072 + 1024 [dd]
253,12   6        3     0.000057981 935502  U   N [dd] 0
  8,64   6       10     0.000058619 935502  U   N [dd] 2
  8,64   6       11     0.000061829 935502  D  WS 2048 + 1024 [dd]
  8,64   6       12     0.000082332 935502  D  WS 3072 + 1024 [dd]
  8,64   4        1     0.008042208 902668  C  WS 2048 + 1024 [0]
253,12   4        1     0.008045714 902668  C  WS 0 + 1024 [0]
  8,64   4        2     0.009682479     0  C  WS 3072 + 1024 [0]
253,12   4        2     0.009684697     0  C  WS 1024 + 1024 [0]
253,12   4        3     0.009935296 935502  Q  WS 2048 + 1024 [dd]
  8,64   4        3     0.009937327 935502  A  WS 4096 + 1024 <- (253,12) 2048
  8,64   4        4     0.009938885 935502  Q  WS 4096 + 1024 [dd]
  8,64   4        5     0.009940089 935502  G  WS 4096 + 1024 [dd]
  8,64   4        6     0.009941995 935502  P   N [dd]
  8,64   4        7     0.009942455 935502  I  WS 4096 + 1024 [dd]
253,12   4        4     0.009972506 935502  Q  WS 3072 + 1024 [dd]
  8,64   4        8     0.009973408 935502  A  WS 5120 + 1024 <- (253,12) 3072
  8,64   4        9     0.009974076 935502  Q  WS 5120 + 1024 [dd]
  8,64   4       10     0.009975116 935502  G  WS 5120 + 1024 [dd]
  8,64   4       11     0.009976072 935502  I  WS 5120 + 1024 [dd]
253,12   4        5     0.009976980 935502  U   N [dd] 0
  8,64   4       12     0.009977378 935502  U   N [dd] 2
  8,64   4       13     0.009978559 935502  D  WS 4096 + 1024 [dd]
  8,64   4       14     0.009986557 935502  D  WS 5120 + 1024 [dd]
  8,64   4       15     0.017681675     0  C  WS 4096 + 1024 [0]
253,12   4        6     0.017683305     0  C  WS 2048 + 1024 [0]
  8,64   4       16     0.019457154     0  C  WS 5120 + 1024 [0]
253,12   4        7     0.019458808     0  C  WS 3072 + 1024 [0]
253,12   4        8     0.019700743 935502  Q  WS 4096 + 1024 [dd]
  8,64   4       17     0.019703005 935502  A  WS 6144 + 1024 <- (253,12) 4096
  8,64   4       18     0.019704175 935502  Q  WS 6144 + 1024 [dd]
  8,64   4       19     0.019705315 935502  G  WS 6144 + 1024 [dd]
  8,64   4       20     0.019706887 935502  P   N [dd]
  8,64   4       21     0.019707887 935502  I  WS 6144 + 1024 [dd]
253,12   4        9     0.019738113 935502  Q  WS 5120 + 1024 [dd]

Running dd from the iscsi peer shows it is issuing 512KiB (1024-sector) requests across the link to the iscsi target.
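As an aside, the per-request D (dispatch) to C (completion) latency on the SCSI disk can be pulled out of output like the above with something along these lines (a sketch; field positions assume the default blkparse format, with the trace saved to trace.txt):

# report dispatch-to-completion time per request on device 8,64
awk '$1=="8,64" && $6=="D" {d[$8]=$4}
     $1=="8,64" && $6=="C" && ($8 in d) {printf "sector %s + %s: %.3f ms\n", $8, $10, ($4-d[$8])*1000; delete d[$8]}' trace.txt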

Please let me know if you require any further information.

Importantly, I believe I have discovered the cause of the high IO wait I was seeing during the sustained write of a large (40GB) file from the iscsi peer to the iscsi target (drbd primary node).

While searching Google for latency problems in kernel 3.2 I came across a number of postings where the intel_idle cpuidle driver was mentioned as a cause of latency, as a result of it putting the processor(s) into sleep states deeper than C1.  When I checked using the i7z tool from Google Code (http://code.google.com/p/i7z/), the processors were spending a significant percentage of time in the C3 and C6 states, even though I had disabled Intel SpeedStep and other power-saving features in the BIOS.  As per http://stackoverflow.com/questions/12111954/context-switches-much-slower-in-new-linux-kernels
I added the recommended kernel options to disable the intel_idle driver as follows:

intel_idle.max_cstate=0 processor.max_cstate=1 idle=mwait
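On these nodes (a sketch, assuming a Debian-style GRUB2 setup) that meant appending the options to the kernel command line and regenerating the grub configuration before rebooting:

# /etc/default/grub - add the options to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=1 idle=mwait"
# then regenerate the configuration
update-grub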

After rebooting and performing the restore from the iscsi peer to the iscsi target (drbd primary), IO wait was < 2%, compared with > 99% across all processors prior to the change.
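For anyone wanting to check their own nodes before making such a change, the C-state residency counters are also visible in sysfs (a sketch; state indices and names vary by CPU, and the cpuidle directory may no longer be present once the driver is disabled):

# list the available C-states and the time (microseconds) spent in each since boot
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/time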

Thanks again

Paul


