[DRBD-user] High iowait on primary DRBD node with large sustained writes and replication enabled to secondary

Paul Freeman paul.freeman at emlair.com.au
Tue Jan 15 23:14:32 CET 2013

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Andy,
I don't think the network bandwidth between the two iscsi/drbd nodes is the limiting factor.

I have changed the bond to balance-rr (balanced round-robin) instead of active-passive, and testing with iperf now gives about 1.8 Gbit/s between the two nodes.
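For reference, the bond change and the throughput check are nothing exotic - roughly the following on a Debian-style setup, with interface names and addresses as placeholders rather than my actual config:

    # /etc/network/interfaces fragment on each drbd node (ifenslave installed)
    auto bond1
    iface bond1 inet static
        address     192.168.100.1    # dedicated replication address (placeholder)
        netmask     255.255.255.0
        bond-slaves eth2 eth3
        bond-mode   balance-rr
        bond-miimon 100

    # throughput check over the bond
    # on the secondary:  iperf -s
    # on the primary:    iperf -c 192.168.100.2 -t 30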

However, the high io wait remains the same as with the bond in active-passive mode.

I have the iscsi peer connected via a 1 Gbps active-passive bond to the LAN (and thus to the iscsi/drbd nodes).  The highest transfer speed I have seen during a restore (Proxmox qmrestore) to the primary iscsi/drbd node in the disconnected state is about 110 Megabits/s.

Thanks for the suggestion though - I appreciate it.

I tried lowering the unplug-watermark from the default (128) to 16.  This seemed to make a difference in the first test: the io wait still reached 99%, but it then dropped back before climbing and dropping again, and this pattern repeated.
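In case anyone else wants to try the same thing, the change itself is a one-liner in the resource config - a minimal sketch, assuming DRBD 8.3 where unplug-watermark lives in the net section, and with the resource name as a placeholder:

    resource r0 {
      net {
        unplug-watermark 16;   # default 128, minimum 16; kicks the backing device sooner
      }
      ...
    }

followed by "drbdadm adjust r0" (or a reconnect of the resource) on both nodes to pick it up.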

However, when I tested it a second time the iscsi connection timed out a couple of minutes into the restore and the iscsi target was disconnected on that Proxmox node :-(  It was necessary to delete that node's iscsi session to clear the problem.
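For anyone who hits the same timeout, clearing the stuck session on the Proxmox node was essentially the usual open-iscsi steps - something along these lines, with the target IQN and portal as placeholders:

    iscsiadm -m session                                                     # list sessions, find the stale one
    iscsiadm -m node -T iqn.2001-04.com.example:backup01 -p 10.0.0.10 -u    # log out of the stale session
    iscsiadm -m node -T iqn.2001-04.com.example:backup01 -p 10.0.0.10 -l    # log back in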

Thanks again

Paul

> -----Original Message-----
> From: Andy Dills [mailto:andy at xecu.net]
> Sent: Wednesday, 16 January 2013 3:15 AM
> To: Paul Freeman
> Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> sustained writes and replication enabled to secondary
> 
> 
> Paul,
> 
> Is it possible you're being constrained by the bandwidth connecting your
> two drbd servers?
> 
> As I learned in my thread, writes to the local volume are constrained by
> the bandwidth between it and the connected volume... basically it sounds
> like 10 Gbit/s is the only way to go.
> 
> Not sure it totally explains the iowait situation, but it's worth
> experimenting with... for example, I changed my bond method to balance-rr
> and was able to get over 200 MB/s on dd writes. I'm curious whether, if you
> increased the bandwidth between the drbd nodes even by that much, it would
> change your results.
> 
> Andy
> 
> On Tue, 15 Jan 2013, Paul Freeman wrote:
> 
> > Sorry for replying to my own email but my initial thoughts that I had
> > resolved the high IO wait issue have proved to be wrong.
> >
> > It seems the testing I did on a test iscsi target created on the
> > iscsi/drbd servers does not perform the same as a production iscsi target.
> > I am not sure why at this stage, other than that the test target gives
> > best-case results because nothing else is using it, unlike the production
> > ones.
> >
> > Back to testing.
> >
> > Apologies for the misinformation.
> >
> > Regards
> >
> > Paul
> >
> > > -----Original Message-----
> > > From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-
> > > bounces at lists.linbit.com] On Behalf Of Paul Freeman
> > > Sent: Tuesday, 15 January 2013 12:54 PM
> > > To: Sebastian Riemer
> > > Cc: drbd-user at lists.linbit.com
> > > Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> > > sustained writes and replication enabled to secondary
> > >
> > >
> > > Sebastian,
> > > Thanks for your analyses - I really appreciate it.  I have taken a
> > > little while to reply as I wanted to complete a dd from the iscsi peer
> > > to the primary iscsi node and run blktrace on the iscsi peer.
> > >
> > > My comments are inline.
> > >
> > > Regards
> > >
> > > Paul
> > > > -----Original Message-----
> > > > From: Sebastian Riemer [mailto:sebastian.riemer at profitbricks.com]
> > > > Sent: Friday, 11 January 2013 10:08 PM
> > > > To: Paul Freeman
> > > > Cc: drbd-user at lists.linbit.com
> > > > Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> > > > sustained writes and replication enabled to secondary
> > > >
> > > > On 11.01.2013 04:36, Paul Freeman wrote:
> > > > > Sebastian,
> > > > > I have collected some tcpdump logs on the internode link used for
> > > > > drbd while writing 100M to the primary drbd node using dd (as
> > > > > before) and with the relevant drbd resource connected.
> > > > >
> > > > > I am not very proficient at interpreting the tcpdump traffic so
> > > > > please bear with me :-)
> > > > >
> > > > > I ran tcpdump with the options "-vvv -s0 tcp port 7798".  Port 7798
> > > > > is the port used by this drbd resource.  I subsequently used
> > > > > wireshark to identify the tcp stream associated with the replication
> > > > > traffic and set an appropriate filter to extract those packets and
> > > > > save them to a separate packet capture file.
> > > > >
> > > > > My first guess is the segment ACK RTT may be relevant to our
> > > > > investigation.  When I extract the segment ACKs from the secondary
> > > > > node using tshark and tcp.analysis.ack_rtt and then process them in
> > > > > MS Excel (the extraction command is sketched after the numbers), the
> > > > > results are as follows:
> > > > >
> > > > > Number of segment ACKs:	5255
> > > > > Average RTT: 1.87ms
> > > > > Minimum RTT: 0.078ms
> > > > > Maximum RTT: 41.45ms (actually the last ACK)
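> > > > > (For reference, the per-segment values were pulled out of the saved
> > > > > capture with tshark roughly like this - the capture file name is a
> > > > > placeholder:
> > > > >
> > > > >     tshark -r drbd-7798.pcap -R "tcp.analysis.ack_rtt" -T fields -e tcp.analysis.ack_rtt
> > > > >
> > > > > i.e. read the capture, keep only segments flagged with an ACK RTT,
> > > > > and print just that field, one value per line.)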
> > > > >
> > > > > The time for dd to complete was 950ms.
> > > > >
> > > > > Does this provide any information on the latency problem?
> > > >
> > > > It would be better to write locally on the primary, so that we only
> > > > see the network latency of the inter-node connection and not of the
> > > > iSCSI connection as well.
> > > >
> > >
> > > The numbers I provided were for writing to the iscsitarget from the
> > > iscsitarget server itself, so they exclude the iscsi network latency.
> > >
> > > > I had a deeper look at your blktrace output.
> > > >
> > > > 1 MB from queuing the first 128 KiB chunk to queuing the 9th 128 KiB
> > > > chunk on the DRBD device lasts:
> > > >
> > > > discon  connec
> > > > 1.33 ms 8.86 ms
> > > >
> > > > Difference: 7.53 ms.
> > > >
> > > > Also the latency between queuing (Q) the first IO on the DRBD device
> > > > and dispatching (D) it on the SCSI disk below changed. This is the
> > > > change of DRBD latency.
> > > >
> > > > discon  connec
> > > > 8.7 us  12.8 us
> > > >
> > > > 4.1 us * 8 = 32.8 us. This is negligible.
> > > >
> > > > So the time the packets need in the network layer should be
> > > > something < 7.5 ms until 1 MB is transferred. This is the additional
> > > > time which hurts you when you are connected with DRBD.
> > > >
> > > > Please do a blktrace on the iSCSI client for disconnected and
> > > > connected DRBD primary. Then, we'll see which of the two chained
> > > > network connections is faster.
> > > >
> > > > This is what we know for 1 MB, now:
> > > > iSCSI path: ? ms
> > > > SCSI/disk: 1.33 ms
> > > > between DRBD nodes: 7.53 ms
> > > >
> > > > Overall latency = iSCSI path + SCSI/disk + between DRBD nodes.
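> > > > (Plugging in the numbers so far for 1 MB while connected: overall =
> > > > iSCSI path + 1.33 ms + 7.53 ms = iSCSI path + 8.86 ms.)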
> > > >
> > > > Cheers,
> > > > Sebastian
> > >
> > > I have performed some more testing using dd and blktrace from the
> > > iscsi peer, which is connected to the iscsi target (drbd primary node)
> > > (the same target as in previous tests) via the LAN using a bonded
> > > network connection of two 1 Gbps adaptors in an active-passive bond, so
> > > I would expect a maximum throughput of approx. 800-900 Mbps.
> > >
> > > The two iscsi/drbd nodes are at present connected by a similar network
> > > bonding arrangement, so again tcp network throughput will be approx.
> > > 800-900 Mbps.
> > >
> > > I tested with the drbd resource connected and disconnected.  The raw
> > > dd results were as follows.
> > >
> > > Resource connected:
> > > root@sur-vm-02:~# dd if=/dev/zero of=/dev/backup_01/vm-100-disk-1 oflag=direct bs=1M count=100
> > > 100+0 records in
> > > 100+0 records out
> > > 104857600 bytes (105 MB) copied, 0.983008 s, 107 MB/s
> > >
> > > Resource disconnected:
> > > root@sur-vm-02:~# dd if=/dev/zero of=/dev/backup_01/vm-100-disk-1 oflag=direct bs=1M count=100
> > > 100+0 records in
> > > 100+0 records out
> > > 104857600 bytes (105 MB) copied, 0.980041 s, 107 MB/s
> > >
> > > The blktrace data for the resource-connected state is presented in a
> > > similar manner to my earlier email.
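> > > (For completeness: the traces were captured with blktrace on the LVM
> > > volume and the underlying iscsi disk and formatted with blkparse,
> > > roughly as follows - the device names are placeholders for my actual
> > > devices:
> > >
> > >     blktrace -d /dev/backup_01/vm-100-disk-1 -d /dev/sde -o - | blkparse -i -
> > >
> > > run while the dd was in progress.)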
> > >
> > > dd and blktrace with resource connected:
> > >
> > > 253,12   6        1     0.000000000 949945  Q  WS 0 + 1024 [dd]
> > >   8,64   6        1     0.000010264 949945  A  WS 2048 + 1024 <- (253,12) 0
> > >   8,64   6        2     0.000012718 949945  Q  WS 2048 + 1024 [dd]
> > >   8,64   6        3     0.000017482 949945  G  WS 2048 + 1024 [dd]
> > >   8,64   6        4     0.000020597 949945  P   N [dd]
> > >   8,64   6        5     0.000021889 949945  I  WS 2048 + 1024 [dd]
> > > 253,12   6        2     0.000053322 949945  Q  WS 1024 + 1024 [dd]
> > >   8,64   6        6     0.000054656 949945  A  WS 3072 + 1024 <- (253,12) 1024
> > >   8,64   6        7     0.000055258 949945  Q  WS 3072 + 1024 [dd]
> > >   8,64   6        8     0.000056680 949945  G  WS 3072 + 1024 [dd]
> > >   8,64   6        9     0.000057596 949945  I  WS 3072 + 1024 [dd]
> > > 253,12   6        3     0.000059072 949945  U   N [dd] 0
> > >   8,64   6       10     0.000059508 949945  U   N [dd] 2
> > >   8,64   6       11     0.000062659 949945  D  WS 2048 + 1024 [dd]
> > >   8,64   6       12     0.000083516 949945  D  WS 3072 + 1024 [dd]
> > >   8,64   4        1     0.008451596     0  C  WS 2048 + 1024 [0]
> > > 253,12   4        1     0.008455216     0  C  WS 0 + 1024 [0]
> > >   8,64   4        2     0.010099424     0  C  WS 3072 + 1024 [0]
> > > 253,12   4        2     0.010101816     0  C  WS 1024 + 1024 [0]
> > > 253,12   4        3     0.010343363 949945  Q  WS 2048 + 1024 [dd]
> > >   8,64   4        3     0.010345313 949945  A  WS 4096 + 1024 <- (253,12) 2048
> > >   8,64   4        4     0.010346379 949945  Q  WS 4096 + 1024 [dd]
> > >   8,64   4        5     0.010347427 949945  G  WS 4096 + 1024 [dd]
> > >   8,64   4        6     0.010349276 949945  P   N [dd]
> > >   8,64   4        7     0.010349774 949945  I  WS 4096 + 1024 [dd]
> > > 253,12   4        4     0.010379249 949945  Q  WS 3072 + 1024 [dd]
> > >   8,64   4        8     0.010380159 949945  A  WS 5120 + 1024 <- (253,12) 3072
> > >   8,64   4        9     0.010380787 949945  Q  WS 5120 + 1024 [dd]
> > >   8,64   4       10     0.010381797 949945  G  WS 5120 + 1024 [dd]
> > >   8,64   4       11     0.010382719 949945  I  WS 5120 + 1024 [dd]
> > > 253,12   4        5     0.010383437 949945  U   N [dd] 0
> > >   8,64   4       12     0.010383737 949945  U   N [dd] 2
> > >   8,64   4       13     0.010384935 949945  D  WS 4096 + 1024 [dd]
> > >   8,64   4       14     0.010394358 949945  D  WS 5120 + 1024 [dd]
> > >   8,64   4       15     0.018110846     0  C  WS 4096 + 1024 [0]
> > > 253,12   4        6     0.018112921     0  C  WS 2048 + 1024 [0]
> > >   8,64   4       16     0.019763115     0  C  WS 5120 + 1024 [0]
> > > 253,12   4        7     0.019765293     0  C  WS 3072 + 1024 [0]
> > > 253,12   4        8     0.020006270 949945  Q  WS 4096 + 1024 [dd]
> > >   8,64   4       17     0.020007928 949945  A  WS 6144 + 1024 <- (253,12) 4096
> > >   8,64   4       18     0.020008906 949945  Q  WS 6144 + 1024 [dd]
> > >   8,64   4       19     0.020010114 949945  G  WS 6144 + 1024 [dd]
> > >   8,64   4       20     0.020011824 949945  P   N [dd]
> > >   8,64   4       21     0.020012472 949945  I  WS 6144 + 1024 [dd]
> > > 253,12   4        9     0.020042173 949945  Q  WS 5120 + 1024 [dd]
> > >
> > > dd and blktrace with resource disconnected:
> > >
> > > 253,12   6        1     0.000000000 935502  Q  WS 0 + 1024 [dd]
> > >   8,64   6        1     0.000005862 935502  A  WS 2048 + 1024 <- (253,12) 0
> > >   8,64   6        2     0.000008418 935502  Q  WS 2048 + 1024 [dd]
> > >   8,64   6        3     0.000012204 935502  G  WS 2048 + 1024 [dd]
> > >   8,64   6        4     0.000015305 935502  P   N [dd]
> > >   8,64   6        5     0.000017047 935502  I  WS 2048 + 1024 [dd]
> > > 253,12   6        2     0.000049506 935502  Q  WS 1024 + 1024 [dd]
> > >   8,64   6        6     0.000052540 935502  A  WS 3072 + 1024 <- (253,12) 1024
> > >   8,64   6        7     0.000053122 935502  Q  WS 3072 + 1024 [dd]
> > >   8,64   6        8     0.000055384 935502  G  WS 3072 + 1024 [dd]
> > >   8,64   6        9     0.000056396 935502  I  WS 3072 + 1024 [dd]
> > > 253,12   6        3     0.000057981 935502  U   N [dd] 0
> > >   8,64   6       10     0.000058619 935502  U   N [dd] 2
> > >   8,64   6       11     0.000061829 935502  D  WS 2048 + 1024 [dd]
> > >   8,64   6       12     0.000082332 935502  D  WS 3072 + 1024 [dd]
> > >   8,64   4        1     0.008042208 902668  C  WS 2048 + 1024 [0]
> > > 253,12   4        1     0.008045714 902668  C  WS 0 + 1024 [0]
> > >   8,64   4        2     0.009682479     0  C  WS 3072 + 1024 [0]
> > > 253,12   4        2     0.009684697     0  C  WS 1024 + 1024 [0]
> > > 253,12   4        3     0.009935296 935502  Q  WS 2048 + 1024 [dd]
> > >   8,64   4        3     0.009937327 935502  A  WS 4096 + 1024 <- (253,12) 2048
> > >   8,64   4        4     0.009938885 935502  Q  WS 4096 + 1024 [dd]
> > >   8,64   4        5     0.009940089 935502  G  WS 4096 + 1024 [dd]
> > >   8,64   4        6     0.009941995 935502  P   N [dd]
> > >   8,64   4        7     0.009942455 935502  I  WS 4096 + 1024 [dd]
> > > 253,12   4        4     0.009972506 935502  Q  WS 3072 + 1024 [dd]
> > >   8,64   4        8     0.009973408 935502  A  WS 5120 + 1024 <- (253,12) 3072
> > >   8,64   4        9     0.009974076 935502  Q  WS 5120 + 1024 [dd]
> > >   8,64   4       10     0.009975116 935502  G  WS 5120 + 1024 [dd]
> > >   8,64   4       11     0.009976072 935502  I  WS 5120 + 1024 [dd]
> > > 253,12   4        5     0.009976980 935502  U   N [dd] 0
> > >   8,64   4       12     0.009977378 935502  U   N [dd] 2
> > >   8,64   4       13     0.009978559 935502  D  WS 4096 + 1024 [dd]
> > >   8,64   4       14     0.009986557 935502  D  WS 5120 + 1024 [dd]
> > >   8,64   4       15     0.017681675     0  C  WS 4096 + 1024 [0]
> > > 253,12   4        6     0.017683305     0  C  WS 2048 + 1024 [0]
> > >   8,64   4       16     0.019457154     0  C  WS 5120 + 1024 [0]
> > > 253,12   4        7     0.019458808     0  C  WS 3072 + 1024 [0]
> > > 253,12   4        8     0.019700743 935502  Q  WS 4096 + 1024 [dd]
> > >   8,64   4       17     0.019703005 935502  A  WS 6144 + 1024 <- (253,12) 4096
> > >   8,64   4       18     0.019704175 935502  Q  WS 6144 + 1024 [dd]
> > >   8,64   4       19     0.019705315 935502  G  WS 6144 + 1024 [dd]
> > >   8,64   4       20     0.019706887 935502  P   N [dd]
> > >   8,64   4       21     0.019707887 935502  I  WS 6144 + 1024 [dd]
> > > 253,12   4        9     0.019738113 935502  Q  WS 5120 + 1024 [dd]
> > >
> > > Running dd from the iscsi peer shows it is using a 512KiB block size
> > > across the link to the iscsi target.
> > >
> > > Please let me know if you require any further information.
> > >
> > > Importantly, I believe I have discovered the cause of the high IO wait
> > > I was seeing during the sustained writing of a larger (40GB) file from
> > > the iscsi peer to the iscsi target (drbd primary node).
> > >
> > > While searching on Google for latency issues with kernel 3.2 I came
> > > across a number of postings where the intel_idle cpuidle driver was
> > > mentioned as causing some latency as a result of putting the
> > > processor(s) into sleep states > C1.  When I checked using the i7z tool
> > > from Google Code (http://code.google.com/p/i7z/) the processors were
> > > spending a significant % of time in the C3 and C6 states.  This was
> > > even though I had disabled Intel SpeedStep and other power-saving
> > > features in the BIOS.  As per
> > > http://stackoverflow.com/questions/12111954/context-switches-much-slower-in-new-linux-kernels
> > > I added the recommended kernel options to disable the intel_idle driver
> > > as follows:
> > >
> > > intel_idle.max_cstate=0 processor.max_cstate=1 idle=mwait
> > >
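> > > (On a Debian/Proxmox-style install these typically go onto the kernel
> > > command line via /etc/default/grub, e.g.
> > >
> > >     GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=0 processor.max_cstate=1 idle=mwait"
> > >
> > > followed by update-grub and a reboot; "quiet" here just stands in for
> > > whatever was already on the line.)
> > >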
> > > After rebooting and performing the restore from the iscsi peer to the
> > > iscsi target (drbd primary), IO wait was < 2% compared with >99% across
> > > all processors prior to the change.
> > >
> > > Thanks again
> > >
> > > Paul
> > > _______________________________________________
> > > drbd-user mailing list
> > > drbd-user at lists.linbit.com
> > > http://lists.linbit.com/mailman/listinfo/drbd-user
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> >
> 
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---
