<html>

<head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> 

</head>

<body>

<h1>Interesting observations of iostat</h1>

<p><a href="mailto:phil@macprofessionals.com">Phil Frost</a><br/>

<a href="http://macprofessionals.com/">Macprofessionals</a></p>

<h2>Introduction</h2>

<p>If you have ever tried to characterize IO performance, you have probably

heard of iostat. If not, it's a part of the

<a href="http://sebastien.godard.pagesperso-orange.fr/">sysstat</a>

package which monitors the

<a href="http://www.mjmwired.net/kernel/Documentation/block/stat.txt">block

device statistics provided by Linux</a>, among many other things.

It can give an administrator great insight into what sort of IO load is being

handled by a device, and how well it's keeping pace. An example run, taken from

one of our nameservers, looks like this:</p>

<pre>$ iostat -dx
Linux 2.6.32-5-xen-amd64 (ns1)  05/18/12        _x86_64_        (1 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap2            0.00     0.17    1.24    0.37    39.70     4.34    27.34     0.09   57.30   4.22   0.68
xvdap1            1.22     3.91    3.57    0.49    38.26    35.18    18.11     0.29   72.37   1.32   0.54</pre>

<p>The basics of this tool are well documented. There is of course the

<a href="http://linux.die.net/man/1/iostat">manual page</a>. I also like this

<a href="http://dom.as/2009/03/11/iostat/">post by Domas Mituzas</a> to explain

the basics. I had read many such posts, but I was still confused by the numbers

iostat was giving me. I'm going to explain the meaning of these numbers with

some mathematical rigor, show how they are calculated, and explain some of the

stranger ways in which they can behave.</p>

<p><i>iostat</i> performs all its calculations from a handful of counters

provided by the Linux kernel. They can be accessed either from /sys/block/*/stat

or /proc/diskstats. Let's give some names to the counters for the sake of this

discussion:</p>

<dl>

    <dt>rdios</dt>     <dd>number of read I/Os processed</dd>

    <dt>rdmerges</dt>  <dd>number of read I/Os merged with in-queue I/O</dd>

    <dt>rdsectors</dt> <dd>number of sectors read</dd>

    <dt>rdticks</dt>   <dd>total wait time for read requests</dd>

    <dt>wrios</dt>     <dd>number of write I/Os processed</dd>

    <dt>wrmerges</dt>  <dd>number of write I/Os merged with in-queue I/O</dd>

    <dt>wrsectors</dt> <dd>number of sectors written</dd>

    <dt>wrticks</dt>   <dd>total wait time for write requests</dd>

    <dt>iosprg</dt>    <dd>number of I/Os currently in flight</dd>

    <dt>totticks</dt>  <dd>total time this block device has been active</dd>

    <dt>rqticks</dt>   <dd>total wait time for all requests</dd>

</dl>
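<p>A minimal Python sketch of reading these counters from /proc/diskstats might
look like the following. The field names are the ones chosen for this
discussion, not names the kernel defines, and the helpers
<i>parse_diskstats_line</i> and <i>read_diskstats</i> as well as the sample line
are made up for illustration. The sketch assumes each line has three
identifying columns (major, minor, device name) followed by the eleven counters
in the order listed above, and it ignores any extra columns a newer kernel may
append.</p>

<pre>
```python
from collections import namedtuple

# Field names follow this article's naming, in the order the kernel lists them.
FIELDS = ("rdios rdmerges rdsectors rdticks "
          "wrios wrmerges wrsectors wrticks "
          "iosprg totticks rqticks").split()

DiskStats = namedtuple("DiskStats", FIELDS)


def parse_diskstats_line(line):
    """Return (device_name, DiskStats) for one line of /proc/diskstats."""
    parts = line.split()
    name = parts[2]  # parts[0] and parts[1] are the major and minor numbers
    # Assumption: newer kernels append extra fields; keep only the first 11.
    values = [int(v) for v in parts[3:3 + len(FIELDS)]]
    return name, DiskStats(*values)


def read_diskstats(path="/proc/diskstats"):
    """Return a dict mapping device name to its current counters."""
    with open(path) as f:
        return dict(parse_diskstats_line(line) for line in f if line.strip())


# A made-up sample line, for illustration only.
sample = " 202  2 xvdap2 1240 0 39700 5000 370 170 4340 87000 0 6800 92000"
name, stats = parse_diskstats_line(sample)
print(name, stats.rdios, stats.totticks, stats.rqticks)
```
</pre>

<p>Calling <i>read_diskstats()</i> twice, some seconds apart, gives the two
samples whose differences the iostat metrics are computed from.</p>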

<p><a href="http://www.mjmwired.net/kernel/Documentation/block/stat.txt">The

Linux documentation</a> explains the significance of these counters in greater

detail. However, they are mostly obvious, except for the difference between

<i>totticks</i> and <i>rqticks</i>. The difference there is key to the

following discussion, so I'll elaborate.</p>

<p><i>totticks</i> is incremented by 1 for every ms that passes where there is

something in the queue. It doesn't matter if there is 1 request in the queue,

or 100, this counter still increments by just one for each 1 ms of wall clock

time that elapses.</p>

<p><i>rqticks</i> differs because it's weighted by the number of requests

in the queue. The aforementioned kernel documentation offers a more complicated

explanation, but I find it easier to describe this field as the sum of

<i>rdticks</i> and <i>wrticks</i>. Each time a request enters the queue, it

begins incrementing <i>rdticks</i> or <i>wrticks</i> by 1 for each ms it spends

in the queue.  If there is more than 1 request in the queue, then these

counters will be incremented by each waiting request. So, if in 1 ms of wall

clock time, there were 2 requests in the queue, these counters would increment

by 2, while <i>totticks</i> increments only by 1.</p>
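<p>The difference between the two counters can be illustrated with a toy model
in Python. This is only a sketch of the accounting described above, not how the
kernel implements it: it assumes we somehow know the queue depth for each
millisecond, and the function name <i>simulate</i> and the trace are
hypothetical.</p>

<pre>
```python
def simulate(queue_depth_per_ms):
    """Accumulate totticks and rqticks for a per-millisecond queue depth trace."""
    totticks = 0
    rqticks = 0
    for depth in queue_depth_per_ms:
        if depth != 0:
            totticks += 1  # one tick of wall clock time with a non-empty queue
        rqticks += depth   # one tick for every request sitting in the queue
    return totticks, rqticks


# 10 ms of wall clock time: idle for 6 ms, then 2 requests queued for 4 ms.
trace = [0, 0, 0, 0, 0, 0, 2, 2, 2, 2]
totticks, rqticks = simulate(trace)
print(totticks, rqticks)  # prints: 4 8
```
</pre>

<p>The device was busy for 4 ms, so <i>totticks</i> grows by 4, but two requests
each waited 4 ms, so <i>rqticks</i> grows by 8.</p>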

<p>It should be obvious how most of iostat's numbers are calculated from these

counters. <i>r/s</i>, for example, is the change in <i>rdios</i> over the

sampling period, divided by the length of the period. The last four numbers are

more interesting:</p>

<!-- \text{avgqu-sz} = \frac{\Delta rqticks}{p} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Ctext%7Bavgqu-sz%7D%20%3D%20%5Cfrac%7B%5CDelta%20rqticks%7D%7Bp%7D" /></p>

<!-- \text{await} = \frac{\Delta rqticks}{\Delta rdios + \Delta wrios} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Ctext%7Bawait%7D%20%3D%20%5Cfrac%7B%5CDelta%20rqticks%7D%7B%5CDelta%20rdios%20%2B%20%5CDelta%20wrios%7D" /></p>

<!-- \text{svctm} = \frac{\Delta totticks}{\Delta rdios + \Delta wrios} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Ctext%7Bsvctm%7D%20%3D%20%5Cfrac%7B%5CDelta%20totticks%7D%7B%5CDelta%20rdios%20%2B%20%5CDelta%20wrios%7D" /></p>

<!-- \text{util} = \frac{\Delta totticks \cdot 100}{p} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Ctext%7Butil%7D%20%3D%20%5Cfrac%7B%5CDelta%20totticks%20%5Ccdot%20100%7D%7Bp%7D" /></p>
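<p>The four formulas above translate directly into code. The following Python
sketch assumes two counter samples taken <i>period_ms</i> milliseconds apart,
each given as a dict keyed by the counter names used in this article; the
function name <i>extended_stats</i> and the sample values are made up for
illustration, and returning zero when no I/O completed is a choice made here to
avoid dividing by zero.</p>

<pre>
```python
def extended_stats(prev, curr, period_ms):
    """Compute avgqu-sz, await, svctm and %util from two counter samples."""
    ios = (curr["rdios"] - prev["rdios"]) + (curr["wrios"] - prev["wrios"])
    d_rqticks = curr["rqticks"] - prev["rqticks"]
    d_totticks = curr["totticks"] - prev["totticks"]
    return {
        "avgqu-sz": d_rqticks / period_ms,
        # Assumption: report 0 when no I/O completed during the period.
        "await": d_rqticks / ios if ios else 0.0,
        "svctm": d_totticks / ios if ios else 0.0,
        "%util": d_totticks * 100 / period_ms,
    }


# Made-up counters: over 1000 ms, 3 reads and 1 write completed, the device
# was busy for 20 ms, and requests accumulated 80 ms of queue time in total.
prev = {"rdios": 0, "wrios": 0, "rqticks": 0, "totticks": 0}
curr = {"rdios": 3, "wrios": 1, "rqticks": 80, "totticks": 20}
result = extended_stats(prev, curr, 1000)
print(result)
```
</pre>

<p>For these made-up counters the result is an <i>avgqu-sz</i> of 0.08, an
<i>await</i> of 20 ms, a <i>svctm</i> of 5 ms, and a <i>%util</i> of 2.</p>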

<p>It's notable that there's nothing instrumenting the difference between time

spent waiting in the queue vs time spent waiting for a disk to service a

request. The descriptions of <i>await</i> and <i>svctm</i> might suggest there

is, but look carefully at the counters provided by the kernel, and you will see

there is not. <i>svctm</i> is simply inferred: if in 10 ms of busy time two
requests were serviced, then the device must have spent on average 5 ms
servicing each one. It may make more sense to think of <i>svctm</i> as the
reciprocal of the average IOPS for the time the device was not idle. If your
device is never idle,

<i>Δtotticks</i> is equal to the sample period, and then you can see

<i>svctm</i> is exactly equal to 1/IOPS. This makes it a useful metric to

monitor, since you should have some idea of the IOPS a device is capable of

delivering, but you may not always be fully loading it. If a device is

sometimes idle, <i>r/s</i> and <i>w/s</i> will be less than the device's

capabilities, but <i>svctm</i> will not be.</p>
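<p>A small worked example in Python may make this concrete. The numbers are
made up: a device that needs 5 ms per request is observed for one second, once
fully loaded and once idle half the time. The helper names <i>svctm_ms</i> and
<i>iops</i> are hypothetical.</p>

<pre>
```python
def svctm_ms(completed_ios, busy_ms):
    """Average busy time per completed I/O, in milliseconds."""
    return busy_ms / completed_ios


def iops(completed_ios, period_ms):
    """Completed I/Os per second over the whole sampling period."""
    return completed_ios * 1000 / period_ms


PERIOD_MS = 1000

# Never idle: 200 I/Os completed, busy for the full 1000 ms.
print(svctm_ms(200, 1000), iops(200, PERIOD_MS))  # prints: 5.0 200.0

# Idle half the time: 100 I/Os completed, busy for only 500 ms.
print(svctm_ms(100, 500), iops(100, PERIOD_MS))   # prints: 5.0 100.0
```
</pre>

<p>In the first case <i>svctm</i> is 1000 / 200 = 5 ms, exactly 1/IOPS expressed
in milliseconds. In the second case the observed IOPS halves, but <i>svctm</i>
still reports 5 ms.</p>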

<p>Of course, the IOPS a device can deliver depends on the sort of load given

to it. Sequential accesses will be faster than random accesses for spinning

media. A more complicated case is a RAID device, which can achieve higher IOPS

by servicing requests concurrently with multiple disks, but if there is only

one request in line to be serviced, the advantage of concurrency is moot.

Further, some IO schedulers may intentionally idle the device in anticipation

of more IO. All of these mechanisms can decrease IOPS, and consequently

increase <i>svctm</i>.</p>

<p>Now let's look at another example iostat run, one which confused me for a

while. I noticed this after I had started monitoring iostat on all my servers,

and I started receiving alerts for this server, which wasn't experiencing any

application performance issues:</p>

<pre>Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvdap3            0.00     1.03    1.32    0.74    10.18    14.17    11.79     0.26  126.68   2.67   0.55</pre>

<p>Why is await so high? <i>svctm</i> tells me that the disk is

responding at a very reasonable speed for a spinning disk. <i>avgqu-sz</i>,

<i>r/s</i>, and <i>w/s</i> are low, and certainly within the capabilities of

even the slowest disk found in modern hardware. Further, application

performance is fine. Why then is <i>await</i> high?</p>

<p>To understand this, it's helpful to calculate another metric. We already

have <i>avgqu-sz</i>, which tells us the average number of requests in the

queue over the entire sampling period. However, if we divide this by the

utilization (divide <i>%util</i> by 100), we get the average queue

length over the time the device was not idle. This leads to an interesting

relationship between the last four iostat metrics. (Remember I'm implicitly

dividing %util by 100):</p>

<!-- await = \frac{avgqusz}{util} \cdot svctm -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=await%20%3D%20%5Cfrac%7Bavgqusz%7D%7Butil%7D%20%5Ccdot%20svctm" /></p>

<p>That is, the average time a request must wait is the average length of the

queue when the device is not idle, multiplied by the average time it takes the

device to service one request. Rearranging a bit we can make another statement

which is useful when thinking about the IO system:</p>

<!-- \frac {await}{svctm} = \frac{avgqusz}{util} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%20%7Bawait%7D%7Bsvctm%7D%20%3D%20%5Cfrac%7Bavgqusz%7D%7Butil%7D" /></p>

<p>That is, the ratio of <i>await</i> to <i>svctm</i> is equal to the
average queue length when the device is not idle.

"<i>await</i> should not be much greater than <i>svctm</i>" is advice I've

heard many times. This is why, but it's not good advice for all applications.</p>

<p>I can prove this mathematically by substituting the definitions of the

metrics from above:</p>

<!-- \frac{\frac{\Delta rqticks}{\Delta rdios + \Delta wrios}}{\frac{\Delta totticks}{\Delta rdios + \Delta wrios}} = \frac{\frac{\Delta rqticks}{p}}{\frac{\Delta totticks}{p}} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5Cfrac%7B%5CDelta%20rqticks%7D%7B%5CDelta%20rdios%20%2B%20%5CDelta%20wrios%7D%7D%7B%5Cfrac%7B%5CDelta%20totticks%7D%7B%5CDelta%20rdios%20%2B%20%5CDelta%20wrios%7D%7D%20%3D%20%5Cfrac%7B%5Cfrac%7B%5CDelta%20rqticks%7D%7Bp%7D%7D%7B%5Cfrac%7B%5CDelta%20totticks%7D%7Bp%7D%7D" /></p>

<p>The common denominators in the complex fractions can be eliminated:</p>

<!-- \frac{\frac{\Delta rqticks}{\sout{\Delta rdios + \Delta wrios}}}{\frac{\Delta totticks}{\sout{\Delta rdios + \Delta wrios}}} = \frac{\frac{\Delta rqticks}{\sout{p}}}{\frac{\Delta totticks}{\sout{p}}} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5Cfrac%7B%5CDelta%20rqticks%7D%7B%5Csout%7B%5CDelta%20rdios%20%2B%20%5CDelta%20wrios%7D%7D%7D%7B%5Cfrac%7B%5CDelta%20totticks%7D%7B%5Csout%7B%5CDelta%20rdios%20%2B%20%5CDelta%20wrios%7D%7D%7D%20%3D%20%5Cfrac%7B%5Cfrac%7B%5CDelta%20rqticks%7D%7B%5Csout%7Bp%7D%7D%7D%7B%5Cfrac%7B%5CDelta%20totticks%7D%7B%5Csout%7Bp%7D%7D%7D" /></p>

<!-- \frac{\Delta rqticks}{\Delta totticks} = \frac{\Delta rqticks}{\Delta totticks} -->

<p><img src="http://chart.apis.google.com/chart?cht=tx&chl=%5Cfrac%7B%5CDelta%20rqticks%7D%7B%5CDelta%20totticks%7D%20%3D%20%5Cfrac%7B%5CDelta%20rqticks%7D%7B%5CDelta%20totticks%7D" /></p>

<p>So why was await so high in my example? It's because the IO was very bursty.

Though the device was busy only 0.55% of the time, it had 47 requests on

average in the queue (avgqu-sz / util, or 0.26 / 0.0055) when it wasn't

idle.</p>
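<p>As a quick sanity check, both sides of the relationship can be computed from
the xvdap3 line shown earlier. This Python sketch only uses the four values
printed by iostat; since those are rounded to two decimal places, the two sides
are expected to agree only approximately.</p>

<pre>
```python
# Values copied from the xvdap3 line of the iostat output above.
row = {"avgqu-sz": 0.26, "await": 126.68, "svctm": 2.67, "%util": 0.55}

# Average queue length while the device was not idle.
busy_queue_len = row["avgqu-sz"] / (row["%util"] / 100)

# The same quantity, seen as how many service times a request waits.
await_over_svctm = row["await"] / row["svctm"]

print(round(busy_queue_len, 1), round(await_over_svctm, 1))  # prints: 47.3 47.4
```
</pre>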

<p>Newer versions of <i>sysstat</i> calculate await separately for reads and

writes, but doing the same manually, I saw that await for reads was very low,

while await for writes was very high. This doesn't present an application performance

issue, because the writes are mostly generated by periodic writing of log files

and other things which can complete asynchronously in the background. The CFQ

scheduler that is the default in most current Linux distributions generally

favors reads over writes for this reason.</p>
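<p>Splitting <i>await</i> by direction only needs the per-direction counters:
divide <i>Δrdticks</i> by <i>Δrdios</i> and <i>Δwrticks</i> by <i>Δwrios</i>.
The Python sketch below shows this under the same assumptions as before (two
samples as dicts keyed by this article's counter names). The function name
<i>split_await</i> and the numbers are made up, chosen only to resemble the
pattern I saw, not taken from the real server.</p>

<pre>
```python
def split_await(prev, curr):
    """Return (read await, write await, combined await) in milliseconds."""
    d_rdios = curr["rdios"] - prev["rdios"]
    d_wrios = curr["wrios"] - prev["wrios"]
    d_rdticks = curr["rdticks"] - prev["rdticks"]
    d_wrticks = curr["wrticks"] - prev["wrticks"]
    total_ios = d_rdios + d_wrios
    # Assumption: report 0 when no I/O of that kind completed.
    r_await = d_rdticks / d_rdios if d_rdios else 0.0
    w_await = d_wrticks / d_wrios if d_wrios else 0.0
    combined = (d_rdticks + d_wrticks) / total_ios if total_ios else 0.0
    return r_await, w_await, combined


# Made-up counters: 13 quick reads and 7 writes that queued up in a burst.
prev = {"rdios": 0, "wrios": 0, "rdticks": 0, "wrticks": 0}
curr = {"rdios": 13, "wrios": 7, "rdticks": 65, "wrticks": 2450}
print(split_await(prev, curr))  # prints: (5.0, 350.0, 125.75)
```
</pre>

<p>A combined <i>await</i> of about 126 ms looks alarming, yet in this made-up
sample reads wait only 5 ms; the writes alone account for the high figure.</p>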

<p>So, what practical conclusions can we make?</p>

<ul>

    <li><i>svctm</i> should be no greater than 1 divided by the worst case IOPS

    you expect the device to deliver when not idle.</li>

    <li><i>await</i> significantly higher than <i>svctm</i> indicates that

    requests are waiting in long lines, even if most of the time the device is

    idle. However, this may be benign when it is caused by dirty page flushes
    from log writes and other background activity.</li>

    <li>It may be good to monitor that <i>util</i> is not approaching 100%.

    However, in some configurations (loaded database servers with large RAIDs)

    %util may always be 100%. In this case, monitoring <i>avgqu-sz</i> may make

    more sense, since as it approaches the maximum concurrency of the device,

    the device approaches saturation.</li>

</ul>
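<p>These conclusions could be turned into a simple monitoring check along the
following lines. This Python sketch is only one possible reading of the list
above: the function name <i>check_sample</i>, the warning texts, and the
<i>near</i> threshold of 0.9 are all assumptions, and the worst case IOPS and
maximum concurrency have to come from your own knowledge of the device. It
deliberately does not alert on <i>await</i> alone, since a high value may be
benign.</p>

<pre>
```python
def check_sample(sample, worst_case_iops, max_concurrency, near=0.9):
    """Return a list of warnings for one line of iostat -x output."""
    warnings = []
    # svctm is in ms, so 1/IOPS has to be converted from seconds.
    svctm_limit_ms = 1000 / worst_case_iops
    if sample["svctm"] > svctm_limit_ms:
        warnings.append("svctm above 1 / worst case IOPS")
    # Assumption: "approaching" means within 10% of the limit.
    if sample["%util"] > near * 100:
        warnings.append("%util approaching 100")
    if sample["avgqu-sz"] > near * max_concurrency:
        warnings.append("avgqu-sz approaching device concurrency")
    return warnings


# The xvdap3 line from above, checked against a hypothetical single disk
# assumed to deliver at least 100 IOPS and to service one request at a time.
xvdap3 = {"avgqu-sz": 0.26, "await": 126.68, "svctm": 2.67, "%util": 0.55}
print(check_sample(xvdap3, worst_case_iops=100, max_concurrency=1))  # prints: []
```
</pre>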

</body>

</html>