On Tue, Jan 13, 2009 at 4:52 PM, Lars Ellenberg <span dir="ltr">&lt;<a href="mailto:lars.ellenberg@linbit.com">lars.ellenberg@linbit.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div><div></div><div class="Wj3C7c">On Tue, Jan 13, 2009 at 01:06:20PM -0500, Gennadiy Nerubayev wrote:<br>

&gt; On Fri, Dec 19, 2008 at 1:39 PM, Lars Ellenberg<br>

&gt; &lt;<a href="mailto:lars.ellenberg@linbit.com">lars.ellenberg@linbit.com</a>&gt;wrote:<br>

&gt;<br>

&gt; &gt; On Fri, Dec 19, 2008 at 09:24:32AM -0500, Gennadiy Nerubayev wrote:<br>

&gt; &gt; &gt; On Thu, Dec 18, 2008 at 1:50 PM, Lars Ellenberg &lt;<br>

&gt; &gt; <a href="mailto:lars.ellenberg@linbit.com">lars.ellenberg@linbit.com</a>&gt;<br>

&gt; &gt; &gt; wrote:<br>

&gt; Small update:<br>

&gt;<br>

&gt; 500MB/s makes sense if it&#39;s a single burst. What I&#39;m finding is that during<br>

&gt; a long sync, the speed fluctuates wildly, even though neither the network<br>

&gt; link nor the storage exhibit such fluctuations on their own. I made a graph<br>

&gt; showing this effect during a sync lasting ~40 minutes. A script ran cat<br>

&gt; /proc/drbd every second, taking the first speed value. The average after<br>

&gt; the first minute or two stabilized at ~385MB/s:<br>

<br>

</div></div>forget the &quot;first speed value&quot; in /proc/drbd<br>

the way it is calculated now, it takes<br>

a sample of yet-to-be-synced bits every ten seconds.<br>

<br>

so (resync_left, jiffies_at_sample_time)<br>

<br>

then, when you read /proc/drbd, it calculates the &quot;current&quot; sync speed<br>

in a straightforward way.<br>

but mind you, if that calculation happens only a jiffy after that sample<br>

time, you probably get a sync rate of either zero (in case during that<br>

jiffy resync_left has not changed), or a HUGE number (because<br>

there may have been a resync_left update in exactly that jiffy).<br>
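<br>
A minimal sketch of that effect, not the actual DRBD code: the helper name, the units and the 4 ms jiffy (HZ=250) are assumptions made only for illustration. It divides the progress since the stored sample by the time since that sample, which is the kind of calculation described above.<br>

```python
def naive_sync_rate(sample_left, sample_time_s, left_now, now_s):
    """Progress since the last stored sample divided by elapsed time.

    Hypothetical helper: mirrors the (resync_left, sample time) idea
    described above, not the real kernel implementation.
    """
    elapsed = now_s - sample_time_s
    if elapsed <= 0:
        return 0.0
    return (sample_left - left_now) / elapsed


JIFFY_S = 0.004  # assumption: HZ=250, so one jiffy is 4 ms

# Read 5 s after the sample, during a steady resync: a sane value.
steady = naive_sync_rate(10_000_000, 0.0, 8_000_000, 5.0)

# Read one jiffy after the sample, resync_left not yet updated: zero.
zero = naive_sync_rate(10_000_000, 0.0, 10_000_000, JIFFY_S)

# Read one jiffy after the sample, one update landed in that jiffy: huge.
huge = naive_sync_rate(10_000_000, 0.0, 10_000_000 - 4096, JIFFY_S)

print(steady, zero, huge)
```

The closer the read falls to the sample time, the more a single update (or its absence) dominates the result.<br>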

<br>

we used to have &quot;rolling averages&quot; there, years ago,<br>

but they got lost later for no particular reason.<br>

it is a very imprecise rough estimate,<br>

don&#39;t mistake it for a measurement.<br>

<br>

if you want to actually graph something drbd related,<br>

sample the numbers for dw, dr, ns, nr<br>

(counters, unit kB, disk write/read, net send/receive)<br>

al, bm<br>

(counters, activity log and bitmap meta data write counts in requests)<br>

oos (gauge: number of out-of-sync kB)<br>

and maybe ap, lo, pe, ua<br>

(gauges, not that interesting unless fine-tuning by experts).<br>
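<br>
One way to sample the fields listed above is to pull the name:number pairs out of the /proc/drbd text. This is a sketch under assumptions: the SAMPLE text is made up to resemble a single-device status block, and the exact layout of /proc/drbd may differ between DRBD versions. With several devices, this simple version would keep only the last device's values.<br>

```python
import re

# Field names taken from the list above; everything else is ignored.
COUNTER_RE = re.compile(r"\b(ns|nr|dw|dr|al|bm|lo|pe|ua|ap|oos):(\d+)")


def parse_counters(text):
    """Return {field: int} for the known numeric fields found in text."""
    return {name: int(value) for name, value in COUNTER_RE.findall(text)}


# Assumed, illustrative sample; not copied from a real system.
SAMPLE = """\
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:1234 nr:0 dw:0 dr:5678 al:0 bm:12 lo:0 pe:3 ua:0 ap:0 oos:987654
"""

counters = parse_counters(SAMPLE)
print(counters["oos"], counters["ns"], counters["dr"])
```

In a real sampler, the text would come from reading /proc/drbd once per interval, with each result stored alongside a timestamp.<br>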

<div class="Ih2E3d"><br>

&gt; There&#39;s a definite pattern<br>

</div>that pattern is probably a sampling error of a badly behaved<br>

(as explained above) gauge, and absolutely expected. ;)<br>

<br>

also, please note that whenever a new piece is cleared completely,<br>

the corresponding part of the bitmap is written,<br>

possibly causing a seek and a short pause during sync...<br>

<br>

do that &quot;experiment&quot; again, but sample oos,<br>

and plot &nbsp;( oos[t] - oos[t-3] ) / 3 ...<font color="#888888"></font></blockquote><div><br>Doh! You&#39;re right; by doing simple graphing of how oos decreases every second, I can see that it&#39;s really uniform, varying around ~375-400MB/s with no spikes whatsoever. Going to dig at this some more.<br>

<br>Thanks,<br><br>-Gennadiy</div></div>
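<br>
The oos differencing suggested in the thread can be sketched as follows, assuming one oos sample (in kB) per second. The function name and the sample values are hypothetical. Because oos shrinks during a resync, the older sample is taken minus the newer one so that the rate comes out positive.<br>

```python
def resync_rates(oos, window=3, interval_s=1.0):
    """Average resync rate over `window` samples, in kB/s.

    oos: out-of-sync values (kB), one per `interval_s` seconds.
    """
    return [
        (oos[t - window] - oos[t]) / (window * interval_s)
        for t in range(window, len(oos))
    ]


# Made-up samples: oos dropping by 400 kB every second.
samples = [4000, 3600, 3200, 2800, 2400]
rates = resync_rates(samples)
print(rates)
```

Differencing over a few samples smooths out the per-second jitter without hiding longer stalls.<br>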