On Tue, Jan 13, 2009 at 4:52 PM, Lars Ellenberg <lars.ellenberg@linbit.com> wrote:
> On Tue, Jan 13, 2009 at 01:06:20PM -0500, Gennadiy Nerubayev wrote:
> > On Fri, Dec 19, 2008 at 1:39 PM, Lars Ellenberg
> > <lars.ellenberg@linbit.com> wrote:
> >
> > > On Fri, Dec 19, 2008 at 09:24:32AM -0500, Gennadiy Nerubayev wrote:
> > > > On Thu, Dec 18, 2008 at 1:50 PM, Lars Ellenberg
> > > > <lars.ellenberg@linbit.com> wrote:
> > Small update:
> >
> > 500MB/s makes sense if it's a single burst. What I'm finding is that
> > during a long sync, the speed fluctuates wildly, even though neither the
> > network link nor the storage exhibits such fluctuations on its own. I
> > made a graph showing this effect during a sync lasting ~40 minutes. A
> > script ran cat /proc/drbd every second, taking the first speed value.
> > The average after the first minute or two stabilized at ~385MB/s:
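> >
> > (the loop was roughly the following; not the exact script, just the
> > idea, and the sed pattern assumes the 8.x /proc/drbd sync-progress
> > line format "speed: N,NNN (N,NNN) K/sec":)
> >
> >     while sleep 1; do
> >         # first number after "speed:" is the "current" rate, in K/sec
> >         sed -n 's/.*speed: *\([0-9,]*\).*/\1/p' /proc/drbd | head -n1
> >     done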
>
> forget the "first speed value" in /proc/drbd.
> the way it is calculated now, it takes a
> sample of the yet-to-be-synced bits every ten seconds.
>
> so it keeps the pair (resync_left, jiffies_at_sample_time).
>
> then, when you read /proc/drbd, it calculates the "current" sync speed
> from that sample in a straightforward way.
> but mind you, if that calculation happens only a jiffy after the sample
> was taken, you probably get a sync rate of either zero (in case
> resync_left has not changed during that jiffy), or a HUGE number
> (because there may have been a resync_left update in exactly that jiffy).
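>
> (to put made-up numbers on it: with HZ=250, one jiffy is 4 ms; assume a
> single 128 kB resync_left update, then the two extremes look like:)
>
>     awk 'BEGIN { dt = 1 / 250        # one jiffy at HZ=250, in seconds
>                  print   0 / dt      # no update in that jiffy ->     0 kB/s
>                  print 128 / dt }'   # one 128 kB update       -> 32000 kB/s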
>
> we used to have "rolling averages" there, some years ago,
> but they got lost later, for no particular reason.
> it is a very imprecise, rough estimate;
> don't mistake it for a measurement.
>
> if you want to actually graph something drbd related,
> sample the numbers for dw, dr, ns, nr
> (counters, unit kB: disk write/read, net send/receive),
> al, bm
> (counters: activity log and bitmap meta data write counts, in requests),
> oos (gauge: number of out-of-sync kB),
> and maybe ap, lo, pe, ua
> (gauges, not that interesting unless fine-tuning by experts).
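>
> for example, something like this (an untested sketch; it assumes the
> 8.x /proc/drbd layout with "name:value" fields on the status line):
>
>     while sleep 1; do
>         awk -v now="$(date +%s)" '
>             /ns:/ { printf "%s", now
>                     for (i = 1; i <= NF; i++)
>                         if ($i ~ /^(ns|nr|dw|dr|al|bm|oos|ap|lo|pe|ua):/)
>                             printf " %s", $i
>                     print "" }' /proc/drbd
>     done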
>
> > There's a definite pattern
>
> that pattern is probably a sampling error of a badly behaved
> (as explained above) gauge, and absolutely expected. ;)
>
> also, please note that whenever a new piece is cleared completely,
> the corresponding part of the bitmap is written,
> possibly causing a seek and a short pause during sync...
>
do that "experiment" again, but sample oos,<br>

Doh! You're right; by doing simple graphing of how oos decreases every
second, I can see that it's really uniform, varying around ~375-400MB/s
with no spikes whatsoever. Going to dig at this some more.

Thanks,

-Gennadiy