[DRBD-user] Speeding up sync rate on fast links and storage

Wed Jan 14 05:49:05 CET 2009

On Tue, Jan 13, 2009 at 4:52 PM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:

> On Tue, Jan 13, 2009 at 01:06:20PM -0500, Gennadiy Nerubayev wrote:
> > On Fri, Dec 19, 2008 at 1:39 PM, Lars Ellenberg
> > <lars.ellenberg at linbit.com>wrote:
> >
> > > On Fri, Dec 19, 2008 at 09:24:32AM -0500, Gennadiy Nerubayev wrote:
> > > > On Thu, Dec 18, 2008 at 1:50 PM, Lars Ellenberg
> > > > <lars.ellenberg at linbit.com> wrote:
> > Small update:
> >
> > 500MB/s makes sense if it's a single burst. What I'm finding is that during
> > a long sync, the speed fluctuates wildly, even though neither the network
> > link nor the storage exhibits such fluctuations on its own. I made a graph
> > showing this effect during a sync lasting ~40 minutes. A script ran cat
> > /proc/drbd every second, taking the first speed value. The average after
> > the first minute or two stabilized at ~385MB/s:
>
> forget the "first speed value" in /proc/drbd.
> the way it is calculated now, it takes a
> sample of the yet-to-be-synced bits every ten seconds.
>
> so (resync_left, jiffies_at_sample_time)
>
> then, when you read /proc/drbd, it calculates the "current" sync speed
> from that sample in a straightforward way.
> but mind you, if that calculation happens only a jiffy after that sample
> time, you probably get a sync rate of either zero (in case during that
> jiffy resync_left has not changed), or a HUGE number (because
> there may have been a resync_left update in exactly that jiffy).
>
> we used to have "rolling averages" there, years ago,
> but they got lost later for no particular reason.
> it is a very imprecise rough estimate,
> don't mistake it for a measurement.
>
> if you want to actually graph something drbd related,
> sample the numbers for dw, dr, ns, nr
> (counters, unit kB, disk write/read, net send/receive)
> al, bm
> (counters, activity log and bitmap meta data write counts in requests)
> oos (gauge: number of out-of-sync kB)
> and maybe ap, lo, pe, ua
> (gauges, not that interesting unless fine-tuning by experts).
>
> > There's a definite pattern
> that pattern is probably a sampling error of a badly behaved
> (as explained above) gauge, and absolutely expected ;)
>
> also, please note that whenever a new piece is cleared completely,
> the corresponding part of the bitmap is written,
> possibly causing a seek and a short pause during sync...
>
> do that "experiment" again, but sample oos,
> and plot  ( oos[t] - oos[t-3] ) / 3 ...
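
To make the sampling problem described above concrete, here is a minimal sketch of such an estimate. It assumes the rate is simply the progress since the last stored (resync_left, jiffies_at_sample_time) pair divided by the time elapsed since that sample. The function name, the kB units, the numbers, and HZ = 250 are illustrative assumptions, not DRBD's actual code.

```python
HZ = 250  # assumed timer frequency in jiffies per second (depends on kernel config)


def naive_rate_kb_per_s(sample_left_kb, sample_jiffies, now_left_kb, now_jiffies):
    """Hypothetical model of the estimate: progress since the last stored
    sample divided by the time elapsed since that sample."""
    dt = now_jiffies - sample_jiffies
    if dt <= 0:
        return 0.0
    return (sample_left_kb - now_left_kb) * HZ / dt


# A sample was just stored: 10,000,000 kB left at jiffy 2500 (made-up numbers).
sample_left, sample_time = 10_000_000, 2500

# Read one jiffy later, before resync_left has been updated: rate looks like zero.
one_jiffy_no_update = naive_rate_kb_per_s(
    sample_left, sample_time, sample_left, sample_time + 1)

# Read one jiffy later, right after a single 4096 kB update: rate looks huge.
one_jiffy_one_update = naive_rate_kb_per_s(
    sample_left, sample_time, sample_left - 4096, sample_time + 1)

# Read five seconds later, after 2,000,000 kB of progress: rate looks sane.
five_seconds_later = naive_rate_kb_per_s(
    sample_left, sample_time, sample_left - 2_000_000, sample_time + 5 * HZ)

print(one_jiffy_no_update, one_jiffy_one_update, five_seconds_later)
```

With these made-up numbers, the same underlying sync reads as 0, about 1 GB/s, or 400 MB/s depending only on when /proc/drbd happens to be read relative to the last sample.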

Doh! You're right; by doing simple graphing of how oos decreases every
second, I can see that it's really uniform, varying around ~375-400MB/s with
no spikes whatsoever. Going to dig at this some more.
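
For anyone wanting to repeat this, a rough sketch of the oos-based calculation follows. It assumes the counters appear in /proc/drbd as whitespace-separated name:number tokens, that only one device is listed (otherwise names would collide), and that oos is in kB as described above. The sample text, the function names, and the one-second sampling interval are illustrative. The sign is flipped relative to ( oos[t] - oos[t-3] ) / 3 so that a shrinking oos gives a positive rate.

```python
import re

# Matches tokens such as "ns:1234" or "oos:4000"; state fields like
# "cs:SyncSource" are skipped because their value is not numeric.
FIELD_RE = re.compile(r"\b([a-z]+):(\d+)\b")


def parse_counters(text):
    """Return a dict of the numeric name:value fields found in the text."""
    return {name: int(value) for name, value in FIELD_RE.findall(text)}


def sync_rates_kb_per_s(oos_samples, window=3, interval_s=1.0):
    """Rate over a sliding window: (oos[t-window] - oos[t]) / elapsed time."""
    return [
        (oos_samples[t - window] - oos_samples[t]) / (window * interval_s)
        for t in range(window, len(oos_samples))
    ]


# Illustrative text only; the real layout depends on the DRBD version.
example = """
 0: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
    ns:1234 nr:0 dw:0 dr:1234 al:0 bm:12 lo:0 pe:0 ua:0 ap:0 oos:4000
"""
counters = parse_counters(example)

# Made-up oos readings (kB), one per second.
samples = [4000, 3600, 3200, 2800, 2400, 2000]
rates = sync_rates_kb_per_s(samples)
print(counters["oos"], rates)
```

In practice the samples list would be filled by reading /proc/drbd once per interval and appending parse_counters(...)["oos"] each time.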

Thanks,

-Gennadiy