[DRBD-user] DRBD write speed

Lars Ellenberg lars.ellenberg at linbit.com
Mon May 9 13:50:10 CEST 2011

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, May 09, 2011 at 10:09:29AM +0200, Felix Frank wrote:
> On 05/08/2011 07:46 PM, Maxim Ianoglo wrote:
> > Hello,
> > 
> > Getting some "strange" results on write testing on DRBD Primary.
> > Every time I get more data written than a 1Gb link can handle.
> > I get about 133 MB/s with the 1Gb link saturated and both nodes in sync.
> > Also, if I make a test with files smaller than 1GB (for example 900MB), I always get
> > results between 650-750MB/s, and it does not matter which sync protocol I use in DRBD.
> > 
> > Does this have something to do with DRBD's implementation? Buffers or something ... 
> > 
> > Here is my configuration file:
> 
> <snip>
> 
> >   disk { 
> >     on-io-error detach; 
> >     no-disk-barrier;
> >     no-disk-flushes;
> >     no-md-flushes;
> >   }
> 
> Hi,
> 
> as far as I can tell, DRBD is happily stuffing your write cache without
> doing on-the-spot syncing. That's what you're basically telling it to
> do. (It *is* syncing of course, but the writes are acknowledged by the
> local RAID controller before the Secondary has received them.)
> 
> I remember you telling us about a BBU on your RAID controller, so this
> is probably what you want.
> 
> If you want to know the raw performance of DRBD, I think you can
> a) enable disk flushes or
> b) disable your write cache
> 
> Depending on your workload, you may well be writing at full cache speed
> practically all the time, so that value is still valid.
> 
> A question to those more adept at the concepts:
> 
> Come to think of it, I'm not sure why this is actually a good idea. If
> the primary crashes in this setup, won't the Secondary come up with up
> to 900MB of missing writes?
> When the Primary is restored, won't it mark the data that's salvaged
> from the BBU'ed cache as dirty based on the activity log?
> 
> (Yes, that's two questions actually.)

I get the feeling that someone needs to read up on caching,
especially the Linux page cache, the Linux IO stack,
and where DRBD sits in that stack.

We typically have
  [ applications ]
  [ application and library buffers ]
  [ file systems ]
  [  page cache  ]
  [ block layer  ] <=== DRBD lives here
                  `--- drbd replication via TCP ---> remote "DISK".
  [    "DISK"    ]

where "DISK" can itself again have caches,
which may or may not be volatile.

Things that are "write()"n typically only get as far as the page cache,
potentially not even there, but only into the library buffers.

Things that are "written", but not even the block layer knows about,
cannot possibly be replicated by DRBD within/below the block layer.
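
To make the hand-offs concrete, here is a minimal sketch (purely
illustrative; the file name, the data, and the mount point /mnt/drbd0
are made-up examples) of where one piece of data sits after each call:

  /* Illustrative sketch: which layer the data has reached after each call. */
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      FILE *f = fopen("/mnt/drbd0/testfile", "w");
      if (!f) { perror("fopen"); return 1; }

      /* 1. fwrite(): data sits in the stdio (library) buffer only. */
      fwrite("some data\n", 1, 10, f);

      /* 2. fflush(): issues write(2); data is now in the page cache.
       *    The block layer -- and therefore DRBD -- has not seen it yet. */
      fflush(f);

      /* 3. fsync(): forces the page cache to submit the data to the
       *    block layer (where DRBD replicates it) and waits for completion. */
      fsync(fileno(f));

      fclose(f);
      return 0;
  }

Until something (an fsync, or background writeback) pushes the data
into the block layer, DRBD has nothing to replicate.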

So, can you specify where you think those 900 MB, that are supposedly
"lost" and "missing" after Primary crash, are located?

If they live only in the page cache still (no fdatasync or fsync or
sync yet), what is DRBD supposed to do about that?
How does that differ from a single node crash?
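
Incidentally, that is most likely where the 650-750 MB/s for test
files smaller than 1GB comes from: such a test largely measures page
cache (RAM) speed. To see what DRBD, the replication link and the
backing storage actually sustain, the data has to be pushed past the
page cache, for example by writing with O_DIRECT and O_SYNC. A
minimal, illustrative sketch (the mount point /mnt/drbd0, block size
and total size are just example values):

  /* Illustrative sketch: time writes that bypass the page cache, so the
   * result reflects DRBD + link + "DISK", not RAM speed. */
  #define _GNU_SOURCE            /* for O_DIRECT */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
      const size_t bs = 1 << 20;     /* 1 MiB per write (example)  */
      const size_t count = 1024;     /* 1 GiB in total (example)   */
      void *buf;

      /* O_DIRECT needs an aligned buffer; 4096 covers common block sizes. */
      if (posix_memalign(&buf, 4096, bs)) { perror("posix_memalign"); return 1; }
      memset(buf, 0xab, bs);

      int fd = open("/mnt/drbd0/testfile",
                    O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
      if (fd < 0) { perror("open"); return 1; }

      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t i = 0; i < count; i++)
          if (write(fd, buf, bs) != (ssize_t)bs) { perror("write"); return 1; }
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      printf("%.1f MiB/s\n", count * (bs / 1048576.0) / secs);

      close(fd);
      free(buf);
      return 0;
  }

With protocol C, an O_SYNC write completes only after the Secondary
has acknowledged it, so the number is roughly what a sync-heavy
workload would actually see.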

If those writes are in the "DISK" cache, it depends.
  If that cache is volatile,
    and cache flushes have been *dis*abled or ignored,
      -> you subscribed to data corruption.
         DRBD would not be able to reliably determine
         the regions that would need to be resynced,
         and would only get it right "by accident".
    and cache flushes (barriers or flushes, respectively, on the DRBD
    side) have been *en*abled, and are honored by the "DISK",
      -> DRBD can reliably restore the regions that potentially need
         resyncing from the DRBD activity log.
  If that cache is non-volatile, it was "persistent" storage after all,
  which is all we need to know.

In all cases, if you have been using DRBD protocol B or C, all writes
that have been confirmed to the upper layers have reached the
Secondary, and are thus available there in case of failover.

If you are using protocol A, well, you consciously made that decision.
With protocol A, there may be data missing on the Secondary after
failover. The worst-case amount depends on the "capacity" of your
replication link, which is typically on the order of your
bandwidth-delay product plus socket buffers, so a few MiB.
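(For example, at 1 Gbit/s with a 1 ms round-trip time, the
bandwidth-delay product alone is only about 125 KB.)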
If the link includes the DRBD Proxy, it may even accumulate several
GiB of backlog, which is its very purpose.

Hope that helps to put some things in perspective.

Cheers,

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


