[DRBD-user] Large block IO bottleneck

Ross S. W. Walker rwalker at medallion.com
Wed Jan 3 18:00:16 CET 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


> -----Original Message-----
> From: Philipp Reisner [mailto:philipp.reisner at linbit.com] 
> Sent: Wednesday, January 03, 2007 10:57 AM
> To: Ross S. W. Walker
> Cc: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] Large block IO bottleneck
> 
> On Wednesday, 3 January 2007 16:20, Ross S. W. Walker wrote:
> > > -----Original Message-----
> > > From: drbd-user-bounces at lists.linbit.com
> > > [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of
> > > Philipp Reisner
> > > Sent: Wednesday, January 03, 2007 5:58 AM
> > > To: drbd-user at lists.linbit.com
> > > Subject: Re: [DRBD-user] Large block IO bottleneck
> > >
> > > On Tuesday, 2 January 2007 22:05, Ross S. W. Walker wrote:
> > > > Hi there I am using DRBD 0.7.21 with iSCSI Enterprise Target
> > > > 0.4.14 on CentOS 4.4.
> > > >
> > > > When I run iSCSI direct to the LVM lv on top of hardware RAID I
> > > > can get 225 MB/s over two sessions in MPIO with 256K block size,
> > > > but when I put DRBD in-between iSCSI and LVM the throughput tops
> > > > out at 80 MB/s and I can't seem to go over that.
> > > >
> > > > DRBD seems to report its max number of sectors as 8 (4K), does
> > > > that mean each io operation is limited to 4K? My hardware raid
> > > > reports its max sectors as 128, could this explain the reduction
> > > > to 1/3 throughput?
> > >
> > > Hi,
> > >
> > > The cause for the limitation to 4k is the Linux-2.4 compatibility
> > > of DRBD-0.7.
> > >
> > > Repeat your test with drbd-8.0(rc1).
> > >
> > > drbd-8.0 will do BIOs up to 32k, but much more important are other
> > > changes (e.g. the non-blocking make_request() function) that make
> > > drbd-8.0 scale much better with high-end hardware.
> > >
> > > PS: What kind of network link are you using ?
> >
> > We're using dual 1Gbps adapters, one for each path in the MPIO
> > connection (actually 4 adapters in 2 separate bonded pairs using
> > ALB, since we have multiple initiators, 4 to be exact).
> 
> So, this is 2Gbps per logical link, right?
> 
> Is this for iSCSI only, or is it also 2Gbps for DRBD?

This is for iSCSI only; I have a separate 1Gbps interface for
replication, but since the final destination will be reached over a
high-latency, low-bandwidth link, I don't think gigabit speed is
needed there.
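
Just to put numbers on those links, a back-of-the-envelope sketch (the
90% efficiency factor is my own assumption for TCP/iSCSI overhead):

    # Rough usable throughput of the links discussed above; the 0.90
    # efficiency factor is an assumption for protocol overhead.
    def usable_mb_per_s(gbps, efficiency=0.90):
        return gbps * 1000 / 8 * efficiency   # Gbit/s -> MB/s

    print(usable_mb_per_s(2))   # ~225 MB/s: the two 1Gbps MPIO paths,
                                # matching the 225 MB/s seen without drbd
    print(usable_mb_per_s(1))   # ~112 MB/s: the single 1Gbps replication NIC

So the 225 MB/s figure is essentially wire speed for the two iSCSI
paths, while the 80 MB/s ceiling with drbd in the data path is well
below what either the network or the array can do.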

> BTW, the load balancing of most switches is not usable for DRBD,
> since most of them do a constant mapping of target MAC to port.
> If you use that for DRBD you will get all the traffic on a single
> 1Gbps port :(

Yes, if you use 802.3ad link aggregation, but I'm using the Linux
bonding driver's adaptive load balancing (balance-alb), which
redistributes peers across slaves via ARP replies based on load, which
of course means the CPU and bus have to keep up with the increased
interrupts.

But since I am not running DRBD replication over the bonded channels
this is academic.
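
For anyone wondering why switch-side hashing pins DRBD to one port: the
outgoing port/slave is normally picked from a static hash of the MAC
pair, and a DRBD connection is a single MAC pair. A simplified model of
that hash (mirroring the Linux bonding "layer2" xmit_hash_policy; real
switches use similar static mappings, and the MAC addresses here are
made up):

    # Layer-2 transmit hash: XOR of the last byte of source and
    # destination MAC, modulo the number of slaves/ports.
    def l2_slave(src_mac, dst_mac, n_slaves):
        src = int(src_mac.split(":")[-1], 16)
        dst = int(dst_mac.split(":")[-1], 16)
        return (src ^ dst) % n_slaves

    # One DRBD peer = one MAC pair = always the same slave:
    print(l2_slave("00:16:3e:00:00:01", "00:16:3e:00:00:02", 2))

    # Several iSCSI initiators = several MAC pairs = traffic spreads out:
    for last in ("10", "11", "12", "13"):
        print(l2_slave("00:16:3e:00:00:" + last, "00:16:3e:00:00:02", 2))

balance-alb sidesteps this by handing different slave MACs to different
peers in ARP replies, which is why it helps with many initiators; a
single DRBD TCP connection would still ride one link at a time either
way.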

> > Is there an issue with using the max_sectors from the underlying
> > hardware, so that DRBD would scale up or down depending on the
> > backing device that is used?
> 
> The 32k is a rather arbitrary value we have chosen in DRBD. Such a
> limit is needed for some of the algorithms in the two-primary-node
> area, e.g. the write conflict detection code.

I think that's my answer: the internal data structures depend on a
fixed maximum BIO size rather than on the maximum BIO size of the
underlying hardware, which would vary from device to device.
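
For anyone following along, the sector counts quoted above translate to
per-request sizes like this (512-byte sectors; the 64-sector figure is
just drbd-8.0's 32k limit expressed in sectors):

    # max_sectors is counted in 512-byte sectors, so the largest single
    # request the layer will accept is:
    def max_request_kib(max_sectors, sector_bytes=512):
        return max_sectors * sector_bytes / 1024

    print(max_request_kib(8))    # 4.0 KiB  - drbd 0.7 (Linux 2.4 compatible)
    print(max_request_kib(64))   # 32.0 KiB - drbd 8.0's fixed upper limit
    print(max_request_kib(128))  # 64.0 KiB - the hardware RAID controller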

> If you really want you can change that define (HT_SHIFT), but the
> positive effects are probably outnumbered by negative effects (more
> collisions in hash tables etc.).
> 
> We did quite some measuring and our result was that the actual size
> of the BIOs does not influence the performance.

Over a 1Gbps link, or over something faster (2Gbps, 10Gbps)?

All I know is that with drbd configured as Primary/Standalone and no
peers, my throughput is capped at 80 MB/s; without drbd it hits
225 MB/s. I am sure upping the BIO size to 32K will help, but I am
pretty sure it will not reach what the hardware is capable of.

Of course the proof is in the testing though ;-)
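
If anyone wants to repeat the comparison, a minimal sketch of such a
test (the device paths are placeholders, and it overwrites whatever is
on them, so only point it at scratch volumes):

    # Sequential-write throughput probe using O_DIRECT so the page cache
    # doesn't mask the block device.  WARNING: destroys data on the target;
    # the paths below are placeholders for scratch volumes.
    import mmap, os, time

    def write_mb_per_s(path, block_kib=256, total_mib=512):
        blk = block_kib * 1024
        buf = mmap.mmap(-1, blk)      # anonymous mmap is page-aligned,
        buf.write(b"\xa5" * blk)      # which O_DIRECT requires
        fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
        try:
            t0 = time.time()
            written = 0
            while written < total_mib * 1024 * 1024:
                written += os.write(fd, buf)
            return written / (time.time() - t0) / 1e6
        finally:
            os.close(fd)

    print(write_mb_per_s("/dev/vg0/scratch_lv"))  # raw LV (placeholder)
    print(write_mb_per_s("/dev/drbd0"))           # the same LV behind drbd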

> > Of course DRBD may have to automatically re-configure the
> > min-buffers that it needs depending on the size of the BIOs it
> > accepts, so replication at that speed doesn't overflow.
> 
> >
> > The secondary peer isn't in place yet here, and when it does come
> > online it will be geographically separated and therefore reached
> > over a high-latency, low-bandwidth connection. I am planning on
> > replicating to this peer asynchronously using Prot A; is there a
> > formula for calculating the optimum snd_buffer based on
> > dataset/bandwidth/latency?
> >
> 
> Huh.
> 
> I think you are concerned about performance.
> 
> The issue is, when the snd_buffer is full on the primary node, it
> has to block the writing application until there is space for
> the next write in the snd_buffer. 
> 
> The outflow rate of the snd_buffer is the bandwidth of your
> replication link.
> 
> E.g. with a snd_buffer of 1M, and a bandwidth of 1MBit/sec on the
> replication network:
> 
> Writing a 990kb file will be as fast as your local disk is.
> Writing a second 990kb file will take approx 10 seconds!!!
> 
> (1MByte / ~100KByte/s =~ 10 seconds.)

So the optimum snd_buffer setting depends on the performance of the
local hardware and the write pattern of the application on top,
balanced against the throughput to the remote host and that remote
host's underlying hardware?

Of course the goal is to have the same performance on the local host
as if drbd were not in use. So the snd_buffer should be very large for
an asynchronous peer if you expect heavy IO on the replicated device,
and just moderately large otherwise.
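
Putting the example above into numbers, plus one possible sizing rule
(the burst-based formula is just a rule of thumb of mine, not something
from the drbd documentation):

    # The 1M-buffer / 1Mbit/s example from above, plus a rough sizing
    # rule: the snd_buffer should absorb the largest write burst you
    # expect, minus what the link drains while the burst happens.
    def stall_seconds(backlog_bytes, link_bytes_per_s):
        return backlog_bytes / link_bytes_per_s

    link = 1_000_000 / 8                   # 1 Mbit/s ~= 125 KB/s
    print(stall_seconds(990_000, link))    # ~8-10 s for the second 990kb file

    def suggested_sndbuf(burst_bytes, burst_seconds, link_bytes_per_s):
        return max(0, burst_bytes - burst_seconds * link_bytes_per_s)

    # e.g. a 64 MB burst written over 2 s against a 10 MB/s WAN link:
    print(suggested_sndbuf(64e6, 2, 10e6)) # ~44 MB of snd_buffer needed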

Is there any thought to doing scheduled synchronizations between two
peers instead of fully synchronous IO? Or to disconnecting a Prot A
peer when the snd_buffer fills and re-connecting/re-syncing once the
backlog drops below a high-water mark, so writes aren't blocked during
a write-IO spike? (A toy sketch of that idea follows below.) The point
is that some replication doesn't need to be fully synchronous, and
sometimes a small loss of data is acceptable, say whatever is lost
between log writes in a database that just handles the general ledger.
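
To make that last idea concrete, a toy model of the disconnect/resync
policy I'm describing; nothing like this exists in drbd today, and all
the names and thresholds are made up:

    # Toy model of the proposed policy: queue writes for the async peer
    # while the snd_buffer has room, drop to a disconnected
    # (dirty-bitmap-style) mode instead of blocking the writer when it
    # fills, and reconnect/resync once the backlog drains below a mark.
    class AsyncPeerPolicy:
        def __init__(self, sndbuf_bytes, resync_mark=0.25):
            self.capacity = sndbuf_bytes
            self.resync_mark = resync_mark * sndbuf_bytes
            self.backlog = 0
            self.connected = True

        def on_write(self, nbytes):
            if self.connected and self.backlog + nbytes > self.capacity:
                self.connected = False   # don't block the writer; track
                                         # the write out-of-band instead
            if self.connected:
                self.backlog += nbytes   # queued for the peer
            return self.connected

        def on_drain(self, nbytes):
            self.backlog = max(0, self.backlog - nbytes)
            if not self.connected and self.backlog <= self.resync_mark:
                self.connected = True    # reconnect and resync here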

Thanks for answering my two-fer question.




