[DRBD-user] Performance with DRBD + iSCSI

Ross S. W. Walker rwalker at medallion.com
Thu Feb 22 02:23:03 CET 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


> -----Original Message-----
> From: Weilin Gong [mailto:wgong at alcatel-lucent.com] 
> Sent: Wednesday, February 21, 2007 7:39 PM
> To: Ross S. W. Walker
> Cc: drbd-user at linbit.com
> Subject: Re: [DRBD-user] Performance with DRBD + iSCSI
> 
> Ross S. W. Walker wrote:
> >> -----Original Message-----
> >> From: Weilin Gong [mailto:wgong at alcatel-lucent.com] 
> >> Sent: Wednesday, February 21, 2007 5:47 PM
> >> To: Ross S. W. Walker
> >> Cc: drbd-user at linbit.com
> >> Subject: Re: [DRBD-user] Performance with DRBD + iSCSI
> >>
> >> Ross S. W. Walker wrote:
> >>>> -----Original Message-----
> >>>> From: drbd-user-bounces at lists.linbit.com 
> >>>> [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Weilin Gong
> >>>> Sent: Wednesday, February 21, 2007 1:11 PM
> >>>> Cc: drbd-user at lists.linbit.com
> >>>> Subject: Re: [DRBD-user] Performance with DRBD + iSCSI
> >>>>
> >>>> Ross S. W. Walker wrote:
> >>>>> You can only write into drbd using what your application 
> >>>>> can handle, and for VFS file operations that is 4k io! 
> >>>>
> >>>> On Solaris ufs, the "maxcontig" parameter can be tuned to specify 
> >>>> the number of contiguous blocks written to the disk. Haven't found 
> >>>> the equivalent on Linux yet.
> >>> Well, if you write your own app you can bypass VFS page-memory io
> >>> restriction by using the generic block layer.
> >>>
> >>> I'm not sure if you quite understand the maxcontig parameter either:
> >>>
> >>> maxcontig=n The maximum number of logical
> >>> blocks, belonging to one file, that
> >>> are allocated contiguously. The
> >>> default is calculated as follows...
> >>>
> >>> This parameter is for tuning disk space allocation in order to 
> >>> reduce fragmentation; it doesn't affect the io block size.
> >> Actually, this defines the max io data size the file system 
> >> sends down to the driver. We have a home-grown vdisk driver, 
> >> similar to drbd, where "maxcontig" had to be tuned to match the 
> >> size of the buffer allocated for the network transport.
> >
> > Ok, so I found this:
> >
> > ----------------
> > Tune maxcontig
> >
> > Under the Solaris OE, UFS uses an extent-like feature called 
> > clustering. It is impossible to have a default setting for maxcontig 
> > that is optimal for all file systems. It is too application dependent. 
> > Many small files accessed in a random pattern do not need extents and 
> > performance can suffer for both reads and writes when using extents. 
> > Larger files can benefit from read-ahead on reads and improved 
> > allocation units when using extents in writes.
> >
> > For reads, the extent-like feature is really just a read ahead. To
> > simply and dynamically tune the read-ahead algorithm, use the 
> > tunefs(1M) command as follows:
> >
> > # tunefs -a 4 /ufs1
> >
> > The value changed is maxcontig, which sets the number of file system
> > blocks read in read ahead. The preceding example changes the maximum
> > contiguous block count from 32 (the default) to 4.
> >
> > When a process reads more than one file system block, the kernel
> > schedules reads to fill the rest of maxcontig * file system blocksize
> > bytes. A single 8 kilobyte, or smaller, random read on a file does not
> > trigger read ahead. Read ahead does not occur on files being read with
> > mmap(2).
> >
> > The kernel attempts to automatically detect whether an application is
> > doing small random or large sequential I/O. This often works fine, but
> > the definition of small or large depends more on system application
> > criteria than on device characteristics. Tune maxcontig to obtain
> > optimal performance.
> > ----------------
> >
> > So this is like setting read-ahead with blockdev --setra XX /dev/XX
> "maxcontig" is also used for write:
> 
> In UFS, the filesystem cluster size, for both reads and 
> writes, is set 
> to the value set for /maxcontig/. The filesystem cluster size 
> is used to 
> determine:
> 
>     * The maximum number of logical blocks contiguously laid 
> out on disk
>       for a UFS filesystem before inserting a rotational delay.
>     * When, and the amount to read ahead and/or write behind if the
>       sequential IO case is found. The algorithm that determines
>       sequential read ahead in UFS is broken, so system administrators
>       use the /maxcontig/ value to tune their filesystems to achieve
>       better random I/O performance.
>     * The UFS filesystem cluster size also indicates how many pages to
>       attempt to push out to disk at a time. It also determines the
>       frequency of pushing pages because in UFS pages are 
> clustered for
>       writes, based on the filesystem cluster size.

Interesting, so many seemingly unrelated aspects of performance are tuned
with a single parameter; it must be hard to get it just right.

In Linux, each device driver has its own max_sectors and
max_phys_segments, which in turn determine how large a request that
driver can process; the i/o scheduler makes sure, when merging
contiguous requests on the queue, not to build them past that point.
max_sectors is the maximum number of disk sectors per request, while
max_phys_segments is the maximum number of scattered memory segments
that the device can do DMA to in a single request.
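
For example, on a 2.6 kernel you can usually peek at these limits
through sysfs (the exact files vary with kernel version, and sdb below
is just a placeholder device):

  cat /sys/block/sdb/queue/max_hw_sectors_kb  # hard per-request limit from the driver/HBA
  cat /sys/block/sdb/queue/max_sectors_kb     # current soft per-request limit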

Some drivers have tunable max_sectors, but max_phys_segments is a
limitation of the HBA.
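
Where a driver does expose it, the soft limit can usually be adjusted
(up to the hardware limit) through the same sysfs file, roughly like
this (the value and device are only illustrative):

  echo 512 > /sys/block/sdb/queue/max_sectors_kb   # cap requests at 512 KB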

Read-ahead is tunable as above. Write-back is handled transparently by
the page cache, but it can be tuned to a degree using VFS API features.
Size is the main concern: if the write-back backlog grows too large,
there is a substantial performance penalty to other i/o ops in flight
when it finally flushes.
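
For what it's worth, the coarse system-wide knobs look roughly like this
(the values are only examples and sdb is again a placeholder; the
dirty_* sysctls apply to the whole box, not to a single device):

  blockdev --getra /dev/sdb                # current read-ahead, in 512-byte sectors
  blockdev --setra 1024 /dev/sdb           # e.g. 512 KB of read-ahead
  sysctl -w vm.dirty_background_ratio=5    # start background write-back earlier
  sysctl -w vm.dirty_ratio=20              # cap dirty memory before writers block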

-Ross




