[DRBD-user] [LVM2 + DRBD + Xen + DRBD 8.0] error on dom0 (the physical server) and on domU (the virtual machine)

Ross S. W. Walker rwalker at medallion.com
Thu Aug 16 16:34:07 CEST 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com 
> [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Lars 
> Ellenberg
> Sent: Thursday, August 16, 2007 9:11 AM
> To: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] [LVM2 + DRBD + Xen + DRBD 8.0] error
> on dom0 (the physical server) and on domU (the virtual machine)
> 
> On Thu, Aug 16, 2007 at 10:35:25AM +0200, Maxim Doucet wrote:
> > Sorry for the duplicate, a problem with my email client.
> > 
> > Here is a forwarded email from the xen-users mailing list where
> > someone encountered the same problem.
> > 
> > A workaround is given, and further testing is done, so I can only
> > recommend reading it.
> > 
> > The forwarded mail
> > (http://lists.xensource.com/archives/html/xen-users/2007-08/msg00375.html):
> > > On Tue, 14 Aug 2007, Maxim Doucet wrote:
> > >
> > >> I experience the following error messages when launching the
> > >> virtual machine:
> > >> *On dom0 : the physical server* (messages coming from dmesg) :
> > >> drbd0: bio would need to, but cannot, be split:
> > >> (vcnt=2,idx=0,size=2048,sector=126353855)
> > >> drbd0: bio would need to, but cannot, be split:
> > >> (vcnt=2,idx=0,size=2048,sector=126353855)
> > >
> > > We are using a nearly identical configuration and experienced
> > > the same problem just today:
> > >
> > > LVM2 on DRBD under Xen 3.0.3 w/ DRBD 8.0.4, using CentOS 5 on
> > > x86_64, dom0 kernel 2.6.18-8.1.8-el5xen.
> > >
> > > The virtual machine is an FC6 x86_64 PV guest and gave similar
> > > guest errors.
> > >
> > > The workaround we are using is to change
> > >
> > > disk = [ 'phy:/dev/vg-drbd/vm0,xvda,w' ]
> > >    to
> > > disk = [ 'tap:aio:/dev/vg-drbd/vm0,xvda,w' ]
> > >
> > > This treats the underlying backing image as a file.  This may have
> > > some performance loss since it is not using direct device IO, but
> > > as far as I can tell it is stable.  Or at least, phy: fails
> > > miserably, where tap:aio: works fine!
> > >
> > > This seems to indicate that it's not an LVM+DRBD or Xen+LVM
> > > problem, but rather a Xen+LVM+DRBD using phy: problem.  I tested
> > > to see if Xen liked running LVM on a loopback device and loading a
> > > VM off it using phy: (see below).  It worked fine, which makes me
> > > think this is more of a DRBD issue than a Xen or LVM issue.
> 
> the "problem" as I see it, is, that the xen virtual block device layer
> makes wrong assumtions, creates its own bios, maybe even
> respecting "max-segmend-size", but aparently completely ignoring the
> "bdev_merge_bvec_fn". If you want to add a page to a bio, you have to
> use bio_add_page. you must not just assume that, because the 
> device has
> a max_segment_size of 32k, that it will accept a bio containing a bvec
> of 4 pages at every offset. this is not true. it may have 
> offsets where
> it can only accept a single page (and then even have to split 
> that page
> internally into two bios).
> 
> we have seen this also on md raid5, or md raid0, *no DRBD involved*.
> 
> most drivers/devices do not have any special offset limitation,
> but raid5 or raid0 have their chunk size (raid1 does not, also linear
> does not, apart from the device borders).
> 
> Interestingly, when you have device-mapper on top of some other
> device with a merge_bvec_fn, device-mapper will announce a max
> segment size of only 4K, which should mask this away. You could try
> to verify this by using a dm linear mapping on top of drbd.
> 
> I did not read the xen code, but I assume that it basically does
>    b = bio_alloc(,4);
>    b->bi_io_vec[0].page = page0; offset...; len...;
>    b->bi_io_vec[1].page = page1; ...
>    b->bi_io_vec[2].page = page2; ...
>    b->bi_io_vec[3].page = page3; ...
> 
> where it should do
>   b = bio_alloc(...); 
>   if (!b) whatever;
>   initialize bio with target block device etc.
>   until all pages to be submitted are submitted:
> 	  if (bio_add_page(b,page,len,offset) != len) {
> 		submit current bio;
> 		b = bio_alloc(...)
> 		if (!b) whatever;
>   		initialize bio with target block device etc.
> 	  }
> 
> But maybe I misunderstand something,
> so please correct me if I'm wrong.

Yes, I came across this error myself when developing a block-layer
device handler for IET (iSCSI Enterprise Target).

To give an idea, here is how I implemented it to get the maximum
performance out of the underlying block device; I think this approach
should work with almost all standard block-layer drivers.

1) Get the largest number of vectors/segments/pages the underlying
block device can handle via bio_get_nr_vecs():

        /* Request queue of the underlying block device */
        bdev_q = bdev_get_queue(bio_data->bdev);
        if (bdev_q)
                max_pages = bio_get_nr_vecs(bio_data->bdev);

2) Allocate a work structure to keep track of the bios in
flight. This will be used in the endio helper to signal
completion (the structure itself is sketched after the snippet).

        tio_work = kzalloc(sizeof (*tio_work), GFP_KERNEL);
        if (!tio_work)
                return -ENOMEM;

        atomic_set(&tio_work->error, 0);
        atomic_set(&tio_work->bios_remaining, 0);
        init_completion(&tio_work->tio_complete);
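
For reference, the work structure above only needs three fields. A
minimal sketch (field names match the calls above; the exact
definition in IET may differ):

        struct tio_work {
                atomic_t error;                 /* first error reported by any bio */
                atomic_t bios_remaining;        /* bios still in flight */
                struct completion tio_complete; /* fired when the last bio completes */
        };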

3) Loop through the data pages, adding them to bios in chunks of up
to the size the device or kernel can handle, allocating and filling
all the bios before submission and keeping track of them in a linked
list.

        /* Main processing loop, allocate and fill all bios.
         * ppos, size, offset and tio_index are assumed to have been
         * initialized from the request before this point. */
        while (tio_index < tio->pg_cnt) {
                bio = bio_alloc(GFP_KERNEL, min(max_pages, BIO_MAX_PAGES));
                if (!bio) {
                        err = -ENOMEM;
                        goto out;
                }

                bio->bi_sector = ppos >> volume->blk_shift;
                bio->bi_bdev = bio_data->bdev;
                bio->bi_end_io = blockio_bio_endio;
                bio->bi_private = tio_work;
                if (tio_bio)
                        biotail = biotail->bi_next = bio;
                else
                        tio_bio = biotail = bio;

                atomic_inc(&tio_work->bios_remaining);

                /* Loop for filling bio */
                while (tio_index < tio->pg_cnt) {
                        unsigned int bytes = PAGE_SIZE - offset;

                        if (bytes > size)
                                bytes = size;

                        if (!bio_add_page(bio, tio->pvec[tio_index], bytes, offset))
                                break;

                        size -= bytes;
                        ppos += bytes;

                        offset = 0;

                        tio_index++;
                }
        }

4) Walk the submission list in a tight loop to get all the bios onto
the queue as quickly as possible and take advantage of any further
queue optimizations. Don't be tempted to submit the whole linked list
in one go (leaving the bios chained via bi_next), as it can cause
some underlying devices to choke, since it hints to the queue that
all the bios should be merged.

        /* Walk the list, submitting bios 1 by 1 */
        while (tio_bio) {
                bio = tio_bio;
                tio_bio = tio_bio->bi_next;
                bio->bi_next = NULL;

                submit_bio(rw, bio);
        }

5) Unplug the queue and let it do its work.

        if (bdev_q && bdev_q->unplug_fn)
                bdev_q->unplug_fn(bdev_q);

6) Wait for the work to complete before returning status.

        wait_for_completion(&tio_work->tio_complete);

        err = atomic_read(&tio_work->error);

        kfree(tio_work);

        return err;
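
For completeness, the endio helper wired up in step 3 only needs to
record any error, release the bio, and wake the submitter when the
last bio in flight finishes. A rough sketch against the 2.6.18-era
bi_end_io signature (the actual IET implementation may differ):

        static int blockio_bio_endio(struct bio *bio, unsigned int bytes_done, int error)
        {
                struct tio_work *tio_work = bio->bi_private;

                /* Partial completion; wait until the whole bio is done */
                if (bio->bi_size)
                        return 1;

                /* Remember that at least one bio failed */
                if (!bio_flagged(bio, BIO_UPTODATE))
                        atomic_set(&tio_work->error, -EIO);

                /* Last bio in flight wakes up the waiter in step 6 */
                if (atomic_dec_and_test(&tio_work->bios_remaining))
                        complete(&tio_work->tio_complete);

                bio_put(bio);

                return 0;
        }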

This process should provide the best performance/compatibility with
the majority (if not all) of underlying block devices.

-Ross




