[DRBD-user] "bio would need to, but cannot, be split" (xen on lvm on drbd)

Lars Ellenberg lars.ellenberg at linbit.com
Fri May 23 18:11:04 CEST 2008



this is FYI, to document the issue, cause, and possible workarounds.

when you stack
  xen DomU xvds on lvm on drbd,
you might run into issues with spurious io errors in the DomU,
journal aborts, read-only remounts, or kernel panics (depending on the
filesystem and mount options within the DomU),
and "bio would need to, but cannot, be split" messages from drbd
in the kernel log of the Dom0.

	this is not DRBD's fault.
	it is not Xen's fault, either.

root cause of this problem:

we currently have to split anything that crosses our 32K segmentation
boundaries.
why we have these 32k boundaries was explained in this thread
http://thread.gmane.org/gmane.linux.kernel.drbd.devel/742/focus=743

the generic bio_add_page and our bio_merge_bvec_fn take care of
following that restriction.

now, when you use drbd as a pv for lvm, the generic block layer and any
filesystem above do not talk to drbd, but to dm-linear (most of the
time), which exports a "failsafe"[*] generic PAGE_SIZE (4K)
max_segment_size, but does not export any further alignment
restrictions, or maximum bvec count restrictions.

[*] this failsafe is basically: "does the lower device (of that device
mapper target) export a bio_merge_bvec_fn? if so, ignore it,
but export a max_segment_size of PAGE_SIZE instead."
unfortunately that is not at all failsafe, for the reasons given
below. it would break just as well on software raid0 (a stripe set,
which has alignment requirements for bios similar to those of drbd).

any user submitting scattered data may end up submitting a
multi-bvec bio of <= 4K that still crosses our 32K boundary due to
misalignment.

why?
because the generic block layer (__bio_add_page) can now only
follow the restrictions exposed by device mapper,
which basically only say "a bio must be no larger than 4K".
so if I do __bio_add_page for scattered sector-sized data chunks,
that will succeed until the bio size reaches 4K, but the resulting bio
has multiple bvecs, and, if misaligned, may very well still cross our
internal 32K boundaries.
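
to illustrate with plain shell arithmetic (not drbd code; our 32K
boundary is 32768 bytes, and sector 63 is just the typical misaligned
partition start mentioned below):

  start=$(( 63 * 512 ))          # first byte of a 4K bio at sector 63: 32256
  end=$(( start + 4096 - 1 ))    # last byte covered by that bio: 36351
  echo $(( start / 32768 )) $(( end / 32768 ))
  # prints "0 1": first and last byte land in different 32K segments,
  # so this bio crosses a boundary and would have to be split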

the xen block layer is known to scatter its io buffers over many pages.
typical installers partition with 63 sectors per track (misalignment).
this is actually not even necessary, but a legacy of old DOS. not even
recent Windows versions do this anymore; they start the first partition
at sector 2048, because you get much better performance out of your
potential RAID set when not doing misaligned IO. but I digress.
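
if you want to check where your (virtual) partitions actually start,
something like this should do (the device path is only an example; -u
makes fdisk print offsets in sectors):

  # fdisk -lu /dev/VG/xen_disk_lv

a start sector of 63 is the misaligned legacy layout, 2048 (or any other
multiple of your raid chunk size) is aligned.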

which means that in such a setup (xvd -> lv -> vg -> pv -> drbd)
it is very likely to get misaligned multi-bvec bios,
which then, well, "would need to, but cannot, be split",
because the generic bio_split only handles single-bvec bios,
and we have not written a multi-bvec bio split yet.

there are several workarounds:
a)
  make drbd expose additional limits
  (max_phys_segments = 1, max_hw_segments = 1)
  which are correctly stacked into devicemapper, so that the block layer will
  never assemble a bio with multiple segments (bvecs).
  single-bvec bios are handled correctly by bio_split.

  with current drbd versions, this effect can be achieved by limiting the
  lower level device (of drbd) like this (assuming drbd lives on sdb):
  # echo 4 > /sys/block/sdb/queue/max_sectors_kb
  (which may or may not cause an additional performance hit)
  and reattach drbd:
  # drbdadm down $res; drbdadm up $res
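  to verify that the setting took effect:
  # cat /sys/block/sdb/queue/max_sectors_kb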

  we will include an option keyword with the next release,
  that will toggle this "max_*_segments = 1" limitation.

  we might also reconsider exporting the bio_merge_bvec_fn, and instead
  introduce a generic split function on our make_request function level,
  but that is some coding effort (not too much, but anyways) and will
  have to wait for some later release.

b)
  fix the device mapper "failsafe" to also export max_*_segments = 1
  when it falls back to PAGE_SIZE bios because it ignores the
  merge_bvec_fn.

c)
  4K-align (even better: raid-chunk-align, e.g. 512k or 1M) your
  (virtual) partitions within the lv.
   [unrelated, but when living on RAID 5/10 etc.,
    it does help performance when you chunk-align the whole stack,
    starting with the "physical" partitions (if any)]

  this is not a generic fix, it just "happens" to fix it: the file
  system will now use aligned bios (of 4K max, because that is the
  limitation imposed by devicemapper), and even multi-bvec bios will no
  longer cross our 32k boundaries, so there is no need to split anything.
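
  just to illustrate the alignment arithmetic (plain shell, the numbers
  are only examples; adjust start sector and chunk size to your setup):

  start_sector=2048   # partition start in 512-byte sectors, see fdisk -lu
  chunk_kb=512        # raid chunk size to align to (use 4 for plain 4K alignment)
  echo $(( (start_sector * 512) % (chunk_kb * 1024) ))
  # prints 0 when the partition start is aligned to the chunk size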

d)
  don't use devicemapper on top of drbd.
  probably not an option, since you carefully chose that stacking setup, right?
  but just in case, see http://www.drbd.org/users-guide/ch-xen.html and
  http://blogs.linbit.com/florian/2007/09/03/drbd-806-brings-full-live-migration-for-xen-on-drbd/
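
  for reference, the users guide linked above describes pointing a DomU
  disk directly at a drbd resource via the block-drbd helper script;
  from memory the config line looks roughly like this (the resource name
  is made up, please check the guide):

  disk = [ 'drbd:myresource,xvda,w' ]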

e)
  don't use DRBD
  [since you are reading drbd-user, this is probably not an option either]

f)
  anything else I overlooked.


-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :


