Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
this is FYI, to document the issue, cause, and possible workarounds. when you stack xen DomU xvds on lvm on drbd, you might run into issues with spurious io errors in the DomU, journal aborts, remount-readonly, kernel panics (depending on filesystem and mount options within the DomU). and "bio would need to, but cannot, be split" messages from drbd in the kernel log of the Dom0. this is not DRBD's fault. it is not Xen's fault, either. root cause of this problem: we currently have to split anything that crosses our 32K segmentation boundaries. why we have these 32k boundaries was explained in this thread http://thread.gmane.org/gmane.linux.kernel.drbd.devel/742/focus=743 the generic bio_add_page and our bio_merge_bvec_fn take care of following that restriction. now, when you use drbd as a pv for lvm, the generic block layer and any filesystem above do not talk to drbd, but to dm-linear (most of the time), which exports a "failsafe"[*] generic PAGE_SIZE (4K) max_segment_size, but does not export any further alignment restrictions, or maximum bvec count restrictions. [*] this failsafe is basically "does the lower device (of that device mapper target) export a "bio_merge_bvec_fn"? if so, ignore it but export a max_segment_size of PAGE_SIZE. unfortunately that is not at all failsafe, because of reasons given below. it would break just as well on software raid0 (stripeset, has alignment requirements for bios similar to those of drbd). any user submitting scattered data may end up submitting a multi-bvec-bio of <= 4K, that still crosses our 32K boundary due to misallignment. why? because the generic block layer (__bio_add_page) now can only follow the restrictions exposed by device mapper, which basically only say "bio needs to be smaller than 4K". so if I do __bio_add_page for scattered sector sized data chunks, that would succeed until the bio size is 4K, but the resulting bio has multiple bvecs, and, if misaligned, may very well still cross our internal 32K boundaries. the xen block layer is known to scatter its io buffers over many pages. typical installers do partition with 63 sectors per track (misalignment). this is actually not even necessary, but legacy to old DOS. Not even recent Windows do this anymore, but rather start the first partition at sector 2048, because you get much better performance out of your potential RAID-set when not doing misaligned IO. but I digres. which means that in such a setup (xvd -> lv -> vg -> pv -> drbd) it is very likely to get misaligned multi-bvec bios. which then, well, "would need to, but cannot, be split". because the generic bio_split only handles single-vec bios, and we did not write a multi_bio_split thing yet. there are several work arounds: a) make drbd expose additional limits (max_phys_segments = 1, max_hw_segments = 1) which are correctly stacked into devicemapper, so that the block layer will never assemble a bio with multiple segments (bvecs). single-bvec bios are handled correctly by bio_split. with current drbd versions, this can be caused by limitting the lower level device (of drbd) like this (assuming drbd lives on sdb): # echo 4 > /sys/block/sdb/queue/max_sectors_kb (which may or may not cause an additional performance hit) and reattach drbd: # drbdadm down $res; drbdadm up $res we will include an option keyword with the next release, that will toggle this "max_*_segments = 1" limitation. 
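To make the 32K rule concrete, here is a minimal standalone sketch of the window check that bio_add_page and a bio_merge_bvec_fn effectively enforce when DRBD's limits are visible. This is not the kernel API (the real merge_bvec_fn takes request_queue/bio_vec arguments); allowed_bytes and the sample offsets are made up for illustration.

    #include <stdio.h>

    #define BOUNDARY (32 * 1024)    /* drbd's 32K segmentation boundary */

    /* how many bytes may still be added to a request starting at byte
     * `offset`, currently `len` bytes long, without crossing the next
     * 32K boundary after its start? */
    static unsigned int allowed_bytes(unsigned long long offset, unsigned int len)
    {
        unsigned long long limit = (offset | (BOUNDARY - 1)) + 1;
        unsigned long long end = offset + len;
        return end >= limit ? 0 : (unsigned int)(limit - end);
    }

    int main(void)
    {
        /* aligned start: a bio at offset 0 may grow to the full 32K */
        printf("%u\n", allowed_bytes(0, 0));            /* prints 32768 */
        /* misaligned start (sector 63 = byte 32256): only 512 bytes fit */
        printf("%u\n", allowed_bytes(63ULL * 512, 0));  /* prints 512 */
        return 0;
    }

A merge_bvec_fn that answers "how much of this new bvec do I accept" with such a window check is what keeps bios from crossing the boundary, as long as the upper layers actually consult it.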
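And here, with the same made-up constants, is why device mapper's PAGE_SIZE-only "failsafe" is not enough: a bio built from scattered sector-sized chunks on a partition starting at sector 63 stays within dm's 4K limit, yet still crosses a 32K boundary. Again a standalone simulation of the __bio_add_page behaviour described above, not kernel code.

    #include <stdio.h>

    #define SECTOR   512
    #define DM_LIMIT 4096           /* the only limit dm-linear exports */
    #define BOUNDARY (32 * 1024)    /* drbd's segmentation boundary */

    int main(void)
    {
        /* partition starts at sector 63, so the bio starts at byte 32256 */
        unsigned long long start = 63ULL * SECTOR;
        unsigned int size = 0;

        /* __bio_add_page-style loop: each scattered sector-sized chunk
         * becomes another bvec; only the 4K size limit is checked */
        while (size + SECTOR <= DM_LIMIT)
            size += SECTOR;

        unsigned long long end = start + size;  /* 32256 + 4096 = 36352 */
        printf("bio spans bytes %llu..%llu (%u bytes: fine for dm)\n",
               start, end - 1, size);
        if (start / BOUNDARY != (end - 1) / BOUNDARY)
            printf("crosses the 32K boundary at %llu: "
                   "\"would need to, but cannot, be split\"\n",
                   (start / BOUNDARY + 1) * (unsigned long long)BOUNDARY);
        return 0;
    }

With a 2048-sector (1M) aligned partition, the same loop would produce a bio from byte 1048576 to 1052671, which stays inside one 32K window; that is exactly why workaround c) below helps.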
There are several workarounds:

a) Make DRBD expose additional limits (max_phys_segments = 1, max_hw_segments = 1), which are correctly stacked into device mapper, so that the block layer will never assemble a bio with multiple segments (bvecs); single-bvec bios are handled correctly by bio_split.

   With current DRBD versions, this behaviour can be triggered by limiting the lower level device (of DRBD) like this (assuming DRBD lives on sdb):
     # echo 4 > /sys/block/sdb/queue/max_sectors_kb
   (which may or may not cause an additional performance hit), and then reattaching DRBD:
     # drbdadm down $res; drbdadm up $res
   We will include an option keyword with the next release that toggles this "max_*_segments = 1" limitation.

   We might also reconsider exporting the bio_merge_bvec_fn, and instead introduce a generic split function at our make_request function level; but that is some coding effort (not too much, but anyway) and will have to wait for some later release.

b) Fix the device mapper "failsafe" to also set max_*_segments = 1 whenever it limits itself to PAGE_SIZE bios because it ignores the merge_bvec_fn.

c) 4K-align (even better: RAID-chunk-align, e.g. 512K or 1M) your (virtual) partitions within the LV. [Unrelated, but when living on RAID 5/10 etc., it does help performance to chunk-align the whole stack, starting with the "physical" partitions, if any.] This is not a generic fix, it just "happens" to fix it: the file system will now use aligned bios (of 4K max, because that is the limit imposed by device mapper), so even multi-bvec bios won't cross our 32K boundaries anymore, and there will be no need to split anything.

d) Don't use device mapper on top of DRBD. Probably not an option, since you carefully chose that stacking setup, right? But just in case, see
   http://www.drbd.org/users-guide/ch-xen.html and
   http://blogs.linbit.com/florian/2007/09/03/drbd-806-brings-full-live-migration-for-xen-on-drbd/

e) Don't use DRBD. [Since you are reading drbd-user, this is probably not an option either.]

f) Anything else I overlooked.

-- 
: Lars Ellenberg                           Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH     Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :