Lars,

> first, _why_ did you have to run
> a recovery on your software raid, at all?

The reason for the re-sync was that I saved a copy of the data "disk" during the DRBD and kernel upgrades, so once everything looked good, I needed to sync my data disks back up.

> second, _any_ io on the devices under md recovery
> will cause "degradation" of the sync.
> especially random io, or io causing it to seek often.

My problem with the explanation that the resync is slow because DRBD has a device enabled (i.e. because *any* disk I/O slows the sync) is that it doesn't explain why the host holding the Primary roles completed the exact same re-sync in a third of the time.

The DRBD devices were in sync while these mdadm re-syncs were running, so any traffic to the DRBD devices should have had the same disruptive effect on *both* hosts.

Is there something different about a host whose DRBD devices are in the Secondary role vs. the Primary role? We use Protocol C, in case that helps.

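For what it's worth, this is roughly how I've been watching the resync, and the knobs I plan to experiment with next (the 50000 value below is just an example, not what we actually run):

  # Watch resync progress and current throughput
  cat /proc/mdstat

  # The kernel throttles the resync when it sees competing I/O;
  # these sysctls (KB/s per device) set the floor and ceiling.
  cat /proc/sys/dev/raid/speed_limit_min
  cat /proc/sys/dev/raid/speed_limit_max

  # Example only: raise the minimum so other I/O can't starve
  # the resync quite as badly.
  echo 50000 > /proc/sys/dev/raid/speed_limit_min
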
>> and a pretty disturbing issue with resizing an LV/DRBD
>> device/Filesystem with flexible-meta-disk: internal,

> care to give details?

I posted to the drbd-users list on Tue Nov 11 08:14:37 CET 2008, but got no response. I ended up having to salvage the data manually (almost 170GB worth) and recreate the DRBD devices. Quite a bummer. I was hoping the internal metadata would make Xen live migration easier, but if it causes the resize to fail, that's not so good. I've switched the new DRBD device back to an external metadata device and will test with that.

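For reference, the resize sequence I was attempting was roughly the following (the LV, resource, and device names here are just placeholders for this example):

  # Grow the backing LV on both nodes first
  lvextend -L +20G /dev/vg0/xen-guest-disk

  # Tell DRBD to grow the replicated device to match the larger
  # backing store (run once both backing LVs have been extended)
  drbdadm resize xen-guest

  # Finally grow the filesystem on the Primary
  resize2fs /dev/drbd0

With flexible-meta-disk: internal, the DRBD metadata sits at the end of the backing device instead of on a separate meta-disk, so the resize path is a bit different; that's what I'll be comparing against the external-metadata setup.
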
Thanks for the follow up ... I will investigate a bit more.

BTW, this is not intended as a dig at DRBD ... I've been using DRBD since the old 0.7.x releases ... I'm just trying to report what I've observed in case there is a special case or bug that can be found and fixed.

For a visual of the basic layout of our setup, see http://skylab.org/~mush/Xen_Server_Architecture.png

That diagram was done in June 2006 to describe the configuration we'd already been running for a while ... our current implementation is similar, just newer code, newer servers, and much, much more storage (and more guests). The reasoning behind putting the md raid1 below LVM is that if we lose a physical disk, we don't have to repair a *bunch* of little mirrors or other devices. It has proven to be a stable and flexible architecture. We run guests on both servers (load balancing), which we would not be able to do if DRBD were lower in the stack (i.e. under the LVM).

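In command form, the stack gets built roughly like this (the device, volume group, and resource names are only illustrative, not our real ones):

  # Bottom of the stack: one big raid1 mirror across the two disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

  # LVM on top of the mirror, so every LV is automatically mirrored
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate -L 20G -n xen-guest-disk vg0

  # One DRBD resource per guest on top of its LV, replicating to the
  # peer host; the Xen guest then uses /dev/drbdN as its block device
  drbdadm create-md xen-guest
  drbdadm up xen-guest
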
The next iteration is to swap out the md raid1 for an md raid5 (yes, some performance will suffer, but Linux raid5 can be grown dynamically a disk at a time, which I can't do with our current architecture), and to use the new stacking features of DRBD 8.3 to keep a remote node in sync. Our current remote sync is done with rsync, but our datasets are so large now that it takes several hours just to build the file list (as well as more than 1GB of RAM). Considering the data that changes each day is, at most, a few percent of the total data space, something like DRBD makes a LOT more sense!

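If I understand the md tools correctly, the grow itself would look something like this (device names are examples again):

  # Add the new disk as a spare, then reshape the array onto it
  mdadm --add /dev/md0 /dev/sdd1
  mdadm --grow /dev/md0 --raid-devices=4

  # Once the reshape finishes, let LVM see the extra space
  pvresize /dev/md0
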
Sorry for going a little off topic, but I thought someone else might gain some value from seeing how our environment has been built and running for ~3 years now.

Thanks,
Sam