Thanks for the input.  You&#39;re right that 2 days is too little time to do this, so I&#39;m going the manual route of shutting down one server at a time, migrating its virtual disks, then bringing it back up at the remote site.<div>

<br></div><div>To avoid more downtime from manual migration once this is all over, I think I will first attempt just getting a DRBD resource up and running to sync my servers back to the primary datacenter.  Can a DRBD resource be set up on an existing LVM volume without affecting the data?  Also, since I don&#39;t plan to have automatic failover, are there any precautions I should take in case the network connection between the two datacenters is lost?  Ideally this would allow me to have minimal downtime while the nodes re-sync.</div>

<div><br></div><div>You&#39;re correct, I really don&#39;t make a habit of this.  Unfortunately, steam + fiber bundles = big mess.  The main junction for my campus, which supplies connectivity between about 2/3 of the buildings, is now without protection and the glass is exposed.  Even with 12 fiber splicers and multiple mobile clean rooms, I&#39;m told this repair will take 4-5 days.<br>

<div><br></div><div>Hopefully this event will allow my organization&#39;s executives to see how critical automatic failover and replication off-site can be.  Thanks again for the input.</div><div><br></div><div>- Trey<br>

<br><div class="gmail_quote">On Wed, Dec 14, 2011 at 3:44 AM, Felix Frank <span dir="ltr">&lt;<a href="mailto:ff@mpexnet.de">ff@mpexnet.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi,<br>

<br>

for your basic questions: Yes, your design idea is sound and it should<br>

work without any major problems, see exceptions below.<br>

<br>

Getting this in production in 2 days time without any hands on Pacemaker<br>

experience, though - that&#39;s one hell of a call. (I&#39;m assuming this isn&#39;t<br>

something you&#39;ve yet made a habit of.)<br>

<br>

I suggest you focus on the DRBD side of things and see if you can<br>

establish a synced resource. From your description of the situation, a<br>

manual failover near the start of work hours will still be much<br>

preferable to a week of downtime, so it may suffice?<br>

<br>

As for the aforementioned exceptions: During network failure, you will<br>

most definitely run into split-brain, i.e. the VMs in your datacenter<br>

remain operational and do whatever work they were doing when<br>

connectivity failed. The failover VMs will boot as though they had<br>

crashed at the moment of network failure. So once connectivity is<br>

restored, DRBD will tell you that in the datacenter, stuff has been<br>

written that your failover VMs never knew about.<br>

<br>

Normally (if you had time and resources), you would implement STONITH<br>

(a.k.a. fencing) to protect yourself by killing the original VMs during<br>

failover. As your time frame probably won&#39;t allow you to set this up<br>

properly (&quot;testing? heh&quot;), you may want to settle for the manual<br>

approach: Anticipate the split brain and be ready to discard whatever<br>

the original VMs saw fit to commit to their disks after getting<br>

disconnected.<br>

<br>

HTH,<br>

Felix<br>

<div class="HOEnZb"><div class="h5"><br>

On 12/13/2011 10:50 PM, Trey Dockendorf wrote:<br>

&gt; I have somewhat of an emergency on my hands, and am hoping the community<br>

&gt; may have some insight.  One of the primary fiber rings on my campus will<br>

&gt; be down for a week; if the damaged fiber fails sooner, it will be down<br>

&gt; then.  During this time my primary datacenter could possibly<br>

&gt; have intermittent connectivity to other buildings / outside work.<br>

&gt;  Unfortunately we do not have the resources for a remote data center<br>

&gt; (yet), but for now I&#39;m setting up a remote KVM server in a portion of<br>

&gt; campus that will not be affected.  Currently all VMs are QCOW2 images<br>

&gt; that live on a 1.2TB Logical volume.  This seems like the perfect<br>

&gt; situation to use DRBD in active-passive.  However I&#39;m not currently<br>

&gt; trying to prepare for hardware failure but network failure.  Is it<br>

&gt; possible to do this, to have the two LVMs synced with DRBD and then have<br>

&gt; a resource manager (Pacemaker) detect that the two can&#39;t communicate<br>

&gt; (network failure) and then activate the passive node?  I want to avoid<br>

&gt; as much complexity as possible, so no live migration or dual primary if<br>

&gt; possible.<br>

&gt;<br>

&gt; This is really a temporary solution for 1 week for all<br>

&gt; my organization&#39;s web servers, but it has made our executives aware of<br>

&gt; our need for more funds and a remote datacenter.  So hopefully this<br>

&gt; won&#39;t be the last time I can use DRBD, but I have to do this in the next 2 days.<br>

&gt;<br>

&gt; Any help is greatly appreciated.<br>

</div></div></blockquote></div><br></div></div>