Thanks for the input. You're right that 2 days is too little time to do this, so I'm going the manual route of shutting one server down at a time, migrating the virtual disks, then bringing it back up at the remote site.<div>
<br></div><div>To avoid the downtime of another manual migration once this is all over, I think I will first attempt just getting a DRBD resource up and running to sync my servers back to the primary datacenter. Can a DRBD resource be created on an existing LVM logical volume without affecting the data? Also, since I don't plan to have automatic failover, are there any precautions I should take in case the network connection is lost between the two datacenters? Ideally this would allow me to have minimal downtime while the nodes re-sync.</div>
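<div><br></div><div>For my own planning, here is a rough sketch of how much room DRBD's internal metadata might need at the end of the volume, since as far as I understand that is what decides whether the existing data can stay in place. It assumes the size estimate given in the DRBD 8.x user's guide (512-byte sectors) and treats my 1.2TB volume as 1.2 TiB. The function names are my own, and I still need to verify the numbers against the version I actually install.</div>
<pre>
```python
SECTOR_BYTES = 512          # DRBD's estimate is expressed in 512-byte sectors
SECTORS_PER_CHUNK = 2**18   # divisor used in the 8.x user's guide estimate


def drbd_internal_metadata_sectors(device_bytes):
    """Estimate internal metadata size in sectors for a backing device.

    Assumed formula (DRBD 8.x user's guide): ceil(Cs / 2^18) * 8 + 72,
    where Cs is the device size in sectors.
    """
    device_sectors = device_bytes // SECTOR_BYTES
    chunks = -(-device_sectors // SECTORS_PER_CHUNK)  # integer ceiling division
    return chunks * 8 + 72


def drbd_internal_metadata_mib(device_bytes):
    """Same estimate, converted to MiB for use with LVM sizing."""
    return drbd_internal_metadata_sectors(device_bytes) * SECTOR_BYTES / 2**20


# Assumption: the "1.2TB" logical volume is 1.2 TiB.
lv_bytes = int(1.2 * 2**40)
print(f"metadata sectors: {drbd_internal_metadata_sectors(lv_bytes)}")
print(f"metadata MiB:     {drbd_internal_metadata_mib(lv_bytes):.2f}")
```
</pre>
<div>If that estimate holds, growing the logical volume by roughly that amount before creating the resource (or keeping the metadata on a separate small volume) should leave the existing filesystem untouched, but please correct me if I have this wrong.</div>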
<div><br></div><div>You're correct, I really don't make a habit of this. Unfortunately steam + fiber bundles = big mess. The main junction for my campus, which supplies connectivity between about 2/3 of the buildings, is now without protection and the glass is exposed. Even with 12 fiber splicers and multiple mobile clean rooms, I'm told this repair will take 4-5 days.<br>
<div><br></div><div>Hopefully this event will allow my organization's executives to see how critical automatic failover and replication off-site can be. Thanks again for the input.</div><div><br></div><div>- Trey<br>
<br><div class="gmail_quote">On Wed, Dec 14, 2011 at 3:44 AM, Felix Frank <span dir="ltr">&lt;<a href="mailto:ff@mpexnet.de">ff@mpexnet.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi,<br>
<br>
for your basic questions: Yes, your design idea is sound and it should<br>
work without any major problems, see exceptions below.<br>
<br>
Getting this in production in 2 days' time without any hands-on Pacemaker<br>
experience, though - that's one hell of a call. (I'm assuming this isn't<br>
something you've yet made a habit of.)<br>
<br>
I suggest you focus on the DRBD side of things and see if you can<br>
establish a synced resource. From your description of the situation, a<br>
manual failover near the start of work hours will still be much<br>
preferable to a week of downtime, so it may suffice?<br>
<br>
As for the aforementioned exceptions: During network failure, you will<br>
most definitely run into split-brain, i.e. the VMs in your datacenter<br>
remain operational and do whatever work they were doing when<br>
connectivity failed. The failover VMs will boot as though they had<br>
crashed at the moment of network failure. So once connectivity is<br>
restored, DRBD will tell you that in the datacenter, stuff has been<br>
written that your failover VMs never knew about.<br>
<br>
Normally (if you had time and resources), you would implement STONITH<br>
(a.k.a. fencing) to protect yourself by killing the original VMs during<br>
failover. As your time frame probably won't allow you to set this up<br>
properly ("testing? heh"), you may want to settle for the manual<br>
approach: Anticipate the split brain and be ready to discard whatever<br>
the original VMs saw fit to commit to their disks after getting<br>
disconnected.<br>
<br>
HTH,<br>
Felix<br>
<div class="HOEnZb"><div class="h5"><br>
On 12/13/2011 10:50 PM, Trey Dockendorf wrote:<br>
> I have somewhat of an emergency on my hands, and am hoping the community<br>
> may have some insight. One of the primary fiber rings on my campus will<br>
> be down for a week, unless the damaged fiber fails, it will be down<br>
> then. During this time my primary datacenter could possibly<br>
> have intermittent connectivity to other buildings / outside work.<br>
> Unfortunately we do not have the resources for a remote data center<br>
> (yet), but for now I'm setting up a remote KVM server in a portion of<br>
> campus that will not be affected. Currently all VMs are QCOW2 images<br>
> that live on a 1.2TB Logical volume. This seems like the perfect<br>
> situation to use DRBD in active-passive. However I'm not currently<br>
> trying to prepare for hardware failure but network failure. Is it<br>
> possible to do this, to have the two LVMs synced with DRBD and then have<br>
> a resource manager (Pacemaker) detect that the two can't communicate<br>
> (network failure) and then activate the passive node? I want to avoid<br>
> as much complexity as possible, so no live migration or dual primary if<br>
> possible.<br>
><br>
> This is really a temporary solution for 1 week for all<br>
> my organization's web servers, but it has made our executives aware of<br>
> our need for more funds and a remote datacenter. So hopefully this<br>
> won't be the last time I can use DRBD, but I have to do this in the next 2 days.<br>
><br>
> Any help is greatly appreciated.<br>
</div></div></blockquote></div><br></div></div>