[DRBD-user] DRBD failover between datacenters if one's network fails

Wed Dec 14 10:44:26 CET 2011

Hi,

As for your basic question: Yes, your design idea is sound and it should
work without any major problems, with the exceptions noted below.

Getting this into production in two days' time without any hands-on
Pacemaker experience, though - that's one hell of a call. (I'm assuming
this isn't something you've yet made a habit of.)

I suggest you focus on the DRBD side of things and see if you can
establish a synced resource. From your description of the situation, a
manual failover near the start of work hours will still be much
preferable to a week of downtime, so it may suffice?
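
A rough sketch of how "is the resource synced?" might be checked from a
script, assuming DRBD 8.x, where /proc/drbd carries one status line per
device with cs: (connection state), ro: (roles) and ds: (disk states)
fields. The sample line and the helper names parse_status / is_synced
are illustrative assumptions, not part of DRBD itself.

```python
import re

# Illustrative status line in the DRBD 8.x /proc/drbd style (assumed values).
SAMPLE = " 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----"


def parse_status(line):
    """Extract connection state, roles and disk states from one status line."""
    fields = dict(re.findall(r"\b(cs|ro|ds):(\S+)", line))
    local_role, _, peer_role = fields.get("ro", "").partition("/")
    local_disk, _, peer_disk = fields.get("ds", "").partition("/")
    return {
        "cs": fields.get("cs"),
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }


def is_synced(status):
    """True only if the peers are connected and both disks are UpToDate."""
    return (
        status["cs"] == "Connected"
        and status["local_disk"] == "UpToDate"
        and status["peer_disk"] == "UpToDate"
    )


if __name__ == "__main__":
    print(is_synced(parse_status(SAMPLE)))
```

On a real node you would feed it the device lines read from /proc/drbd
instead of SAMPLE. Anything other than Connected plus UpToDate/UpToDate
means a failover right now would start from stale data.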

As for the aforementioned exceptions: During network failure, you will
most definitely run into split-brain, i.e. the VMs in your datacenter
remain operational and do whatever work they were doing when
connectivity failed. The failover VMs will boot as though they had
crashed at the moment of network failure. So once connectivity is
restored, DRBD will tell you that in the datacenter, stuff has been
written that your failover VMs never knew about.

Normally (if you had time and resources), you would implement STONITH
(a.k.a. fencing) to protect yourself by killing the original VMs during
failover. As your time frame probably won't allow you to set this up
properly ("testing? heh"), you may want to settle for the manual
approach: Anticipate the split-brain and be ready to discard whatever
the original VMs saw fit to commit to their disks after getting
disconnected.
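
For reference, the manual recovery could look roughly like the sketch
below. It only builds and prints the command sequence and runs nothing.
Assumptions: the resource is called r0 (a made-up name), the datacenter
node is the "victim" whose post-disconnect writes get thrown away, its
VMs have already been shut down, and the DRBD 8.4 option placement is in
use (on 8.3 the option is passed as
"drbdadm -- --discard-my-data connect r0" instead).

```python
def split_brain_recovery_plan(resource="r0"):
    """Return the drbdadm commands for each side as argument lists.

    "victim" is the node whose changes are discarded (here: the original
    datacenter node); "survivor" is the node whose data is kept.
    """
    victim = [
        ["drbdadm", "disconnect", resource],
        ["drbdadm", "secondary", resource],
        # DRBD 8.4 syntax; throws away this node's divergent writes.
        ["drbdadm", "connect", "--discard-my-data", resource],
    ]
    # Only needed if the survivor has dropped to StandAlone.
    survivor = [["drbdadm", "connect", resource]]
    return {"victim": victim, "survivor": survivor}


if __name__ == "__main__":
    for side, commands in split_brain_recovery_plan().items():
        for cmd in commands:
            print(f"{side}: {' '.join(cmd)}")
```

Once both sides reconnect, the victim resyncs from the survivor and
everything it wrote while disconnected is gone, so make sure you pick
the right side before typing any of this.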

HTH,
Felix

On 12/13/2011 10:50 PM, Trey Dockendorf wrote:
> I have somewhat of an emergency on my hands, and am hoping the community
> may have some insight.  One of the primary fiber rings on my campus will
> be down for a week, unless the damaged fiber fails first, in which case
> it will be down from then.  During this time my primary datacenter could
> possibly have intermittent connectivity to other buildings / the outside
> world.
>  Unfortunately we do not have the resources for a remote data center
> (yet), but for now I'm setting up a remote KVM server in a portion of
> campus that will not be affected.  Currently all VMs are QCOW2 images
> that live on a 1.2TB Logical volume.  This seems like the perfect
> situation to use DRBD in active-passive.  However I'm not currently
> trying to prepare for hardware failure but network failure.  Is it
> possible to do this, to have the two LVMs synced with DRBD and then have
> a resource manager (Pacemaker) detect that the two can't communicate
> (network failure) and then activate the passive node?  I want to avoid
> as much complexity as possible, so no live migration or dual primary if
> possible.
> 
> This is really a temporary solution for 1 week for all
> my organization's web servers, but it has made our executives aware of
> our need for more funds and a remote datacenter.  So hopefully this
> won't be the last time I can use DRBD, but I have to do this in the next
> 2 days.
> 
> Any help is greatly appreciated.