[DRBD-user] DRBD failover between datacenters if one's network fails

Trey Dockendorf treydock at gmail.com
Thu Dec 15 17:09:35 CET 2011



Thanks for the input.  You're right that 2 days is too little time to do
this, so I'm going the manual route of shutting one server down at a time,
migrating its virtual disks, then bringing it back up at the remote site.

To avoid more downtime from manual migration once this is all over, I
think I will first attempt just getting a DRBD resource up and running to
sync my servers back to the primary datacenter.  Can a DRBD resource be
created on an existing LVM volume without affecting the data?  Also, since
I don't plan to have automatic failover, are there any precautions I
should take if the network connection is lost between the two datacenters?
Ideally this would allow me to have minimal downtime while the nodes
re-sync.
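For context, here is roughly what I have in mind for layering DRBD on top
of the existing LV.  This is only a sketch: the node names, IPs and the
metadata LV are placeholders, and I'm assuming DRBD 8.3-style syntax.  My
understanding is that external metadata is what keeps DRBD from writing
into the existing data, since internal metadata would claim the tail of
the device.

```shell
# Sketch only -- node names, IPs and LV paths are placeholders.
# External metadata is used so DRBD never touches the existing 1.2TB LV.

cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
  protocol C;

  on kvm-primary {
    device    /dev/drbd0;
    disk      /dev/vg0/vmstore;        # existing LV holding the qcow2 images
    address   10.0.0.1:7788;
    meta-disk /dev/vg0/drbdmeta[0];    # small spare LV for external metadata
  }

  on kvm-remote {
    device    /dev/drbd0;
    disk      /dev/vg0/vmstore;
    address   10.0.0.2:7788;
    meta-disk /dev/vg0/drbdmeta[0];
  }
}
EOF

# On both nodes:
drbdadm create-md r0
drbdadm up r0

# Only on the node that already holds the data, to kick off the initial sync:
drbdadm -- --overwrite-data-of-peer primary r0
```

And for the network-loss precaution: if the link drops and both sides
diverge, my reading of the docs is that the manual cleanup after
reconnecting is `drbdadm secondary r0` followed by
`drbdadm -- --discard-my-data connect r0` on the node whose writes get
thrown away, and a plain `drbdadm connect r0` on the survivor.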

You're correct, I really don't make a habit of this.  Unfortunately steam +
fiber bundles = big mess.  The main junction for my campus that supplies
connectivity between about 2/3 of the buildings is now without protection
and the glass is exposed.  Even with 12 fiber splicers and multiple mobile
clean rooms, this repair will take 4-5 days I'm told.

Hopefully this event will allow my organization's executives to see how
critical automatic failover and replication off-site can be.  Thanks again
for the input.

- Trey

On Wed, Dec 14, 2011 at 3:44 AM, Felix Frank <ff at mpexnet.de> wrote:

> Hi,
>
> for your basic questions: Yes, your design idea is sound and it should
> work without any major problems, see exceptions below.
>
> Getting this in production in 2 days time without any hands on Pacemaker
> experience, though - that's one hell of a call. (I'm assuming this isn't
> something you've yet made a habit of.)
>
> I suggest you focus on the DRBD side of things and see if you can
> establish a synced resource. From your description of the situation, a
> manual failover near the start of work hours will still be much
> preferable to a week of downtime, so it may suffice?
>
> As for the aforementioned exceptions: During network failure, you will
> most definitely run into split-brain, i.e. the VMs in your datacenter
> remain operational and do whatever work they were doing when
> connectivity failed. The failover VMs will boot as though they had
> crashed at the moment of network failure. So once connectivity is
> restored, DRBD will tell you that in the datacenter, stuff has been
> written that your failover VMs never knew about.
>
> Normally (if you had time and resources), you would implement STONITH
> (a.k.a. fencing) to protect yourself by killing the original VMs during
> failover. As your time frame probably won't allow you to set this up
> properly ("testing? heh"), you may want to settle for the manual
> approach: Anticipate the split brain and be ready to discard whatever
> the original VMs saw fit to commit to their disks after getting
> disconnected.
>
> HTH,
> Felix
>
> On 12/13/2011 10:50 PM, Trey Dockendorf wrote:
> > I have somewhat of an emergency on my hands, and am hoping the community
> > may have some insight.  One of the primary fiber rings on my campus will
> > be down for a week, unless the damaged fiber fails first, in which case
> > it will go down sooner.  During this time my primary datacenter could
> > possibly have intermittent connectivity to other buildings and the
> > outside world.
> >  Unfortunately we do not have the resources for a remote data center
> > (yet), but for now I'm setting up a remote KVM server in a portion of
> > campus that will not be affected.  Currently all VMs are QCOW2 images
> > that live on a 1.2TB logical volume.  This seems like the perfect
> > situation to use DRBD in active-passive.  However I'm not currently
> > trying to prepare for hardware failure but network failure.  Is it
> > possible to do this, to have the two LVMs synced with DRBD and then have
> > a resource manager (Pacemaker) detect that the two can't communicate
> > (network failure) and then activate the passive node?  I want to avoid
> > as much complexity as possible, so no live migration or dual primary if
> > possible.
> >
> > This is really a temporary solution for 1 week for all
> > my organization's web servers, but it has made our executives aware of
> > our need for more funds and a remote datacenter.  So hopefully this
> > won't be the last time I get to use DRBD, but I have to do this in
> > the next 2 days.
> >
> > Any help is greatly appreciated.
>

