[DRBD-user] DRBD or not DRBD ?

Sun Apr 24 22:05:46 CEST 2011

Comments in-line.

On 04/24/2011 11:34 AM, Whit Blauvelt wrote:
> Digimer,
> 
> All useful stuff. Thanks. I hadn't considered three rather than two
> networks. That's a good case for it.
> 
> Here's what I'm trying to scope out, and from your comments it looks to be
> territory you're well familiar with. I've got two systems set up with KVM
> VMs, where each VM is on its own LVM, currently each with primary-secondary
> DRBD, where the primary roles are balanced across the two machines. As far
> as I can tell, and from past comments here, it's necessary to go
> primary-primary to enable KVM live migration, which is a very nice feature
> to have. None of the VMs in this case face critical issues with disk
> performance, so primary-primary slowing that, if it does in this context,
> isn't a problem.

You do need Primary/Primary for live migration. Another trick I like is
to spread the VMs out between the nodes. To this end, I create two DRBD
resources: one to host the VMs that normally run on the first node, and
the other for the VMs that normally run on the second node. This way, if
a split-brain happens during normal operation, with VMs running on both
sides at once, you can invalidate each resource's remote side
independently.
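
As a rough illustration of recovering each resource independently, here
is a minimal Python sketch that only builds the list of commands each
node would run; it executes nothing. The resource names (r0, r1) and
host names (node1, node2) are made up for the example, and I'm assuming
DRBD 8.3-style manual split-brain recovery with `--discard-my-data`
rather than any particular "invalidate" command, so check the syntax
against your DRBD version before using it.

```python
NODES = ("node1", "node2")  # hypothetical host names


def split_brain_recovery_plan(keep, nodes=NODES):
    """Return {node: [drbdadm commands]} for a per-resource recovery.

    `keep` maps a DRBD resource name to the node whose copy is kept.
    The other node discards its changes for that resource only, so the
    two resources can be resolved in opposite directions.
    """
    plan = {node: [] for node in nodes}
    for resource, survivor in sorted(keep.items()):
        if survivor not in nodes:
            raise ValueError(f"unknown node {survivor!r} for {resource!r}")
        for node in nodes:
            if node == survivor:
                continue
            # On the node that loses this resource: stop any VMs using it
            # first (not shown), demote it, then reconnect while
            # discarding the local changes.
            plan[node].append(f"drbdadm secondary {resource}")
            plan[node].append(f"drbdadm -- --discard-my-data connect {resource}")
        # On the node whose data is kept: reconnect so the peer can resync.
        plan[survivor].append(f"drbdadm connect {resource}")
    return plan


if __name__ == "__main__":
    # r0 holds node1's VMs, r1 holds node2's VMs (assumed layout).
    plan = split_brain_recovery_plan({"r0": "node1", "r1": "node2"})
    for node, commands in plan.items():
        print(node)
        for command in commands:
            print("   ", command)
```

With a single large resource you would have to pick one node as the
loser for every VM; with two, each node keeps the data for the VMs it
was actually running.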

> Since each VM is in raw format, directly on top of DRBD, on top of its
> dedicated LVM, there is no normal running condition where locking should be
> an issue. That is, there's no time, when the systems are both running well,
> when both copies of a VM will be live - aside from during migration, where
> libvirt handles that well.

You should be using Clustered LVM (clvmd). This way the LVM PVs, VGs and
LVs are in sync across both nodes at all times. Clustered LVM requires
distributed locking (DLM) though, so there you have it. :)

The idea that VMs will only run on one node at a time is not enough.
It's true that only one physical host will write to a given VM's LV.
However, Clustered LVM has no idea what is running on top of it, and it
simply requires DLM.

> It's the abnormal conditions that require planning. In basic primary-primary
> it's possible to end up with the same VM on each host running based on the
> same storage at the same time. When that happens, even cluster locking won't
> necessarily prevent corruption, since the two instances can be doing
> inconsistent stuff in different areas of the storage, in ways that locks at
> the file system level can't prevent. 

Running the same VM on both hosts at once is suicidal; just don't, ever.
To help prevent this, use 'resource-and-stonith' along with a script that
fires a fence device when a split-brain occurs, then recover the lost
VMs on the surviving node. Further, your cluster resource manager
(rgmanager or pacemaker) should itself require a successful fence
before beginning resource recovery.

Shared storage without fencing is a terribly bad idea. Almost all
servers have IPMI (or OEM versions like DRAC, iLO, etc.), so there should
be no reason not to use it for fencing. Even if yours don't, a
network-switched PDU from companies like APC can be had for ~$500, which
will be a small percentage of your cluster costs.

With fencing (aka "stonith"), you should never be in a split-brain
condition and thus never be able to start the same VM twice.
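
The "fence first, recover second" rule can be sketched in a few lines of
Python. This is a toy model, not how rgmanager or pacemaker are
implemented: `fence` and `start_vm` are hypothetical callables that you
would supply (for example, a wrapper around your IPMI fence agent and a
wrapper around your VM start command).

```python
def recover_lost_node(lost_node, vms, fence, start_vm):
    """Start the lost node's VMs only after a confirmed fence.

    fence(node)   -> True only if the node is confirmed powered off or
                     rebooted (hypothetical callable).
    start_vm(vm)  -> starts the VM on the local node (hypothetical).
    Returns the list of VMs that were started.
    """
    if not fence(lost_node):
        # No confirmation means the peer may still be running these VMs
        # against its own copy of the storage, so recover nothing.
        return []
    started = []
    for vm in vms:
        start_vm(vm)
        started.append(vm)
    return started


if __name__ == "__main__":
    log = []
    # Failed fence: nothing is started.
    print(recover_lost_node("node2", ["vm3", "vm4"], lambda n: False, log.append))
    # Successful fence: the lost node's VMs are recovered locally.
    print(recover_lost_node("node2", ["vm3", "vm4"], lambda n: True, log.append))
```

The important part is the early return: a fence that fails or times out
is treated the same as "the other node is still alive".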

> There are two basic contexts where both copies of a VM could be actively
> running at once like that. One is in a state of failover. In a way failover
> initiation should be simpler here than that between non-VM systems. No
> applications per se need to be started when one system goes down. It's just
> that the VMs that were primary on it need to be started on the survivor. At
> the same time, some variation of stonith needs to be aimed at the down
> system to be sure it doesn't recover and create dueling VMs. Any hints at
> what the most effective way of accomplishing that (probably using IPMI in my
> case) will be welcomed.

Fencing (stonith) generally defaults to "restart". This way, with a
proper setup, the lost node will hopefully reboot in a healthy state,
connect to the DRBD resources and resync, rejoin the cluster and, if you
configure it to do so, relocate the VMs back to their original host.
Personally though, I disable automatic fail-back so that I can determine
the fault before putting the VMs back.

Regardless, a properly configured cluster resource manager should
prevent the same VM from running twice.

> The other way to get things in a bad state, if it's a primary-primary setup
> for each VM, is operator error. I can't see any obvious way to block this,
> other than running primary-secondary instead, and sacrificing the live
> migration capacity. It doesn't look like libvirt, virsh and virt-manager
> have any way to test whether a VM is already running on the other half of a
> two-system mirror, so they might decline to start it when that's the case.

There is no antidote to user error. Simply put, the only safe way to
mitigate it is to have a second cluster where changes are tested before
they are executed on the live cluster.

That said, a properly configured resource manager can be told that
service X (i.e. a VM) is only allowed to run on one node at a time.
Then, should a user try to start it a second time, the resource manager
will deny the request.

So build the cluster, make the VMs services, and then always manage the
VMs through the resource manager. If you stick to this, you should be
about as safe from user error as you can get.
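
To make the "one node at a time" rule concrete, here is a small Python
model of the bookkeeping involved. It is only a sketch of the idea; the
class and method names are invented for this example and are not the
rgmanager or pacemaker interface.

```python
class SingleInstanceManager:
    """Toy resource manager: each service may run on one node only."""

    def __init__(self):
        self._running = {}  # service name -> node it is running on

    def start(self, service, node):
        owner = self._running.get(service)
        if owner is not None:
            # A second start is refused, whichever node asks for it.
            raise RuntimeError(f"{service} is already running on {owner}")
        self._running[service] = node

    def stop(self, service):
        self._running.pop(service, None)

    def relocate(self, service, target):
        # Models a migration: ownership moves, it is never duplicated.
        if service not in self._running:
            raise RuntimeError(f"{service} is not running")
        self._running[service] = target

    def where(self, service):
        return self._running.get(service)


if __name__ == "__main__":
    manager = SingleInstanceManager()
    manager.start("vm1", "node1")
    try:
        manager.start("vm1", "node2")  # operator error
    except RuntimeError as error:
        print("denied:", error)
    manager.relocate("vm1", "node2")
    print("vm1 now on", manager.where("vm1"))
```

This only protects you if every start, stop and migration goes through
the manager; a VM started by hand behind its back is invisible to it.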

> Maybe I'm missing something obvious? Is there, for instance, a way to run
> primary-secondary just up to when a live migration's desired, and go
> primary-primary in DRBD for just long enough to migrate? 
> 
> Thanks,
> Whit

I have *mostly* finished a tutorial using Xen that otherwise does
exactly what you've described here in much more detail. It's not
perfect, and I'm still working on the failure-mode testing section, but
I think it's far enough along to not be useless.

http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial

Even if you don't follow it, a lot of the discussion of the reasoning
and precautions should port well to what you want to do.

-- 
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org