[DRBD-user] DRBD or not DRBD ?

Digimer linux at alteeve.com
Mon Apr 25 04:21:44 CEST 2011



On 04/24/2011 09:57 PM, Whit Blauvelt wrote:
> Digimer,
> 
>  I really thank you for your long-form discussion. So much of the writing on
> this stuff is terse, making for a steep learning curve.
> 
>> You should be using Clustered LVM (clvmd). This way the LVM PV/VG/LVs
>> are in sync across both nodes at all times.
> 
> I'm not yet convinced why I should use clvmd. I'm not afraid of creating
> matching PV/VG/LVs by hand. It's easy to get those to match, and nothing
> that's run post setup is altering their scheme. On the KISS principle, since
> I'm capable enough of being stupid, I stick with the tools I know - in this
> case plain LVM - unless the argument for introducing something new is
> decisive. I've read some past discussion here about clvmd being required or
> not, and it seemed to lean against the requirement. With each VM being on a
> DRBD pair of two dedicated LV's (just for the one VM), I just don't see what
> can get confused on this level. Am I missing something?

Matching LVs are not the same LV. The LV holding your VM is a single item,
and having it treated as such, which clvmd gives you, ensures that it
can't be activated on both nodes at the same time.
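For reference, switching plain LVM over to clustered locking is a small
change; the key bit is the locking type in lvm.conf (a sketch, assuming
the cluster stack and clvmd are already up on both nodes):

```
# /etc/lvm/lvm.conf (excerpt)
global {
    # 1 = local file-based locking (the default)
    # 3 = clustered locking via clvmd
    locking_type = 3

    # Don't silently fall back to local locking if clvmd is unreachable;
    # better to fail loudly than to risk inconsistent metadata.
    fallback_to_local_locking = 0
}
```

After that, existing volume groups can be flagged as clustered with
'vgchange -cy <vgname>', and new VGs created while clvmd is running are
clustered by default.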

>> Running the same VM on either host is suicidal, just don't, ever. To
>> help prevent this, using 'resource-and-stonith' and use a script that
>> fires a fence device when a split-brain occurs, then recover the lost
>> VMs on the surviving node. Further, your cluster resource manager
>> (rgmanager or pacemaker) should themselves require a successful fence
>> before beginning resource recovery.
> 
> Yeah, I definitely have to either get a better hold on the logic of
> pacemaker, or write my own scripts for this stuff. These servers have IPMI.
> It would be simple in a bad state to be sure the replication link is
> dropped. Since the IPMI is on the LAN side, if one server loses sight of the
> other on both replication and LAN links, then it should be safe to send the
> other a shutdown message over IPMI given that the other, no longer being on
> the LAN, shouldn't be able to send the same message back at it at the same
> time. I think.

RHCS's rgmanager is much simpler than Pacemaker, and it's well tested and
already exists. Writing your own scripts is, I'd argue, a fool's errand. :)

As for fencing: it's always ideal to have two fence devices on separate
interfaces and switches, otherwise you're back to a single point of
failure. That said, if you lose a switch and all network traffic stops,
you're not going to get much use out of your VMs anyway.
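With IPMI on each node, the RHCS side of that looks roughly like the
following in cluster.conf (a sketch only; node names, the IP address,
and credentials are placeholders, shown for one node):

```
<!-- excerpt from /etc/cluster/cluster.conf; names/addresses are examples -->
<clusternodes>
  <clusternode name="node1" nodeid="1">
    <fence>
      <method name="ipmi">
        <device name="ipmi_node1"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice name="ipmi_node1" agent="fence_ipmilan"
               ipaddr="192.168.1.101" login="admin" passwd="secret"/>
</fencedevices>
```

When fenced decides a node must go, it walks the methods in order and
calls the named agent; fence_ipmilan defaults to a power-cycle, which
is the "restart" behaviour discussed below.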

> Then the only other logic needed, aside from firing appropriate notices to
> staff, is to start the list of VMs normally run on the down host. Am I
> making a beginner's mistake to think this can be kept so simple: If both
> links test dead for the other system, shut it down by IPMI, start up the VMs
> it was responsible for running, send notices, and we're done. Now, it would
> be good on restarting the other machine to have it recognize it shouldn't
> fire up all its usual VMs, so there's more logic needed to be ready for that
> event. But the initial failover looks simple. Pacemaker looks overly complex
> and opaque - or more likely I don't understand yet how simple it would be to
> set it up for this, as I'm getting lost among all its other options. It's
> not much to script from scratch though, if it's as simple as it looks in my
> sketch.

I must admit, you lost me somewhat in your reference to emailing people. :)

The VMs lost when a node dies can be started manually on the survivor,
if that's what you wish. You still need the cluster for DLM and fencing,
but you can forgo the resource manager. However, I think you'd be missing
out on the major benefit of clustering in that case. Just the same,
having the VM data replicated would still reduce your MTTR.

>> Fencing (stonith) generally defaults to "restart". This way, with a
>> proper setup, the lost node will hopefully reboot in a healthy state,
>> connect to the DRBD resources and resync, rejoin the cluster and, if you
>> configure it to do so, relocate the VMs back to their original host.
>> Personally though, I disable automatic fail-back so that I can determine
>> the fault before putting the VMs back.
> 
> Hmm, restart rather than shut down. I take it there's a standard way to have
> that come back up without doing its normal start of its VMs, but instead to
> initialize a live migration of them back, just if the system comes up well?

If the node successfully rejoins the cluster and resyncs the DRBD
resources, then you can have it live-migrate the VMs back automatically
if you wish. However, as I mentioned, I recommend leaving the VMs on the
surviving node and manually live-migrate them back once you've sorted
out what went wrong in the first place. This behaviour is configurable
in your resource manager of choice.
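In rgmanager, for example, that policy lives in cluster.conf: a failover
domain with nofailback set keeps the VM on the survivor until you move it
back by hand (a sketch; domain, node, and VM names are placeholders):

```
<rm>
  <failoverdomains>
    <!-- nofailback="1": when the preferred node returns, leave the
         service where it is rather than relocating it automatically -->
    <failoverdomain name="prefer_node1" nofailback="1" ordered="1"
                    restricted="0">
      <failoverdomainnode name="node1" priority="1"/>
      <failoverdomainnode name="node2" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <vm name="foo" domain="prefer_node1" recovery="restart" autostart="1"/>
</rm>
```

Flip nofailback to "0" and the VM would migrate back on its own once the
node rejoins; I prefer leaving it at "1" and moving it manually.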

>> Regardless, properly configured cluster resource manager should prevent
>> the same VM running twice.
> 
> ...
> 
>> That said, a properly configured resource manager can be told that
>> service X (ie: a VM), is only allowed to run on one node at a time.
>> Then, should a user try to start it a second time, it will be denied by
>> the resource manager.
> 
> Ah, but is there a way to surrender control on this level to pacemaker and
> still do on-the-fly live migrations from virsh and virt-manager, or for that
> matter on-the-fly startups and shutdowns of VMs, without pacemaker causing
> the second host to react in any way? Having the flexibility to do ad hoc
> operations by hand is important to me - just a shade less important than
> having dependable failover and usable manual failback, nonetheless a high
> priority.

I can't speak to pacemaker personally, but I certainly expect there is.
In rgmanager, you control the VMs directly using:

Start (enable):
clusvcadm -e vm:foo -m node1

Stop (disable):
clusvcadm -d vm:foo

Live migrate:
clusvcadm -M vm:foo -m node2

Note: clusvcadm == Cluster Service Administrator

>> I have *mostly* finished a tutorial using Xen that otherwise does
>> exactly what you've described here in much more detail. It's not
>> perfect, and I'm still working on the failure-mode testing section, but
>> I think it's far enough along to not be useless.
>>
>> http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial
>>
>> Even if you don't follow it, a lot of the discussion around reasonings
>> and precautions should port well to what you want to do.
> 
> Thanks much. I will read that. 
> 
> Best,
> Whit

Hope it helps. :)

-- 
Digimer
E-Mail: digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org
