Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
Digimer, I really thank you for your long-form discussion. So much of
the writing on this stuff is terse, making for a steep learning curve.

> You should be using Clustered LVM (clvmd). This way the LVM PV/VG/LVs
> are in sync across both nodes at all times.

I'm not yet convinced why I should use clvmd. I'm not afraid of
creating matching PV/VG/LVs by hand. It's easy to get those to match,
and nothing that runs post-setup alters their scheme. On the KISS
principle, since I'm capable enough of being stupid, I stick with the
tools I know - in this case plain LVM - unless the argument for
introducing something new is decisive. I've read some past discussion
here about whether clvmd is required, and it seemed to lean against the
requirement. With each VM being on a DRBD pair of two dedicated LVs
(just for the one VM), I just don't see what can get confused at this
level. Am I missing something?

> Running the same VM on either host is suicidal, just don't, ever. To
> help prevent this, using 'resource-and-stonith' and use a script that
> fires a fence device when a split-brain occurs, then recover the lost
> VMs on the surviving node. Further, your cluster resource manager
> (rgmanager or pacemaker) should themselves require a successful fence
> before beginning resource recovery.

Yeah, I definitely have to either get a better hold on the logic of
pacemaker, or write my own scripts for this stuff. These servers have
IPMI. In a bad state, it would be simple to confirm that the
replication link is down. Since the IPMI is on the LAN side, if one
server loses sight of the other on both the replication and LAN links,
then it should be safe to send the other a shutdown message over IPMI,
given that the other, no longer being on the LAN, shouldn't be able to
send the same message back at it at the same time. I think. Then the
only other logic needed, aside from firing appropriate notices to
staff, is to start the list of VMs normally run on the down host.
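To make my sketch concrete, the decision I have in mind looks roughly
like the following. Everything here is hypothetical - the helper names
(fence_peer, start_vm), the addresses, and the use of ping are my
assumptions, and a real script would drive the actual BMC with
ipmitool or similar:

```python
import subprocess

# Hypothetical sketch of the failover decision described above.
# The link probes and the fencing helpers are assumptions; a real
# script would check the actual interfaces and call ipmitool (or
# similar) against the peer's real BMC address and credentials.

def link_alive(peer_addr):
    """Ping the peer once over a given link; True if it answers."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", peer_addr],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def should_fence(replication_up, lan_up):
    """Only fence when the peer is unreachable on BOTH links.
    If either link still works, the peer may be healthy - or it
    could send the same shutdown back over the shared LAN."""
    return not replication_up and not lan_up

def failover(peer_repl_addr, peer_lan_addr, peer_ipmi_addr, vms):
    if not should_fence(link_alive(peer_repl_addr),
                        link_alive(peer_lan_addr)):
        return []  # peer still visible somewhere; do nothing
    # Fence first, then recover: power the peer off over IPMI,
    # and only then start its VMs here and notify staff.
    fence_peer(peer_ipmi_addr)           # hypothetical helper
    return [start_vm(vm) for vm in vms]  # hypothetical helper
```

The important ordering is the same one Digimer describes: a successful
fence must come before any VM is started on the survivor.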
Am I making a beginner's mistake to think this can be kept so simple:
if both links test dead for the other system, shut it down by IPMI,
start up the VMs it was responsible for running, send notices, and
we're done? Now, it would be good on restarting the other machine to
have it recognize it shouldn't fire up all its usual VMs, so there's
more logic needed to be ready for that event. But the initial failover
looks simple. Pacemaker looks overly complex and opaque - or, more
likely, I don't yet understand how simple it would be to set it up for
this, as I'm getting lost among all its other options. It's not much
to script from scratch, though, if it's as simple as it looks in my
sketch.

> Fencing (stonith) generally defaults to "restart". This way, with a
> proper setup, the lost node will hopefully reboot in a healthy state,
> connect to the DRBD resources and resync, rejoin the cluster and, if you
> configure it to do so, relocate the VMs back to their original host.
> Personally though, I disable automatic fail-back so that I can determine
> the fault before putting the VMs back.

Hmm, restart rather than shut down. I take it there's a standard way
to have that come back up without doing its normal start of its VMs,
but instead to initiate a live migration of them back, if the system
comes up well?

> Regardless, properly configured cluster resource manager should prevent
> the same VM running twice.

...

> That said, a properly configured resource manager can be told that
> service X (ie: a VM), is only allowed to run on one node at a time.
> Then, should a user try to start it a second time, it will be denied by
> the resource manager.

Ah, but is there a way to surrender control at this level to pacemaker
and still do on-the-fly live migrations from virsh and virt-manager -
or, for that matter, on-the-fly startups and shutdowns of VMs -
without pacemaker causing the second host to react in any way?
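The "only one node at a time" rule can be pictured in miniature like
this. This is purely illustrative - pacemaker enforces it through its
cluster-wide configuration and policy engine, not through anything
resembling this code:

```python
# Toy model of the single-instance rule described above: a resource
# manager that records where each VM runs and refuses a second start
# elsewhere, while an orderly live migration is still permitted.

class ResourceManager:
    def __init__(self):
        self.location = {}  # VM name -> node currently running it

    def start(self, vm, node):
        holder = self.location.get(vm)
        if holder is not None and holder != node:
            # A second concurrent start is denied outright.
            raise RuntimeError(f"{vm} already running on {holder}")
        self.location[vm] = node

    def stop(self, vm):
        self.location.pop(vm, None)

    def live_migrate(self, vm, dest):
        # A migration just moves the recorded location, so it is
        # allowed where a duplicate start is not.
        self.location[vm] = dest
```

This is also the distinction my question turns on: whether the real
resource manager can treat a virsh-initiated migration as a move, as
above, rather than as a forbidden second start.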
Having the flexibility to do ad hoc operations by hand is important to
me - just a shade less important than having dependable failover and
usable manual failback, but nonetheless a high priority.

> I have *mostly* finished a tutorial using Xen that otherwise does
> exactly what you've described here in much more detail. It's not
> perfect, and I'm still working on the failure-mode testing section, but
> I think it's far enough along to not be useless.
>
> http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial
>
> Even if you don't follow it, a lot of the discussion around reasonings
> and precautions should port well to what you want to do.

Thanks much. I will read that.

Best,
Whit