Note: "permalinks" may not be as permanent as we would like,
direct links to old sources may well be a few messages off.

On 04/24/2011 09:57 PM, Whit Blauvelt wrote:
> Digimer,
>
> I really thank you for your long-form discussion. So much of the writing on
> this stuff is terse, making for a steep learning curve.
>
>> You should be using Clustered LVM (clvmd). This way the LVM PV/VG/LVs
>> are in sync across both nodes at all times.
>
> I'm not yet convinced why I should use clvmd. I'm not afraid of creating
> matching PV/VG/LVs by hand. It's easy to get those to match, and nothing
> that's run post-setup is altering their scheme. On the KISS principle, since
> I'm capable enough of being stupid, I stick with the tools I know - in this
> case plain LVM - unless the argument for introducing something new is
> decisive. I've read some past discussion here about clvmd being required or
> not, and it seemed to lean against the requirement. With each VM being on a
> DRBD pair of two dedicated LVs (just for the one VM), I just don't see what
> can get confused on this level. Am I missing something?

Matching LVs are not the same LVs. The LV holding your VM is a single
entity, and having it treated as such, which is what clvmd gives you,
ensures that it can't be activated on both nodes at the same time.
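For what it's worth, the switch is small if you already have the cluster
running (clvmd needs the DLM). A rough sketch, untested and with example
names:

  # /etc/lvm/lvm.conf - switch LVM from local file locking to
  # cluster-wide DLM locking via clvmd:
  locking_type = 3

  # On both nodes, with cman already running:
  /etc/init.d/clvmd start

  # Flag the VG that holds your VM LVs as clustered
  # ('vm_vg0' is an example name):
  vgchange -cy vm_vg0

From then on, LV activation is coordinated cluster-wide, and you can
activate a given LV exclusively (lvchange -aey) so that it simply can't be
made active on both nodes at once.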
>> Running the same VM on either host is suicidal, just don't, ever. To
>> help prevent this, use 'resource-and-stonith' and a script that fires
>> a fence device when a split-brain occurs, then recover the lost VMs on
>> the surviving node. Further, your cluster resource manager (rgmanager
>> or pacemaker) should itself require a successful fence before
>> beginning resource recovery.
>
> Yeah, I definitely have to either get a better hold on the logic of
> pacemaker, or write my own scripts for this stuff. These servers have IPMI.
> It would be simple in a bad state to be sure the replication link is
> dropped. Since the IPMI is on the LAN side, if one server loses sight of the
> other on both replication and LAN links, then it should be safe to send the
> other a shutdown message over IPMI, given that the other, no longer being on
> the LAN, shouldn't be able to send the same message back at it at the same
> time. I think.

RHCS's rgmanager is much simpler than Pacemaker, is well tested, and already
exists. Writing your own scripts is, I'd argue, a fool's errand. :)

As for fencing: it's always ideal to have two fence devices on separate
interfaces and switches, otherwise you're back to a single point of failure
again. If you lose a switch, though, and all network traffic is stopped,
you're not going to make much use of your VMs anyway.
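In DRBD terms, the 'resource-and-stonith' advice boils down to a couple of
lines of config. Roughly, and treat this as a sketch rather than tested
config (the handler script is an example; use whatever matches your fence
setup):

  # drbd.conf fragment, per-resource or in the common section:
  disk {
          # When the peer is lost, suspend I/O and call the fence-peer
          # handler; I/O resumes once the handler confirms the peer was
          # successfully fenced.
          fencing resource-and-stonith;
  }
  handlers {
          # Example script name only; it should ask the cluster to fence
          # the peer and report success or failure back to DRBD.
          fence-peer "/sbin/obliterate-peer.sh";
  }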
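On the cluster side, IPMI fencing is just a device entry per node in
cluster.conf. Something like the following, where the node name, IP address
and credentials are placeholders to adapt:

  <clusternode name="node1.example.com" nodeid="1">
          <fence>
                  <method name="ipmi">
                          <device name="ipmi_node1" action="reboot"/>
                  </method>
          </fence>
  </clusternode>
  ...
  <fencedevices>
          <fencedevice agent="fence_ipmilan" name="ipmi_node1"
                       ipaddr="10.0.0.1" login="admin" passwd="secret"/>
  </fencedevices>

As I said though, IPMI alone leaves you with a single fence path; a second
device on a separate interface and switch (a switched PDU, for example) is
what removes that single point of failure.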
> Then the only other logic needed, aside from firing appropriate notices to
> staff, is to start the list of VMs normally run on the down host. Am I
> making a beginner's mistake to think this can be kept so simple: If both
> links test dead for the other system, shut it down by IPMI, start up the VMs
> it was responsible for running, send notices, and we're done. Now, it would
> be good on restarting the other machine to have it recognize it shouldn't
> fire up all its usual VMs, so there's more logic needed to be ready for that
> event. But the initial failover looks simple. Pacemaker looks overly complex
> and opaque - or more likely I don't understand yet how simple it would be to
> set it up for this, as I'm getting lost among all its other options. It's
> not much to script from scratch though, if it's as simple as it looks in my
> sketch.

I must admit, you lost me somewhat in your reference to emailing people. :)

The VMs that are lost when a node dies can be started manually on the
survivor, if that is what you wish. You still need the cluster for DLM and
fencing, but you can forgo the resource manager. However, I think you'd be
missing out on the major benefit of clustering in that case. Just the same
though, having the VM data replicated would still reduce your MTTR.

>> Fencing (stonith) generally defaults to "restart". This way, with a
>> proper setup, the lost node will hopefully reboot in a healthy state,
>> connect to the DRBD resources and resync, rejoin the cluster and, if you
>> configure it to do so, relocate the VMs back to their original host.
>> Personally though, I disable automatic fail-back so that I can determine
>> the fault before putting the VMs back.
>
> Hmm, restart rather than shut down. I take it there's a standard way to have
> that come back up without doing its normal start of its VMs, but instead to
> initialize a live migration of them back, just if the system comes up well?

If the node successfully rejoins the cluster and resyncs the DRBD resources,
then you can have it live-migrate the VMs back automatically if you wish.
However, as I mentioned, I recommend leaving the VMs on the surviving node
and manually live-migrating them back once you've sorted out what went wrong
in the first place. This behaviour is configurable in your resource manager
of choice.

>> Regardless, a properly configured cluster resource manager should
>> prevent the same VM running twice.
>
> ...
>
>> That said, a properly configured resource manager can be told that
>> service X (ie: a VM) is only allowed to run on one node at a time.
>> Then, should a user try to start it a second time, it will be denied
>> by the resource manager.
>
> Ah, but is there a way to surrender control on this level to pacemaker and
> still do on-the-fly live migrations from virsh and virt-manager, or for that
> matter on-the-fly startups and shutdowns of VMs, without pacemaker causing
> the second host to react in any way? Having the flexibility to do ad hoc
> operations by hand is important to me - just a shade less important than
> having dependable failover and usable manual failback, nonetheless a high
> priority.

I can't speak to Pacemaker personally, but I certainly expect there is. In
rgmanager, you control the VMs directly using:

  Start (enable):  clusvcadm -e vm:foo -m node1
  Stop (disable):  clusvcadm -d vm:foo
  Live migrate:    clusvcadm -M vm:foo -m node2

Note: clusvcadm == Cluster Service Administrator

>> I have *mostly* finished a tutorial using Xen that otherwise does
>> exactly what you've described here in much more detail. It's not
>> perfect, and I'm still working on the failure-mode testing section, but
>> I think it's far enough along to not be useless.
>>
>> http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial
>>
>> Even if you don't follow it, a lot of the discussion around reasonings
>> and precautions should port well to what you want to do.
>
> Thanks much. I will read that.
>
> Best,
> Whit

Hope it helps. :)

--
Digimer
E-Mail:         digimer at alteeve.com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org