Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.
Digimer, I really thank you for your long-form discussion. So much of
the writing on this stuff is terse, making for a steep learning curve.

> You should be using Clustered LVM (clvmd). This way the LVM PV/VG/LVs
> are in sync across both nodes at all times.

I'm not yet convinced why I should use clvmd. I'm not afraid of
creating matching PV/VG/LVs by hand. It's easy to get those to match,
and nothing that runs post-setup alters their scheme. On the KISS
principle, since I'm capable enough of being stupid, I stick with the
tools I know - in this case plain LVM - unless the argument for
introducing something new is decisive. I've read some past discussion
here about whether clvmd is required, and it seemed to lean against the
requirement. With each VM being on a DRBD pair of two dedicated LVs
(just for the one VM), I just don't see what can get confused at this
level. Am I missing something?

> Running the same VM on either host is suicidal, just don't, ever. To
> help prevent this, using 'resource-and-stonith' and use a script that
> fires a fence device when a split-brain occurs, then recover the lost
> VMs on the surviving node. Further, your cluster resource manager
> (rgmanager or pacemaker) should themselves require a successful fence
> before beginning resource recovery.

Yeah, I definitely have to either get a better hold on the logic of
pacemaker, or write my own scripts for this stuff. These servers have
IPMI. In a bad state, it would be simple to confirm that the
replication link is down. Since the IPMI is on the LAN side, if one
server loses sight of the other on both the replication and LAN links,
then it should be safe to send the other a shutdown message over IPMI,
given that the other, no longer being on the LAN, shouldn't be able to
send the same message back at it at the same time. I think. Then the
only other logic needed, aside from firing appropriate notices to
staff, is to start the list of VMs normally run on the down host.
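To make my sketch concrete, the decision I have in mind looks roughly
like the following. Everything here is hypothetical - the helper names
(fence_peer, start_vm), the addresses, and the use of ping are my
assumptions, and a real script would drive the actual BMC with
ipmitool or similar:

```python
import subprocess

# Hypothetical sketch of the failover decision described above.
# The link probes and the fencing helpers are assumptions; a real
# script would check the actual interfaces and call ipmitool (or
# similar) against the peer's real BMC address and credentials.

def link_alive(peer_addr):
    """Ping the peer once over a given link; True if it answers."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", peer_addr],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def should_fence(replication_up, lan_up):
    """Only fence when the peer is unreachable on BOTH links.
    If either link still works, the peer may be healthy - or it
    could send the same shutdown back over the shared LAN."""
    return not replication_up and not lan_up

def failover(peer_repl_addr, peer_lan_addr, peer_ipmi_addr, vms):
    if not should_fence(link_alive(peer_repl_addr),
                        link_alive(peer_lan_addr)):
        return []  # peer still visible somewhere; do nothing
    # Fence first, then recover: power the peer off over IPMI,
    # and only then start its VMs here and notify staff.
    fence_peer(peer_ipmi_addr)           # hypothetical helper
    return [start_vm(vm) for vm in vms]  # hypothetical helper
```

The important ordering is the same one Digimer describes: a successful
fence must come before any VM is started on the survivor.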
Am I making a beginner's mistake to think this can be kept so simple:
if both links test dead for the other system, shut it down by IPMI,
start up the VMs it was responsible for running, send notices, and
we're done? Now, it would be good on restarting the other machine to
have it recognize it shouldn't fire up all its usual VMs, so there's
more logic needed to be ready for that event. But the initial failover
looks simple. Pacemaker looks overly complex and opaque - or, more
likely, I don't yet understand how simple it would be to set it up for
this, as I'm getting lost among all its other options. It's not much
to script from scratch, though, if it's as simple as it looks in my
sketch.

> Fencing (stonith) generally defaults to "restart". This way, with a
> proper setup, the lost node will hopefully reboot in a healthy state,
> connect to the DRBD resources and resync, rejoin the cluster and, if you
> configure it to do so, relocate the VMs back to their original host.
> Personally though, I disable automatic fail-back so that I can determine
> the fault before putting the VMs back.

Hmm, restart rather than shut down. I take it there's a standard way
to have that come back up without doing its normal start of its VMs,
but instead to initiate a live migration of them back, if the system
comes up well?

> Regardless, properly configured cluster resource manager should prevent
> the same VM running twice.

...

> That said, a properly configured resource manager can be told that
> service X (ie: a VM), is only allowed to run on one node at a time.
> Then, should a user try to start it a second time, it will be denied by
> the resource manager.

Ah, but is there a way to surrender control at this level to pacemaker
and still do on-the-fly live migrations from virsh and virt-manager -
or, for that matter, on-the-fly startups and shutdowns of VMs -
without pacemaker causing the second host to react in any way?
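The "only one node at a time" rule can be pictured in miniature like
this. This is purely illustrative - pacemaker enforces it through its
cluster-wide configuration and policy engine, not through anything
resembling this code:

```python
# Toy model of the single-instance rule described above: a resource
# manager that records where each VM runs and refuses a second start
# elsewhere, while an orderly live migration is still permitted.

class ResourceManager:
    def __init__(self):
        self.location = {}  # VM name -> node currently running it

    def start(self, vm, node):
        holder = self.location.get(vm)
        if holder is not None and holder != node:
            # A second concurrent start is denied outright.
            raise RuntimeError(f"{vm} already running on {holder}")
        self.location[vm] = node

    def stop(self, vm):
        self.location.pop(vm, None)

    def live_migrate(self, vm, dest):
        # A migration just moves the recorded location, so it is
        # allowed where a duplicate start is not.
        self.location[vm] = dest
```

This is also the distinction my question turns on: whether the real
resource manager can treat a virsh-initiated migration as a move, as
above, rather than as a forbidden second start.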
Having the flexibility to do ad hoc operations by hand is important to
me - just a shade less important than having dependable failover and
usable manual failback, but nonetheless a high priority.

> I have *mostly* finished a tutorial using Xen that otherwise does
> exactly what you've described here in much more detail. It's not
> perfect, and I'm still working on the failure-mode testing section, but
> I think it's far enough along to not be useless.
>
> http://wiki.alteeve.com/index.php/Red_Hat_Cluster_Service_2_Tutorial
>
> Even if you don't follow it, a lot of the discussion around reasonings
> and precautions should port well to what you want to do.

Thanks much. I will read that.

Best,
Whit