[DRBD-user] Parallel resource startup, scalability questions

Wed Jul 3 07:31:40 CEST 2013

On Tue, 2 Jul 2013 22:15:06 +0200 Arnold Krille wrote:

Hello,

> Hi,
> 
> On Tue, 2 Jul 2013 17:08:30 +0900 Christian Balzer <chibi at gol.com>
> wrote:
> > not purely a DRBD issue, but it sort of touches most of the bases, so
> > here goes.
> > 
> > I'm looking at deploying a small (as things go these days) cluster (2
> > machines) for about 30 KVM guests, with DRBD providing the storage.
> > 
> > My main concern here is how long it would take to fail-over (restart)
> > all the guests if a node goes down. From what I gathered none of the
> > things listed below do anything in terms of parallelism when it comes
> > to starting up resources, even if the HW (I/O system) could handle it.
> <snip>
> > Lastly I could go with Pacemaker, as I've done in the past for much
> > simpler clusters, but I really wonder how long starting up those
> > resources will take. If I forgo live-migration I guess I could just
> > do one DRBD backing resource for all the LVMs. But still, firing up
> > 30 guests in sequence will take considerable time, likely more than I
> > would consider really a "HA failover" level of quality.
> 
> Why do the vms have to start in sequence?
> 
They don't have to at all; it just seemed to me (again, from experience
with clusters that have very few resources) that they would.

> Pacemaker happily starts several services in parallel provided they
> don't depend on each other. And you have to define these dependencies
> as orders/groups yourself. Otherwise pacemaker assumes that services
> are to be started in parallel. (At least that's what I see here when
> booting my 2+1 node cluster from cold.)
> 
Thanks for that valuable input. ^.^
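
To make the point above concrete for myself, here is a minimal sketch of
"start everything in parallel unless an ordering constraint says
otherwise". It is plain Python, not Pacemaker configuration, and it is not
how Pacemaker schedules internally; the resource names, the `order_after`
mapping and the `start_one` callback are all hypothetical. It groups
resources into waves, which is more conservative than a real cluster
manager needs to be.

```python
from concurrent.futures import ThreadPoolExecutor


def start_waves(resources, order_after):
    """Group resources into waves; members of one wave have no unmet
    ordering constraints and may be started in parallel.

    order_after maps a resource name to the set of resources that must
    already be started (hypothetical structure, for illustration only).
    """
    remaining = set(resources)
    started = set()
    waves = []
    while remaining:
        ready = sorted(
            r for r in remaining if order_after.get(r, set()) <= started
        )
        if not ready:
            raise ValueError("circular ordering constraint")
        waves.append(ready)
        started.update(ready)
        remaining.difference_update(ready)
    return waves


def start_all(resources, order_after, start_one, max_parallel=8):
    """Start each wave concurrently, waiting for a wave to finish
    before the next one begins. start_one is a caller-supplied function."""
    for wave in start_waves(resources, order_after):
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            list(pool.map(start_one, wave))


# Hypothetical example: only "web" and "mail" depend on "ldap".
waves = start_waves(
    ["ldap", "web", "mail", "db"],
    {"web": {"ldap"}, "mail": {"ldap"}},
)
```

With 30 guests and only a handful of ordering constraints, nearly
everything would land in the first one or two waves, so the total start
time would be bounded by what the I/O system can take rather than by the
number of guests.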

> I don't have 30 vms, more like 15. But at least one drbd-volume for
> each machine. And dependencies defined so the ldap-server has to be up
> before the others start that need it.
> 
> And while using individual drbd-resources for the machines might be a
> bit more to set up when doing it all at once (my setup has grown over
> time), it allows you to distribute the vms across the two nodes, so the
> vms don't all need to run on one node. And when you also define scores
> for importance, you can over-commit the memory (and cpu) of the two
> nodes so that normally everything runs and only in case of a
> node-failure some not so important vms are stopped/not started.
>
That's something I was planning to do up to a point.
I wonder what you're using to manage/create the VMs; the LCMC GUI is a bit
limited compared to virt-manager or Proxmox when it comes to details like
CPU pinning, which I will need.
OTOH one can always put the config directory onto an OCFS2 device and use
whatever tool is best w/o having to worry about cluster-wide replication.
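
As a rough illustration of the over-commit idea quoted above (everything
runs normally, and after a node failure only the more important guests
fit on the survivor), here is a small Python sketch. It is a simple greedy
selection by priority, not Pacemaker's actual placement logic, and the
guest names, memory sizes, priorities and capacity are made up for the
example.

```python
def select_vms(vms, capacity_mb):
    """Pick which guests to run on a single surviving node.

    vms is a list of (name, mem_mb, priority) tuples; a higher priority
    means more important. Guests are considered in priority order and
    started only if they still fit into the remaining memory.
    Returns (running, stopped) as lists of names.
    """
    running, stopped = [], []
    free = capacity_mb
    for name, mem_mb, _priority in sorted(vms, key=lambda v: (-v[2], v[0])):
        if mem_mb <= free:
            running.append(name)
            free -= mem_mb
        else:
            stopped.append(name)
    return running, stopped


# Hypothetical guests: (name, memory in MB, priority).
guests = [
    ("ldap", 2048, 100),
    ("web", 4096, 50),
    ("test", 4096, 10),
    ("build", 1024, 5),
]
running, stopped = select_vms(guests, capacity_mb=8192)
```

Note that a greedy pass like this lets a small low-priority guest slip in
after a larger, more important one was skipped; whether that is desirable
is a policy decision.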

> On the other hand, if I had to start all over again (without a
> deadline within the next two weeks), I would look at ceph or sheepdog
> for storage and either use pacemaker to manage the vms or take a look
> at openstack's ha-support.
> 
Ganeti supports (with big flashing warning lights though) ceph/RBD as well.
While there is a possibility of this project growing past one pair of
machines, I'm wondering if ceph is worth it until you hit some level where
separation of computing nodes and storage nodes (and all the costly
infrastructure that entails) makes sense.
Having computing and storage in the same box and using DRBD is likely to
give the best performance; I don't think there is a way to ensure with
ceph that reads are always local.

I've run the numbers in the past: you can buy quite a number of paired
machines (local RAID, direct Infiniband interconnect) before building
dedicated storage nodes and VM host nodes with a redundant storage
network (pair of Infiniband switches for starters) becomes cheaper.

> Have fun,
> 
I will, undoubtedly. ^o^

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/