[DRBD-user] DRBD on Public Cloud VM ?

Wed Mar 18 15:03:23 CET 2020

That should have been sent to the drbd-users list for reference, so here it is...

On 18 Mar 2020, at 13:53, Jérôme Barotin <jbn at s4e.fr> wrote:
> 
> We have two use cases :
> 
> 1 - Storage of about 5 GB of small files (10 KB on average) that are written and read very often.
> 
> 2 - Archive storage (2 TB) of files ranging in size from 10 KB to 10+ MB; writes and reads are rarer, and higher latency is not a problem.

Sounds like a typical use case for a simple active/passive single-resource setup or an active/active dual-resource setup. In both, each resource is active on only one node at a time, contains a normal filesystem, and is replicated to the other site. If one VM goes down, its resource fails over to the other node/site.
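To illustrate the dual-resource idea, here is a toy model in Python of where each resource would run. It is only a sketch: the resource and node names are made up, and in a real cluster Pacemaker makes this decision (with fencing, constraints, etc.), not a script like this.

```python
def place_resources(preferred, online):
    """Map each resource to the first preferred node that is online.

    preferred: dict of resource name -> ordered list of candidate nodes
    online:    set of node names that are currently up
    A resource with no online candidate is mapped to None (stopped).
    """
    placement = {}
    for resource, candidates in preferred.items():
        placement[resource] = next((n for n in candidates if n in online), None)
    return placement


# Hypothetical names: one resource per use case, each preferring a different node.
preferred = {
    "r_small_files": ["node-a", "node-b"],
    "r_archive": ["node-b", "node-a"],
}

# Normal operation: each node runs one resource (active/active, dual-resource).
print(place_resources(preferred, {"node-a", "node-b"}))

# node-a is down: its resource fails over, node-b runs both.
print(place_resources(preferred, {"node-b"}))
```

Each resource is still active on only one node at any time, which is why a normal filesystem is sufficient.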

> I'm trying to set up a test environment. I'm using Ubuntu Server as the distribution; is that a good choice? Or would a Red Hat based distribution be easier to work with?

It works essentially identically with both.

> Also, it's not clear how GFS, Corosync/Pacemaker and DRBD fit together. Where could I find some good documentation to understand what I'm doing?

GFS is not necessary in the setup described above, because each resource is mounted on only one node at a time, so a normal filesystem is sufficient. The DRBD User’s Guide is available on the LINBIT homepage, and guides for setting up Pacemaker are available on the clusterlabs.org homepage.

> Thanks for your reply Robert, that is exactly what I'm thinking, but my management team's directives are clear: no bare metal, I have to use cloud solutions only.

Whether or not a cluster can reasonably run across different sites depends on the exact requirements. E.g., if you have simple client software that needs to find a service under a single IP address, then your service needs a service IP address that can be placed on each node in the cluster, which is not possible if the cluster nodes are in different IP subnets. And things become really complicated when people try to improvise workarounds such as VPNs, proxies, etc. to get around those limitations.
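The subnet condition can be checked up front. A minimal sketch in Python using the standard ipaddress module; the addresses and function names are made-up examples, not taken from any real setup:

```python
import ipaddress


def shared_subnet(node_interfaces):
    """Return the common network if all nodes are in one subnet, else None."""
    networks = {ipaddress.ip_interface(i).network for i in node_interfaces}
    return networks.pop() if len(networks) == 1 else None


def service_ip_is_floatable(service_ip, node_interfaces):
    """True if the service IP lies in a subnet shared by every cluster node."""
    net = shared_subnet(node_interfaces)
    return net is not None and ipaddress.ip_address(service_ip) in net


# Hypothetical addresses: two nodes in one subnet vs. two nodes at different sites.
same_site = ["10.0.1.11/24", "10.0.1.12/24"]
two_sites = ["10.0.1.11/24", "10.0.2.11/24"]

print(service_ip_is_floatable("10.0.1.100", same_site))  # True
print(service_ip_is_floatable("10.0.1.100", two_sites))  # False
```

This only checks the addressing; whether a particular cloud provider lets an address move between VMs at all is a separate question.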

The reality is that most operators struggle even with running simple active/passive clusters, at least as soon as some minor unexpected problem occurs in the cluster. A lot of downtime is not due to failed hardware or power outages, but rather due to the operators’ inability to figure out why their software setup is not failing over properly, or is not starting or not working. Every piece of complexity on top of that just makes the situation worse.

br,
Robert