[DRBD-user] DRBD cluster and updating system

Robert Altnoeder robert.altnoeder at linbit.com
Fri Aug 30 13:28:15 CEST 2019


On 8/29/19 9:46 AM, Dušan Maček wrote:
> Dear All,
>
> I need some advice regarding cluster system update. I've built a
> cluster in a hope of zero downtime, but unfortunately it doesn't work
> this way.

Actual "zero downtime" is unrealistic anyway, especially with COTS hard-
and software. The only systems that are somewhat close to achieving no
downtime at all are custom-designed hardware/software combinations that
are highly specialized for the job.

A cluster built from general-purpose hardware and software helps you keep
downtime short in many scenarios - e.g., hardware failure, OS crash,
most software crashes, things like that. Some scenarios remain where the
cluster cannot prevent downtime, typically when a service stops working
correctly but its process does not terminate, so that Pacemaker's
monitoring still sees a process that appears to be running normally.
Such scenarios require operator intervention, and the downtime is
determined by how quickly the operators can diagnose and fix the problem.
Pre-defined standard operating procedures and a highly trained 24/7
staff of system operators can help reduce downtime in those cases.
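
As a rough illustration, a resource definition with a monitor operation
might look like the following (the resource name and agent are just
examples, not something from your setup); how much the monitor actually
detects depends entirely on what the resource agent's monitor action
checks, not on Pacemaker itself:

    # hypothetical example: a web server resource with a 30 second monitor
    pcs resource create p_webserver ocf:heartbeat:apache \
        configfile=/etc/httpd/conf/httpd.conf \
        op monitor interval=30s timeout=20s

If the agent's monitor action only verifies that a process exists, a hung
service will still be reported as running; agents that perform an
application-level check (e.g., fetching a status URL, as the apache agent
does) catch more of these cases.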

To summarize: High availability is not simply a product that can be
installed; rather, it is the result of using the right tools for the job,
putting the right processes in place, and ensuring 24/7 availability of
highly trained system operators.

> Now the main question : what are your experiences with system upgrades
> in cluster environment ? How to avoid downtime ?

The typical process for upgrading software on a cluster is to upgrade
the hardware or software on the standby system, make sure that the
standby system is able to take over (e.g., that it is resynced with the
active system), and then migrate the cluster resources to the standby
system so that the other node can be upgraded.
The migration itself, however, causes a short downtime, because most
services need to be stopped on the active node and restarted on the
standby node.
That is what most people do, and in our experience it works well in most
cases.
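
On a two-node Pacemaker/DRBD cluster, a rough sketch of that procedure
might look like the following (the node names are placeholders, and the
exact commands depend on your cluster shell and versions; crmsh users
would use "crm node standby" / "crm node online" instead of pcs):

    # stop/migrate all resources away from the node that is to be upgraded
    pcs node standby node-b

    # ... upgrade the OS / software on node-b, reboot if necessary ...

    # let the node rejoin and wait until DRBD has finished resynchronizing
    pcs node unstandby node-b
    drbdadm status    # wait until the disk state is UpToDate on both nodes

    # once node-b is fully in sync, fail over and upgrade the other node
    pcs node standby node-a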

You will, however, lose high availability during the upgrade process,
because your standby system will not be ready to run services while you
are upgrading it.

If high availability must be maintained even while upgrading, then at
least a 3-node cluster is required, and your software must be backward
as well as forward compatible. Once you begin running the upgraded
software, your only standby nodes still have the old software, so if the
upgraded node fails before you have had a chance to upgrade another node,
you have another downtime unless your software is forward compatible.

To prevent that scenario as well, you would need at least a 4-node
cluster, so that 2 nodes can keep providing HA services while you upgrade
the other 2 nodes, and you could then fail over to 2 upgraded nodes that
provide HA services using the upgraded software.

Ideally, the cluster would also be able to rely on a quorum for
decision-making at all times, so a cluster of at least 5 nodes would be
even better: with 5 nodes, the 3 nodes that remain in service while 2 are
being upgraded still form a majority.
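
For example, with corosync 2.x or later and pcs, the quorum state can be
checked at any point during the upgrade, and Pacemaker's behavior on loss
of quorum can be set explicitly (the property value to use depends on
your workload; this is only a sketch):

    corosync-quorumtool -s      # expected votes, total votes, quorate yes/no
    pcs quorum status           # similar information via pcs

    # stop all resources if the partition loses quorum instead of
    # continuing to run them
    pcs property set no-quorum-policy=stop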

This is about as close to zero downtime as you can get with COTS hardware
and software:
- Run a 5 node cluster
- Make use of quorum
- Have properly working fencing mechanisms (a quick test is sketched
after this list)
- Upgrade only one node at a time
- Upgrade two (or three) nodes before failing over to a node with
upgraded software
- Then upgrade the remaining nodes one by one until all nodes are upgraded
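
Regarding the fencing point above, one way to gain some confidence before
starting the upgrade cycle is to list the fence devices and deliberately
fence a node that is not currently carrying any resources (the node name
is a placeholder; this really does power-cycle the node, so only do it in
a maintenance window):

    # list the configured fence devices and their state
    pcs stonith

    # fence a test node and verify that it is actually rebooted/powered off
    pcs stonith fence node-e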

There will still be a short downtime during the failover.

That being said, at this point the biggest risk of downtime is probably
operator error, due to the high complexity of such an upgrade (e.g.,
managing the cluster correctly throughout the whole procedure).

br,
Robert


