[DRBD-user] The Problem of File System Corruption w/DRBD

Fri Jun 4 13:15:02 CEST 2021

On 03 Jun 2021, at 21:41, Eric Robinson <eric.robinson at psmnv.com> wrote:
> 
> It's a good thing that DRBD faithfully replicates whatever is passed to it. However, since that is true, it does tend to enable the problem of filesystem corruption taking down a whole cluster. I'm just asking people for any suggestions they may have for alleviating that problem. If it’s not fixable, then it’s not fixable. 
>  
> Part of the reason I’m asking is because we’re about to build a whole new data center, and after 15 years of using DRBD we are beginning to look at other HA options, mainly because of the filesystem as a weak point. I should mention that it has *never* happened before, but the thought of it is scary.

Oh, you’ve opened that can of worms, one of my favorite topics ;)

I guess I have bad news for you, because you have only just found the entrance to that rabbit hole. There are *lots* of things that can take down your entire cluster, and the filesystem is probably the least of your concerns, so I think you’re looking at the wrong thing. Unfortunately, none of them can be fixed by high-availability, because the problem area that you are talking about is not high-availability, it’s high-reliability.

Let me give you a few examples of why high-reliability is something completely different from high-availability:

1. Imagine your application ends up in a corrupted state, but keeps running. Pacemaker might not even see that - the monitoring possibly just sees that the application is still running, so the cluster does not see any need to do anything, but the application does not work anymore.

2. Imagine your application crashes and leaves its data behind in a corrupted state in a file on a perfectly good filesystem - e.g., crashes after having written only 20% of the file’s content. Now Pacemaker restarts the application, but due to the corrupted content in its data file, the application cannot start. Pacemaker migrates the application to another node, which obviously - due to synchronous replication - has the same data. The application cannot start there either. The whole game continues until Pacemaker runs out of nodes to try and start the application on, because it doesn’t work anywhere.

3. Even worse, there could be a bug hidden in Pacemaker or Corosync that crashes the cluster software on all nodes at the same time, so that high-availability is lost. Then, your application crashes. Nothing’s there to restart it anywhere.

4. Ultimate worst case: there could be a bug in the Linux kernel, especially somewhere in the network or I/O stack, that crashes all nodes simultaneously - especially on operations where all of the nodes are doing the same thing, which is not that atypical for clusters - e.g., replication to all nodes, or distributed locking, etc.
It’s not even that unlikely.

You might be shocked to hear that it has already happened to me - while developing or testing/experimenting, e.g., with experimental code. I have even crashed all nodes of an 8-node cluster simultaneously, and not just once. I have also had cases where my cluster fenced all its nodes.
It’s not impossible - BUT it’s also not common on a well-tested production system that doesn’t continuously run tests of crazy corner cases like I do on my test systems.
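
Some of this can at least be narrowed down inside the application itself. As a minimal sketch for example 2 (this is not something DRBD or Pacemaker does for you, and the function name atomic_write is made up for illustration): an application that writes its data to a temporary file, syncs it, and only then renames it over the old file should leave either the complete old file or the complete new file behind after a crash, not a 20% written one. The sketch assumes a POSIX filesystem where rename is atomic.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves the old or the new file.

    Hypothetical helper for illustration; assumes a POSIX filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory, so the rename never
    # crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new content to stable storage
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Something failed before the rename completed: drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Since DRBD replicates writes in order, the peer node would see the same old-or-new state. Note that this only addresses example 2; it does nothing about examples 1, 3 and 4.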

Obviously, adding more nodes does not solve any of those problems. But the real question is whether your use case is so critical that you really need to prevent any of those from occurring even once (because those don’t seem to happen that often, otherwise we would have heard about it).

If it’s really that level of critical, then you’re running the wrong hardware, the wrong operating system and the wrong applications, and what you’re really looking for is a custom-designed high-reliability (not just high-availability) solution, with dissimilar hardware platforms, multiple independent code implementations, formally verified software design and implementation, etc. - like the ones used for special purpose medical equipment, safety-critical industrial equipment, avionics systems, nuclear reactor control, etc. - you get the idea. Now you know why those aren’t allowed to run on general-purpose hardware and software.
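
To illustrate just the "multiple independent code implementations" part, here is a toy sketch of majority voting. All names in it (vote, square_a, square_b, square_buggy) are hypothetical, and a real high-reliability system would run the implementations on dissimilar hardware rather than in one process; this only shows the voting idea, assuming the results are hashable values.

```python
from collections import Counter
from typing import Any, Callable, Sequence


def vote(implementations: Sequence[Callable[[Any], Any]], x: Any) -> Any:
    """Run every implementation on x and return the strict-majority result."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(x))
        except Exception:
            continue  # a crashing implementation simply loses its vote
    if not results:
        raise RuntimeError("all implementations failed")
    winner, count = Counter(results).most_common(1)[0]
    # Require more than half of ALL implementations to agree.
    if count * 2 <= len(implementations):
        raise RuntimeError("no majority among implementations")
    return winner


# Three independently written versions of the same function; one has a bug.
def square_a(n: int) -> int:
    return n * n


def square_b(n: int) -> int:
    return n ** 2


def square_buggy(n: int) -> int:
    return n * n + 1  # deliberate bug for the demonstration


print(vote([square_a, square_b, square_buggy], 7))  # prints 49
```

A single faulty implementation gets outvoted; if no result reaches a majority, the sketch refuses to answer instead of guessing.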