[DRBD-user] The Problem of File System Corruption w/DRBD

Fri Jun 4 15:08:36 CEST 2021

> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com <drbd-user-
> bounces at lists.linbit.com> On Behalf Of Robert Altnoeder
> Sent: Friday, June 4, 2021 6:15 AM
> To: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>
> On 03 Jun 2021, at 21:41, Eric Robinson <eric.robinson at psmnv.com> wrote:
> >
> > It's a good thing that DRBD faithfully replicates whatever is passed to it.
> However, since that is true, it does tend to enable the problem of filesystem
> corruption taking down a whole cluster. I'm just asking people for any
> suggestions they may have for alleviating that problem. If it’s not fixable,
> then it’s not fixable.
> >
> > Part of the reason I’m asking is because we’re about to build a whole new
> data center, and after 15 years of using DRBD we are beginning to look at
> other HA options, mainly because of the filesystem as a weak point. I should
> mention that it has *never* happened before, but the thought of it is scary.
>
> Oh, you’ve opened that can of worms, one of my favorite topics ;)
>
> I guess, I have bad news for you, because you have only just found the
> entrance to that rabbit hole. There are *lots* of things that can take down
> your entire cluster, and the filesystem is probably the least of your concerns
> here, so I think you’re looking at the wrong thing here. Unfortunately, none
> of them can be fixed by high-availability, because the problem area that you
> are talking about is not high-availability, it’s high-reliability.
>
> Let me give you a few examples on why high-reliability is something
> completely different than high-availability:
>
> 1. Imagine your application ends up in a corrupted state, but keeps running.
> Pacemaker might not even see that - the monitoring possibly just sees that
> the application is still running, so the cluster does not see any need to do
> anything, but the application does not work anymore.
>
> 2. Imagine your application crashes and leaves its data behind in a corrupted
> state in a file on a perfectly good filesystem - e.g., crashes after having
> written only 20% of the file’s content. Now Pacemaker restarts the
> application, but due to the corrupted content in its data file, the application
> cannot start. Pacemaker migrates the application to another node, which
> obviously - due to synchronous replication - has the same data. The
> application cannot start there. The whole game continues until Pacemaker
> runs out of nodes to try and start the application, because it doesn’t work
> anywhere.
>
> 3. Even worse, there could be a bug hidden in Pacemaker or Corosync that
> crashes the cluster software on all nodes at the same time, so that high-
> availability is lost. Then, your application crashes. Nothing’s there to restart it
> anywhere.
>
> 4. Ultimate worst case: there could be a bug in the Linux kernel, especially
> somewhere in the network or I/O stack, that crashes all nodes
> simultaneously - especially on operations where all of the nodes are doing
> the same thing, which is not that atypical for clusters - e.g., replication to all
> nodes, or distributed locking, etc.
> It’s not even that unlikely.
>
> You might be shocked to hear that it has already happened to me - while
> developing or testing/experimenting, e.g. with experimental code. I have
> even crashed all nodes of an 8 node cluster simultaneously, and not just
> once. I have also had cases where my cluster fenced all its nodes.
> It’s not impossible - BUT it’s also not common on a well-tested production
> system that doesn’t continuously run tests of crazy corner cases like I do on
> my test systems.
>
> Obviously, adding more nodes does not solve any of those problems. But the
> real question is whether your use case is so critical that you really need to
> prevent any of those from occurring once (because those don’t seem to
> happen that often, otherwise we would have heard about it).
>
> If it’s really that level of critical, then you’re running the wrong hardware, the
> wrong operating system and the wrong applications, and what you’re really
> looking for is a custom-designed high-reliability (not just high-availability)
> solution, with dissimilar hardware platforms, multiple independent code
> implementations, formally verified software design and implementation, etc.
> - like the ones used for special purpose medical equipment, safety-critical
> industrial equipment, avionics systems, nuclear reactor control, etc. - you get
> the idea. Now you know why those aren’t allowed to run on general-purpose
> hardware and software.
>

Those are all good points. Since the three legs of the information security triad are confidentiality, integrity, and availability, this is ultimately a security issue. We all know that information security is not about eliminating all possible risks, as that is an unattainable goal. It is about mitigating risks to acceptable levels. So I guess it boils down to how each person evaluates the risks in their own environment. Over my 38-year career, and especially the past 15 years of using Linux HA, I've seen more filesystem-type issues than the other possible issues you mentioned, so that one tends to feature more prominently on my risk radar.