[DRBD-user] The Problem of File System Corruption w/DRBD

Digimer lists at alteeve.ca
Fri Jun 4 00:31:33 CEST 2021

On 2021-06-03 3:35 p.m., Eric Robinson wrote:
>> Even this approach just moves the SPOF up from the FS to the SQL engine.
>> The problem here is that you're still confusing redundancy with data
>> integrity. To avoid data corruption, you need a layer that understands your
>> data at a sufficient level to know what corruption looks like. Data integrity is
>> yet another topic, and still separate from HA.
>> DRBD, and other HA tools, don't analyze the data, and nor should they
>> (imagine the security and privacy concerns that would open up). If the HA
>> layer is given data to replicate, its job is to faithfully and accurately replicate
>> the data.
> It seems like the two are sometimes intertwined. Is GFS2, for example, about integrity or redundancy? But I'm not really asking how to prevent filesystem corruption. I'm asking (perhaps stupidly) the best/easiest way to make a filesystem redundant.

GFS2 coordinates access between nodes, to ensure that no two step on each
other's blocks and that all of them know when to update their view of the
FS. It still sits above the redundancy layer; at the end of the day, it is
still just a file system.

If, for example, you were writing data to an FS on top of DRBD, and one
node's local storage started failing, the kernel would (should)
inform the DRBD driver that there has been an IO error. In such a case,
the DRBD device should detach from the local store and go diskless. All
further reads/writes on that node would (transparently) go to/from
another node.
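
To make the detach-and-go-diskless behaviour concrete, here is a minimal
Python sketch. It is a toy model only, not DRBD code: the names `Node`,
`ReplicatedDevice` and `LocalDiskError` are made up for illustration, and
the real driver does far more (resync, quorum, fencing and so on).

```python
class LocalDiskError(Exception):
    """Stands in for an IO error reported by the kernel (hypothetical)."""


class Node:
    """One host with its own local backing store (toy model)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}       # block number -> bytes
        self.failing = False   # flip to True to simulate a dying disk
        self.attached = True   # False once the node has gone diskless

    def local_write(self, block, data):
        if self.failing:
            raise LocalDiskError(self.name)
        self.blocks[block] = data

    def local_read(self, block):
        if self.failing:
            raise LocalDiskError(self.name)
        return self.blocks[block]


class ReplicatedDevice:
    """The replicated device as seen from the local node (toy model)."""

    def __init__(self, local, peer):
        self.local = local
        self.peer = peer

    def write(self, block, data):
        # Every write goes to the peer; the local copy is best effort.
        self.peer.local_write(block, data)
        if self.local.attached:
            try:
                self.local.local_write(block, data)
            except LocalDiskError:
                self.local.attached = False  # detach, go diskless

    def read(self, block):
        if self.local.attached:
            try:
                return self.local.local_read(block)
            except LocalDiskError:
                self.local.attached = False  # detach, go diskless
        # Diskless: serve the read transparently from the peer.
        return self.peer.local_read(block)
```

With two nodes `a` and `b`, setting `a.failing = True` after a write still
lets `ReplicatedDevice(a, b).read(...)` return the data, served from `b`,
and the caller never sees the error.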

In this way, I think, you get as close as you can to the goal you're
describing. In such a case, though, you survived a hardware failure,
_exactly_ what HA is all about. You would have no data loss and your
managers would be happy. However, note how this example was below the
data structure... It involved the detection of a hardware fault and
mitigation of that fault.

DRBD (like a RAID array) has no concept of data structures. So if
something at the logic layer wrote bad data (i.e., a user deleting data
or saving bad data), DRBD (again, like a RAID array) only cares to
ensure that the data is on both/all nodes, byte-for-byte accurate. This
is where the role of HA ends, and the roles of anti-virus, security and
data integrity / backups kick in.
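
A small Python sketch of that division of labour, under stated
assumptions: `replicate` stands in for the replication layer, and
`record_with_checksum` / `is_intact` stand in for some layer above it that
understands the data. All three names are hypothetical, and a checksum is
only one example of such a check. It catches accidental damage, not a
user deliberately saving the wrong thing.

```python
import hashlib


def replicate(data: bytes, replicas: list) -> None:
    """Copy the bytes to every replica without looking at them."""
    for replica in replicas:
        replica[:] = data  # replicas are bytearrays


def record_with_checksum(payload: bytes) -> bytes:
    """Application-level framing: SHA-256 digest followed by payload."""
    return hashlib.sha256(payload).digest() + payload


def is_intact(record: bytes) -> bool:
    """Only a layer that knows the framing can tell good from bad."""
    digest, payload = record[:32], record[32:]
    return hashlib.sha256(payload).digest() == digest


node_a, node_b = bytearray(), bytearray()

good = record_with_checksum(b"balance=100")
damaged = good[:-3] + b"XXX"  # something above the mirror wrote bad data

replicate(damaged, [node_a, node_b])

# The replication layer did its job: both copies are identical.
copies_match = node_a == node_b
# The data is still bad, and only the application-level check notices.
data_ok = is_intact(bytes(node_a))
```

Here `copies_match` ends up true while `data_ok` ends up false, which is
the whole point: the replicas agree, and they agree on bad data.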

>> I think the real solution is not technical, it's expectations management. Your
>> managers need to understand what each part of their infrastructure does
>> and does not do. This way, if the concerns around data corruption are
>> sufficient, they can invest in tools to protect the data integrity at the logical
>> layer.
>> HA protects against component failure. That's its job, and it does it well,
>> when well implemented.
> The filesystem is not a hardware component, but it is a cluster resource. The other cluster resources are redundant, with that sole exception. I'm just looking for a way around that problem. If there isn't one, then there isn't.

Consider the example of a virtual machine running on top of DRBD /
pacemaker (a setup I am very familiar with). If the host hardware fails,
the VM can be preventatively migrated or recovered on the peer node. In
this way, the data is preserved (up to the point of failure / reboot),
and services are restored promptly. This is possible because, byte for
byte, the data was written to both host nodes. Voila! Full protection
against hardware faults.

Consider now that your VM gets hit with a cryptolocker virus. That
attack is faithfully replicated to both nodes (exactly as it would
replicate to both hard drives in a RAID 1 array). In this case, you're
out of luck. Why? Because HA doesn't protect data integrity; it can't.
Its role is to protect against hardware faults. This is true of the
filesystem inside a VM, or a file system directly on top of a DRBD resource.
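
The cryptolocker case can be sketched the same way. This is a toy
illustration in Python with made-up names (`Mirror`, `take_backup`,
`encrypt_everything`); it assumes a point-in-time copy was taken before
the attack, which is what a backup provides and what a mirror, by design,
does not.

```python
import copy


class Mirror:
    """Two nodes that always hold the same current state (toy model)."""

    def __init__(self):
        self.node_a = {}
        self.node_b = {}

    def write(self, path, data):
        # Every write lands on both nodes, whatever it contains.
        self.node_a[path] = data
        self.node_b[path] = data


def take_backup(mirror):
    """Point-in-time copy kept separately from the live mirror."""
    return copy.deepcopy(mirror.node_a)


def encrypt_everything(mirror):
    """Stand-in for the malware: overwrite every file through the mirror."""
    for path in list(mirror.node_a):
        mirror.write(path, b"ENCRYPTED")


mirror = Mirror()
mirror.write("db", b"orders")
backup = take_backup(mirror)   # taken before the attack

encrypt_everything(mirror)

# Both nodes now agree on the encrypted data; only the earlier
# point-in-time copy still holds the original.
```

After the attack, `mirror.node_a` and `mirror.node_b` are identical and
both useless, while `backup` still contains the original `"db"` entry.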

The key take-away here is the role of different technologies in your
overall corporate resilience planning. HA is one (very powerful) tool in
a toolbox to protect your services and data. Backups, DR and
anti-malware each play their own role in the big-picture planning.

Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
