[DRBD-user] The Problem of File System Corruption w/DRBD

Thu Jun 3 21:17:40 CEST 2021

> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com <drbd-user-bounces at lists.linbit.com> On Behalf Of Eddie Chapman
> Sent: Thursday, June 3, 2021 1:11 PM
> To: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
>
> On 03/06/2021 13:50, Eric Robinson wrote:
> >> -----Original Message-----
> >> From: Digimer <lists at alteeve.ca>
> >> Sent: Wednesday, June 2, 2021 7:23 PM
> >> To: Eric Robinson <eric.robinson at psmnv.com>;
> >> drbd-user at lists.linbit.com
> >> Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD
> >>
> >> On 2021-06-02 5:17 p.m., Eric Robinson wrote:
> >>> Since DRBD lives below the filesystem, if the filesystem gets
> >>> corrupted, then DRBD faithfully replicates the corruption to the
> >>> other node. Thus the filesystem is the SPOF in an otherwise
> >>> shared-nothing architecture.
> >>> What is the recommended way (if there is one) to avoid the
> >>> filesystem SPOF problem when clusters are based on DRBD?
> >>>
> >>> -Eric
> >>
> >> To start, HA, like RAID, is not a replacement for backups. That is
> >> the answer to a situation like this. HA (and other availability
> >> systems like RAID) protects against component failure. If a node
> >> fails, the peer recovers automatically and your services stay online.
> >> That's what DRBD and other HA solutions strive to provide: uptime.
> >>
> >> If you want to protect against corruption (accidental or intentional,
> >> à la cryptolockers), you need a robust backup system to _complement_
> >> your HA solution.
> >>
> >
> > Yes, thanks, I've said for many years that HA is not a replacement for
> > disaster recovery. Still, it is better to avoid downtime than to recover
> > from it, and one of the main ways to achieve that is through redundancy,
> > preferably a shared-nothing approach. If I have a cool 5-node cluster and
> > the whole thing goes down because the filesystem gets corrupted, I can
> > restore from backup, but management is going to wonder why a 5-node
> > cluster could not provide availability. So the question remains: how to
> > eliminate the filesystem as the SPOF?
> >
>
> Some of the things being discussed here have nothing to do with DRBD.
> DRBD provides a raw block-level device. It neither knows nor cares what
> layers you place above it, whether they be filesystems or some other
> block layer such as LVM or bcache.
>
> It does a very specific job: ensure the blocks you write to a DRBD device get
> replicated and stored in real time on one or more other distributed hosts. If
> you write a 512-byte block of random garbage to a DRBD device, it will (and
> should) write the exact same garbage to the other distributed hosts too, so
> that if you read that same 512-byte block back from any one of those
> individual hosts, you'll get the exact same garbage back.
>
> The OP stated "if the filesystem gets corrupted, then DRBD faithfully
> replicates the corruption to the other node." Good! That's exactly what we
> want it to do. What we definitely do NOT want is for DRBD to manipulate the
> block data given to it in any way whatsoever; we want it to replicate that
> data faithfully.

No need to defend DRBD. We've been using it in production clusters since 2006 and have been phenomenally happy with it. I'm not indicting DRBD at all. Yes, it's good that it faithfully replicates whatever is passed to it. However, since that is true, it does tend to enable the problem of filesystem corruption taking down a whole cluster. I'm just asking people for any suggestions they may have for alleviating that problem.
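
To make the distinction in this thread concrete, here is a toy sketch in Python. It is not DRBD code and does not use any DRBD interface: the class ToyReplicatedDevice, its methods, and the in-memory "peers" are all hypothetical names invented for illustration. It only models the behaviour described above (every block written locally is copied byte-for-byte to the peers) and shows why a point-in-time copy taken before the bad write is the only thing that still holds the good data.

```python
import hashlib

BLOCK_SIZE = 512  # bytes, matching the 512-byte block discussed in the thread


class ToyReplicatedDevice:
    """Hypothetical in-memory model of a replicated block device (not DRBD)."""

    def __init__(self, num_blocks, num_peers=1):
        size = num_blocks * BLOCK_SIZE
        self.local = bytearray(size)
        # Each peer is modelled as its own independent byte store.
        self.peers = [bytearray(size) for _ in range(num_peers)]

    def write_block(self, index, data):
        """Write one block locally and copy it unchanged to every peer."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        start = index * BLOCK_SIZE
        for store in [self.local, *self.peers]:
            store[start:start + BLOCK_SIZE] = data

    def digests(self):
        """SHA-256 of each copy, local first, so copies can be compared."""
        stores = [self.local, *self.peers]
        return [hashlib.sha256(bytes(s)).hexdigest() for s in stores]


def demo():
    dev = ToyReplicatedDevice(num_blocks=4, num_peers=2)

    good = b"G" * BLOCK_SIZE
    dev.write_block(0, good)

    # Point-in-time copy taken while the data is still good (the "backup").
    backup = bytes(dev.local)

    # Something above the block layer now writes garbage to the same block.
    garbage = bytes(range(256)) * 2
    dev.write_block(0, garbage)

    replicas_identical = len(set(dev.digests())) == 1
    backup_still_good = backup[:BLOCK_SIZE] == good
    live_is_garbage = bytes(dev.local[:BLOCK_SIZE]) == garbage
    return replicas_identical, backup_still_good, live_is_garbage


if __name__ == "__main__":
    print(demo())
```

In this model all copies stay identical after the bad write, which is exactly the "faithful replication" being discussed, while only the earlier copy retains the good block. It says nothing about how a real filesystem becomes corrupted or how DRBD is implemented.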

-Eric
