[DRBD-user] The Problem of File System Corruption w/DRBD

Thu Jun 3 21:41:05 CEST 2021

I guess I need to reiterate that I’ve been using DRBD in production clusters since 2006 and have been extremely happy with it. The purpose of my question is not to cast doubt or blame on DRBD for doing its job well. It's a good thing that DRBD faithfully replicates whatever is passed to it. However, precisely because it does, filesystem corruption on one node can take down a whole cluster. I'm just asking people for any suggestions they may have for alleviating that problem. If it’s not fixable, then it’s not fixable.

Part of the reason I’m asking is that we’re about to build a whole new data center, and after 15 years of using DRBD we are beginning to look at other HA options, mainly because of the filesystem as a weak point. I should mention that it has *never* actually happened, but the thought of it is scary.

-Eric

From: drbd-user-bounces at lists.linbit.com <drbd-user-bounces at lists.linbit.com> On Behalf Of Yanni M.
Sent: Thursday, June 3, 2021 2:21 PM
Cc: drbd-user at lists.linbit.com
Subject: Re: [DRBD-user] The Problem of File System Corruption w/DRBD

As others have already mentioned, the job of DRBD is to faithfully and accurately replicate the data from the layers above it. So if the filesystem above the DRBD layer gets corrupted, DRBD will happily replicate that corruption for you, the same way RAID1 would on a pair of HDDs. If you want to reduce the recovery time in such a situation, you could use the snapshot capability of the layers below DRBD (if thin LVM or ZFS is used) to roll back to a previous checkpoint, or implement HA at the layers above DRBD if the application you are using supports it. It really depends on the use case. That being said, filesystem corruption shouldn't be a common thing, and if it occurs you should investigate why it happened in the first place.
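
To make the snapshot suggestion above a bit more concrete, here is a minimal sketch in Python that only builds (and prints) the commands for taking a checkpoint on the storage below DRBD. It is an illustration under assumptions, not a tested procedure: the dataset name tank/drbd_r0, the volume group vg0, the thin volume r0_thin, and the helper function names are all hypothetical, and nothing is executed. Rolling a DRBD backing device back safely also involves the DRBD resource itself (for example taking it down and resyncing the peer afterwards), which depends on the setup and is not shown here.

```python
from datetime import datetime, timezone


def snapshot_name(prefix="pre-change", now=None):
    """Build a timestamped snapshot name such as pre-change-20210603T192100Z."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now.strftime('%Y%m%dT%H%M%SZ')}"


def zfs_snapshot_cmd(dataset, name):
    # `zfs snapshot <dataset>@<name>` takes a point-in-time snapshot.
    return ["zfs", "snapshot", f"{dataset}@{name}"]


def zfs_rollback_cmd(dataset, name):
    # `zfs rollback <dataset>@<name>` reverts to the most recent snapshot.
    return ["zfs", "rollback", f"{dataset}@{name}"]


def lvm_thin_snapshot_cmd(vg, thin_lv, name):
    # For a thin volume, `lvcreate -s` needs no size argument.
    return ["lvcreate", "-s", "-n", name, f"{vg}/{thin_lv}"]


if __name__ == "__main__":
    name = snapshot_name()
    # Hypothetical backing-device names; adjust to the real layout.
    for cmd in (
        zfs_snapshot_cmd("tank/drbd_r0", name),
        lvm_thin_snapshot_cmd("vg0", "r0_thin", name),
    ):
        print(" ".join(cmd))
```

Printing the commands instead of running them keeps the sketch harmless; a real checkpoint job would run them on a schedule on each node and prune old snapshots.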

On Wed, 2 Jun 2021 at 22:50, Eric Robinson <eric.robinson at psmnv.com> wrote:
Since DRBD lives below the filesystem, if the filesystem gets corrupted, then DRBD faithfully replicates the corruption to the other node. Thus the filesystem is the SPOF in an otherwise shared-nothing architecture. What is the recommended way (if there is one) to avoid the filesystem SPOF problem when clusters are based on DRBD?

-Eric

Disclaimer : This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments.