[DRBD-user] What to do about read errors on the primary?

Fri Sep 21 10:30:16 CEST 2012

On Tue, Sep 18, 2012 at 10:05:31AM -0600, Alan Robertson wrote:
> There was another note mentioning backups...
> DRBD is designed to protect against server and disk failures.  Backups
> primarily protect against human errors, disasters and so on - and I do
> have backups...     Snarky comments aren't very helpful and don't have
> much place in civil discourse except maybe with your friends.   The fact
> that you don't want your system to recover from I/O errors is your
> choice.  I'm funny that way -  I want my system to do all it can to
> recover from problems, and minimize data loss...
> 
> In this case, I have a disk failure which I am having trouble getting
> DRBD to protect me against.  I'm perfectly willing to accept that I
> should have configured things differently - which would be why I came
> here asking for help.  In the 10+ years I've been using and recommending
> DRBD, it's never come up for me before.
> 
> On 09/18/2012 05:16 AM, Lars Ellenberg wrote:
> >
> > Alan Robertson <alanr at unix.sh> schrieb:
> >
> >> I have read errors on the primary side, which caused the secondary to
> >> go into an "inconsistent" state.  This means that the disk which
> >> desperately needs backing up, is no longer being backed up (!).
> >>
> >> In an ideal world, it seems to me what one would like for DRBD to do
> >> would be:
> >>    get the data from the secondary
> >>    write it to the primary - which often "fixes" read errors
> >>    continue on syncing everything else to the secondary
> > Well, we don't do this yet.
> > We detach the faulty disk, and resync when you reattach.
> >
> > Platform, kernel version, drbd version, configuration and logs...
> >      ;-)
> I actually figured you'd just tell me what I needed to change - so I
> didn't go grab them the first time.  Nevertheless - mea culpa... ;-)

Well, yes, I could, based on educated guesses as to your configuration.

I would have guessed that you did not configure "on-io-error detach;",
but you should have.

Just because something is a default setting does not mean it is
sensible. The old "default" setting, unsuitably labeled "pass-on",
is not particularly useful.

So yes, that's what you should have changed.
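
For reference, a minimal sketch of where that setting goes, assuming a
resource named "r0" (a placeholder, not taken from your configuration)
and the 8.3/8.4 style config syntax:

    resource r0 {
      disk {
        # On a lower-level I/O error, drop the backing device and
        # continue diskless, serving I/O from the peer.
        on-io-error detach;
      }
      # ... rest of the resource definition unchanged ...
    }

After changing the config on both nodes, "drbdadm adjust r0" should
apply it to the running resource.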

So for now, you had better have some RAID below DRBD.
Hard or soft (md), any redundant RAID level is fine.

Because if your SyncSource fails during resync,
you are out of luck.

With sufficient context information, if you know what you are doing,
and given specific failure modes, you may then be able to fix this by hand.

One such theoretically fixable failure mode would be read errors on
different sectors on each node, while the respective sector on the
other node still has the "correct" data (and you are sure that it is
still the correct data).

But no, DRBD does not do any such "advanced" fixing of multiple
"simultaneously" failing replicas, yet.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed