[DRBD-user] Warning: If using drbd; data-loss / corruption is possible; PLEASE FIX IT !

Lars Ellenberg lars.ellenberg at linbit.com
Wed Aug 16 12:10:48 CEST 2017



On Mon, Aug 14, 2017 at 10:09:06PM +0200, "Thomas Brücker" wrote:
> Dear DRBD-Developers, dear DRBD-Users,
> 
> Actually I would be very fond of DRBD -- but unfortunately I have
> sometimes had data losses (rarely, but I had them).
> 
> FOR DEVELOPERS AND USERS:
> 
> DRBD-Versions concerned: 9.0.7-1, 9.0.8-1, 9.0.9rc1-1 . "THE VERSIONS"
> 
> I think the following configuration options are mandatory to have these
> data losses:
> net {
>     congestion-fill  "1";    # 1 sector
>     on-congestion    "pull-ahead";
>     protocol         "A";
>     [... (other options)]
> }
> (the goal of these settings: a very slow network-connection should not
>  slow down the local disk-io.)

While that is a commendable goal, even without bugs,
this does not do what you apparently think it does.

"pull ahead" is an option that is really only useful when using the
DRBD Proxy: there, the buffered ("in flight") data will be several
hundred MB to several GB, congestion-fill would be ~80% (or more) of
that buffer, and it would take seconds to minutes to drain the already
queued buffer before changing to resync and then back to normal
replication.
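
For contrast, a configuration in the spirit of what pull-ahead was
designed for might look roughly like the sketch below. The buffer size
and the resulting fill threshold are made-up placeholder values, not a
recommendation; tune them to your actual proxy buffer:

    net {
        protocol           A;
        on-congestion      pull-ahead;
        # assuming a DRBD Proxy buffer of ~1G:
        # pull ahead only once ~80% of it is filled
        congestion-fill    800M;
        # optionally also limit the number of dirtied activity-log extents
        congestion-extents 1000;
    }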

Even then, "pull ahead" is considered an emergency brake only,
and certainly not something that is supposed to happen often.

Your configuration basically tells DRBD to "pull ahead" for *each*
write request, then "immediately" start a resync, while the next
write-request already jumps to "ahead" again.

That does not make sense, and DRBD should probably just refuse such a
configuration.  You are using it "out of spec", basically,
and it is very plausible that you hit some bugs when doing so.

That being said, even then DRBD should, once idle, eventually reach a
point where all replicas are identical again.


If you care about two-node scenarios only,
DRBD 8.4 may or may not behave better with pull-ahead,
but the comment above still applies: "ahead" mode is intended for,
and only really useful in, combination with the DRBD Proxy.

> * Supposed Explanation:

Thank you.

> I am longing for a perfectly working DRBD,

Don't we all.

Still, it would not do what you apparently think it would.

"pulling ahead" means that we no longer send the data over,
but only the "LBA numbers" of blocks as they first change.
And that, once the "congestion" is considered to be over,
we start a resync.

Which means the peer becomes sync target.

If you pull ahead "very frequently",
you keep your peer cycling between "behind" and "sync target";
it never really gets the chance to actually catch up.

A sync target is (necessarily, by design) Inconsistent.
Inconsistent means you have a mix of old and new blocks.
Inconsistent data is unusable.

If you "catastrophically" lose your main data copy,
and you are left with an only inconsistent remote copy,
because the peer constantly changed between "behind" and "sync target",
you still need to find your latest consistent backup.

To mitigate that, DRBD has the "before-resync-target" handler, which at
least tries to "snapshot" the latest consistent version of the data
before the sync target becomes inconsistent.
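
For example, with LVM-backed lower-level devices, that could look like
the following sketch, using the snapshot helper scripts that ship with
DRBD (paths and script names may differ between distributions; check
what your packages actually install):

    handlers {
        # take an LVM snapshot of the still-consistent data just before
        # this node turns into an Inconsistent sync target ...
        before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh";
        # ... and remove the snapshot again after the resync completed
        after-resync-target  "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";
    }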

Still, while never idle for long enough, "constantly" cycling between
	Connected,
	Ahead/Behind,
	SyncSource/SyncTarget
is a bad idea.


If you want snapshot shipping,
use a system designed for snapshot shipping.
DRBD is not.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed


