[DRBD-user] Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0

Lars Ellenberg lars.ellenberg at linbit.com
Fri Oct 13 16:04:35 CEST 2017


First, to "all of you":
if someone has some spare hardware and is willing to run the test as
suggested by Eric, please do so.
Both "no corruption reported after X iterations" and "corruption
reported after X iterations" are important feedback.
(State the platform and hardware and storage subsystem configuration
and other potentially relevant info)

Also, interesting question: did you run your non-DRBD tests on the
exact same backend
(LV, partition, lun, slice, whatever),
or on some other "LV" or "partition" on the "same"/"similar" hardware?

Now,
"something" is different between the test runs with and without DRBD.

First suspect was something "strange" happening with TRIM, but you
think you can rule that out,
because you ran the test without trim as well.

The file system itself may cause discards (explicit mount option
"discard", implicit potentially via mount options set in the
superblock), it does not have to be the "fstrim".
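
To check the explicit mount option, something along these lines may help.
This is only a sketch: the function name is made up, and it only looks at
the option list of currently mounted file systems.

```shell
# Print "device mountpoint" for every mount whose option list contains
# "discard" (or "discard=...").
# The optional first argument (default /proc/mounts) is only there so the
# function can be tried against a sample file.
mounts_with_discard() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard" || opts[i] ~ /^discard=/) {
                print $1, $2
                break
            }
    }' "${1:-/proc/mounts}"
}
```

Calling "mounts_with_discard" without an argument reads /proc/mounts.
For the implicit case on ext2/3/4, "tune2fs -l <device>" prints a
"Default mount options:" line from the superblock (<device> is a placeholder).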

Or maybe you still had the fstrim loop running in the background from
a previous test,
or maybe something else does an fstrim.
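
A quick way to look for a leftover fstrim process, as a sketch only
(the function name is made up; it just compares the "comm" entry of each
process in /proc against "fstrim"):

```shell
# Print the PID of every process whose command name is exactly "fstrim".
# The optional first argument (default /proc) is only there so the
# function can be tried against a fake directory tree.
fstrim_pids() {
    proc_root=${1:-/proc}
    for comm in "$proc_root"/[0-9]*/comm; do
        [ -r "$comm" ] || continue
        # A process may exit between the glob and the read; ignore that.
        if [ "$(cat "$comm" 2>/dev/null)" = "fstrim" ]; then
            pid=${comm%/comm}
            echo "${pid##*/}"
        fi
    done
}
```

This only catches an fstrim that is running right now. Periodic jobs
(a cron entry, or the fstrim.timer systemd unit where the distribution
ships one) would have to be checked separately.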

So we should double check that, to really rule out TRIM as a suspect.

You can disable all trim functionality in Linux with
echo 0 > /sys/devices/pci0000:00/0000:00:01.1/ata2/host1/target1:0:0/1:0:0:0/block/sr0/queue/discard_max_bytes
(or similar nodes)

something like this, maybe:
echo 0 | tee  /sys/devices/*/*/*/*/*/*/block/*/queue/discard_max_bytes
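
The fixed-depth glob above only matches devices that sit at exactly that
depth in sysfs. A variant that walks /sys/block instead, which lists every
block device regardless of depth, might look like this. It is a sketch
(the function name is made up), needs root, and also touches logical
devices such as dm-* and md*, which still need the stop/start cycle
described below.

```shell
# Write 0 to queue/discard_max_bytes of every block device found under
# the given root (default /sys/block) and report each device touched.
disable_discards() {
    root=${1:-/sys/block}
    for f in "$root"/*/queue/discard_max_bytes; do
        # Skips devices without a writable attribute (and an unmatched glob).
        [ -w "$f" ] || continue
        if echo 0 > "$f"; then
            dev=${f%/queue/discard_max_bytes}
            echo "${dev##*/}: discard_max_bytes set to $(cat "$f")"
        fi
    done
}
```

Run "disable_discards" as root to apply it to the real devices.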

To have that take effect for "higher level" or "logical" devices,
you'd have to "stop and start" those,
so deactivate DRBD, deactivate volume group, deactivate md raid,
then reactivate all of it.
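
For a stack of DRBD on top of LVM on top of md raid, one such cycle could
look roughly like what the helper below prints. It is a sketch that only
prints the commands, it does not run them. The resource, volume group and
array names are placeholders, the layering is an assumption about your
setup, and "mdadm --assemble <array>" assumes the array is listed in
mdadm.conf.

```shell
# Print (not run) the commands for one stop/start cycle of the stack
# DRBD -> LVM volume group -> md raid. All three names are placeholders.
print_restack_plan() {
    res=$1
    vg=$2
    md=$3
    cat <<EOF
drbdadm down $res
vgchange -an $vg
mdadm --stop $md
mdadm --assemble $md
vgchange -ay $vg
drbdadm up $res
EOF
}
```

For example, "print_restack_plan r0 vg0 /dev/md0". Anything mounted on top
of the DRBD device has to be unmounted first, otherwise the first step
will refuse.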

Double check with "lsblk -D" that discards are now really disabled.
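
If there are many devices, a small filter over the lsblk output may be
easier than reading the table by eye. A sketch (the function name is made
up; it expects "NAME DISC-MAX" pairs with sizes in bytes on stdin):

```shell
# Read "NAME DISC-MAX" pairs (sizes in bytes) on stdin and print the
# devices that still advertise a non-zero maximum discard size.
still_discarding() {
    awk '$2 != 0 { print $1 }'
}
```

Feed it with "lsblk -l -n -b -o NAME,DISC-MAX | still_discarding".
No output means no device advertises discard support any more.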

then re-run the tests.


If corruption is still reported even when we are "certain" that discard is
out of the picture,
that is an important data point as well.

What changes when DRBD is in the IO stack?
Timing (when does the backend device see which request) may be changed.
Maximum request size may be changed.
Maximum *discard* request size *will* be changed,
which may result in differently split discard requests on the backend stack.
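
To see how the request size limits actually differ, the queue attributes
of the DRBD device and of its backend can be put side by side. A sketch
(the function name is made up, and the device names in the usage line are
placeholders):

```shell
# Print selected request-size limits of one block device, one per line,
# as "<dev> <attribute>=<value>". Attributes that do not exist are skipped.
# The optional second argument (default /sys/block) is only there so the
# function can be tried against a fake directory tree.
queue_limits() {
    dev=$1
    root=${2:-/sys/block}
    for attr in max_sectors_kb max_hw_sectors_kb discard_max_bytes discard_granularity; do
        f=$root/$dev/queue/$attr
        if [ -r "$f" ]; then
            printf '%s %s=%s\n' "$dev" "$attr" "$(cat "$f")"
        fi
    done
}
```

For example, "queue_limits drbd0; queue_limits dm-3", with the second name
replaced by whatever the backend of that DRBD device is.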

Also, we have additional memory allocations for DRBD meta data and housekeeping,
so possibly different memory pressure.

End of brain-dump.


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
