[DRBD-user] drbd-9.0.26

Digimer lists at alteeve.ca
Wed Dec 23 07:41:52 CET 2020

On 2020-12-22 5:43 a.m., Philipp Reisner wrote:
> Dear DRBD users,
> This is a big release. The release candidate phase lasted more than a
> month.  Bug reports and requests were coming in concurrently from different
> customers/users working on different use-cases and scenarios.
> One example: the XCP-ng driver developers need to quickly switch all nodes
> to primary for a short time, right after the initial resync has started.
> Nobody else does that, so they uncovered an issue.
> Another one: KVM on DRBD on ZFS zvols. We learned the hard way that the
> guest within KVM might issue read requests with a size of 0 (zero!). I guess
> that is used for discovery, maybe a SCSI scan. The size-0 read is processed
> by DRBD, but older versions of ZFS react with a kernel OOPS!
> The two most important fixes are those that address possible sources of data
> corruption. Both were reported by a cloud provider from China. Apparently,
> they have a fresh way of testing, so they were able to identify these issues.
> One is about write requests that come in on a primary while it is in the
> process of starting a partial/bitmap-based resync (repl: WFBitMapS). Those
> write requests might not get mirrored. The bug can happen with just two
> nodes, although more nodes probably increase the likelihood of hitting it.
> The volume needs to be reasonably large, because a small bitmap reduces the
> chance of triggering it. Expect a tight-loop test to run for multiple hours
> to trigger it once.
> There is a whole story behind it. Many years ago, DRBD simply blocked
> incoming write requests during that state. Then we had to optimize DRBD for
> 'uniform write latencies' and allowed write requests to proceed while in the
> WFBitMapS state, introducing an additional packet to send late bitmap
> updates in this state. Later, other changes related to state handling
> finally opened the window for this bug.
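The invariant behind this bug can be illustrated with a toy model (my own sketch, not DRBD source): a write that arrives while a bitmap-based resync is being set up must either be mirrored right away or have its bitmap bit set, otherwise the block is silently lost from the peer.

```python
# Toy replicated device: two block arrays plus a dirty bitmap.
BLOCKS = 8

primary   = [0] * BLOCKS
secondary = [0] * BLOCKS
bitmap    = [False] * BLOCKS   # True = block still needs resync

def write(block, value, mirror_now):
    primary[block] = value
    if mirror_now:
        secondary[block] = value   # normal replicated write
    else:
        bitmap[block] = True       # not mirrored now, so remember it for resync

def resync():
    # Copy every block whose bitmap bit is set, then clear the bit.
    for i in range(BLOCKS):
        if bitmap[i]:
            secondary[i] = primary[i]
            bitmap[i] = False

# A write landing during the WFBitMapS window, not mirrored immediately:
write(3, 42, mirror_now=False)
resync()
print(secondary[3])   # 42 -- the bitmap bit made the resync pick it up
```

Dropping the `bitmap[block] = True` line models the bug: the write would reach neither the mirror path nor the resync, and the secondary would keep stale data.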
> The second bug in this category requires 3 nodes or more. It requires a
> resync between two nodes, while the 3rd node is primary and connected only
> to the sync source of the other two. Again, you need to do a lot of I/O on
> the primary and a fast resync; then it can happen that a few bits towards
> node 3 are missing on the primary. This can lead to a later resync from
> the primary to the third node missing these blocks.
> Bugs are bad, and those that can cause inconsistencies in the mirror are
> especially bad. One way to maneuver a production system beyond this is to
> use the online-verify mechanism to find out whether your DRBD resources are
> affected. It also sets the bits for the blocks it finds out of sync. Get in
> touch with us via support, on the community Slack channel, or on the mailing
> list in case you are affected.
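In practice that means running `drbdadm verify <resource>` on each resource (see the DRBD user's guide for configuring a `verify-alg` first). Conceptually, online verify hashes each block on both peers, compares the digests, and records every mismatch as out of sync, roughly like this illustrative sketch (not drbdadm itself; block lists and names are invented):

```python
import hashlib

def verify(source_blocks, target_blocks):
    """Return the set of block indices whose contents differ between peers,
    comparing digests rather than shipping full blocks over the wire."""
    out_of_sync = set()
    for i, (s, t) in enumerate(zip(source_blocks, target_blocks)):
        if hashlib.sha256(s).digest() != hashlib.sha256(t).digest():
            out_of_sync.add(i)
    return out_of_sync

node_a = [b"alpha", b"bravo", b"charlie"]
node_b = [b"alpha", b"BRAVO", b"charlie"]
print(verify(node_a, node_b))   # {1}
```

In real DRBD the mismatching blocks are marked in the bitmap, so a subsequent resync (e.g. after disconnect/connect) repairs exactly those blocks.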
> I recommend that everyone upgrade any drbd-9 installation to 9.0.26.
> 9.0.26-1 (api:genl2/proto:86-118/transport:14)
> --------
>  * fix a source of possible data corruption; related to a resync and
>    a primary node that is connected to the sync-source node only
>  * fix for writes not getting mirrored over a connection while the primary
>    transitions through the WFBitMapS state
>  * complete size-0 reads immediately; some workloads (KVM and
>    iSCSI targets) in combination with a ZFS zvol as the backend can lead to
>    a kernel OOPS in ZFS; this is a workaround in DRBD for that
>  * fix a crash if during resync a discard operation fails on the
>    resync-target node
>  * fix a case of a disk unexpectedly becoming Outdated by moving the
>    exchange of the initial packets into the body of the two-phase-commit
>    that happens at a connect
>  * fix sporadic "Clearing bitmap UUID for node" log entries;
>    a potential source of problems later on, leading to false split-brain
>    or unrelated-data messages
>  * retry connect properly in case of bitmap-uuid changes during the handshake
>  * complete missing logic of the new two-phase-commit based connect process;
>    avoid connecting partitions with a primary in each; ensure consistent
>    decisions on whether the connect attempt will be retried
>  * fix an unexpected occurrence of NetworkFailure state in a tight
>    drbdsetup disconnect; drbdsetup connect sequence
>  * fix online verify to return to Established from VerifyS if the VerifyT node
>    was temporarily Inconsistent during the run
>  * fix a corner case where a node ends up Outdated after the crash and rejoin
>    of a primary node
>  * pause a resync if the sync-source node becomes inconsistent; an example
>    is a cascading resync where the upstream resync aborts and leaves the
>    sync-source node for the downstream resync with an inconsistent disk;
>    note, the node at the end of the chain could still have an outdated disk
>    (better than inconsistent)
>  * reduce lock contention on the secondary for many resources; can improve
>    performance significantly
>  * fix online verify to not clamp disk states to UpToDate
>  * fix promoting resync-target nodes; the problem was that it could modify
>    the bitmap of an ongoing resync; which leads to alarming log messages
>  * allow force primary on a sync-target node by breaking the resync
>  * fix adding of new volumes to resources with a primary node
>  * reliably detect split-brain situations on both nodes
>  * improve error reporting for failures during attach
>  * implement 'blockdev --setro' in DRBD
>  * follow upstream changes to DRBD up to Linux 5.10 and ensure
>    compatibility with Linux 5.8, 5.9, and 5.10
> https://www.linbit.com/downloads/drbd/9.0/drbd-9.0.26-1.tar.gz
> https://github.com/LINBIT/drbd/commit/8e0c552326815d9d2bfd1cfd93b23f5692d7109c

Thanks for this release! We've just updated and will report back if we
have any issues.



Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
