[DRBD-user] drbd-9.0.26

Digimer lists at alteeve.ca
Wed Dec 23 07:41:52 CET 2020

On 2020-12-22 5:43 a.m., Philipp Reisner wrote:
> Dear DRBD users,
> This is a big release. The release candidate phase lasted more than a
> month.  Bug reports and requests were coming in concurrently from different
> customers/users working on different use-cases and scenarios.
> One example: the XCP-ng driver developers need to quickly switch all nodes
> to primary for a short time, right after the initial resync has started.
> Nobody else does that, so they uncovered an issue.
> Another one: KVM on DRBD on ZFS zvols. We learned the hard way that the
> guest within KVM might issue read requests with a size of 0 (zero!). I guess
> that is used for discovery, maybe a SCSI scan. The size-0 read is processed
> by DRBD, but older versions of ZFS react with a kernel OOPS!
> The two most important fixes are those that address possible sources of data
> corruption. Both were reported by a cloud provider from China. Apparently,
> they have a fresh way of testing, so they were able to identify these issues.
> One is about write requests that come in on a primary while it is in the
> process of starting a partial/bitmap-based resync (repl: WFBitMapS). Those
> write requests might not get mirrored. The bug can happen with just two
> nodes, although more nodes probably increase the likelihood of hitting it.
> The volume needs to be reasonably large, because a small bitmap reduces the
> chance of triggering it. Expect a tight-loop test to run for multiple hours
> to trigger it once.
> There is a whole story behind it. Many years ago, DRBD simply blocked
> incoming write requests during that state. Then we had to optimize DRBD for
> 'uniform write latencies' and allowed write requests to proceed while in the
> WFBitMapS state, introducing an additional packet to send late bitmap
> updates in this state. Later, other changes related to state handling
> finally opened the window for this bug.
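The invariant behind this bug can be illustrated with a toy model (my own sketch, not DRBD source): a write that arrives while a bitmap-based resync is being set up must either be mirrored right away or have its bitmap bit set, otherwise the block is silently lost from the peer.

```python
# Toy replicated device: two block arrays plus a dirty bitmap.
BLOCKS = 8

primary   = [0] * BLOCKS
secondary = [0] * BLOCKS
bitmap    = [False] * BLOCKS   # True = block still needs resync

def write(block, value, mirror_now):
    primary[block] = value
    if mirror_now:
        secondary[block] = value   # normal replicated write
    else:
        bitmap[block] = True       # not mirrored now, so remember it for resync

def resync():
    # Copy every block whose bitmap bit is set, then clear the bit.
    for i in range(BLOCKS):
        if bitmap[i]:
            secondary[i] = primary[i]
            bitmap[i] = False

# A write landing during the WFBitMapS window, not mirrored immediately:
write(3, 42, mirror_now=False)
resync()
print(secondary[3])   # 42 -- the bitmap bit made the resync pick it up
```

Dropping the `bitmap[block] = True` line models the bug: the write would reach neither the mirror path nor the resync, and the secondary would keep stale data.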
> The second bug in this category requires 3 nodes or more. It requires a
> resync between two nodes, while the 3rd node is primary and connected only
> to the sync source of the other two. Again, you need to do a lot of I/O on
> the primary and a fast resync; then it can happen that a few bits towards
> node 3 are missing on the primary. This can lead to a later resync from
> the primary to the third node missing these blocks.
> Bugs are bad, and those that can cause inconsistencies in the mirror are
> especially bad. One way to maneuver a production system beyond this is to
> use the online-verify mechanism to find out whether your DRBD resources are
> affected. It also sets the bits for the blocks it finds out of sync. Get in
> touch with us via support, on the community Slack channel, or on the mailing
> list in case you are affected.
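In practice that means running `drbdadm verify <resource>` on each resource (see the DRBD user's guide for configuring a `verify-alg` first). Conceptually, online verify hashes each block on both peers, compares the digests, and records every mismatch as out of sync, roughly like this illustrative sketch (not drbdadm itself; block lists and names are invented):

```python
import hashlib

def verify(source_blocks, target_blocks):
    """Return the set of block indices whose contents differ between peers,
    comparing digests rather than shipping full blocks over the wire."""
    out_of_sync = set()
    for i, (s, t) in enumerate(zip(source_blocks, target_blocks)):
        if hashlib.sha256(s).digest() != hashlib.sha256(t).digest():
            out_of_sync.add(i)
    return out_of_sync

node_a = [b"alpha", b"bravo", b"charlie"]
node_b = [b"alpha", b"BRAVO", b"charlie"]
print(verify(node_a, node_b))   # {1}
```

In real DRBD the mismatching blocks are marked in the bitmap, so a subsequent resync (e.g. after disconnect/connect) repairs exactly those blocks.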
> I recommend that everyone upgrade any drbd-9 installation to 9.0.26.
> 9.0.26-1 (api:genl2/proto:86-118/transport:14)
> --------
>  * fix a source of possible data corruption; related to a resync and
>    a primary node that is connected to the sync-source node only
>  * fix for writes not getting mirrored over a connection while the primary
>    transitions through the WFBitMapS state
>  * complete size-0 reads immediately; some workloads (KVM and
>    iSCSI targets) in combination with a ZFS zvol as the backend can lead to
>    a kernel OOPS in ZFS; this is a workaround in DRBD for that
>  * fix a crash if during resync a discard operation fails on the
>    resync-target node
>  * fix a case of a disk unexpectedly becoming Outdated by moving the
>    exchange of the initial packets into the body of the two-phase-commit
>    that happens at a connect
>  * fix sporadic "Clearing bitmap UUID for node" log entries;
>    a potential source of problems later on, leading to false split-brain
>    or unrelated-data messages
>  * retry connect properly in case of bitmap-uuid changes during the handshake
>  * complete missing logic of the new two-phase-commit based connect process;
>    avoid connecting partitions with a primary in each; ensure consistent
>    decisions on whether the connect attempt will be retried
>  * fix an unexpected occurrence of NetworkFailure state in a tight
>    drbdsetup disconnect; drbdsetup connect sequence
>  * fix online verify to return to Established from VerifyS if the VerifyT node
>    was temporarily Inconsistent during the run
>  * fix a corner case where a node ends up Outdated after the crash and rejoin
>    of a primary node
>  * pause a resync if the sync-source node becomes inconsistent; an example
>    is a cascading resync where the upstream resync aborts and leaves the
>    sync-source node for the downstream resync with an inconsistent disk;
>    note, the node at the end of the chain could still have an outdated disk
>    (better than inconsistent)
>  * reduce lock contention on the secondary for many resources; can improve
>    performance significantly
>  * fix online verify to not clamp disk states to UpToDate
>  * fix promoting resync-target nodes; the problem was that it could modify
>    the bitmap of an ongoing resync; which leads to alarming log messages
>  * allow force primary on a sync-target node by breaking the resync
>  * fix adding of new volumes to resources with a primary node
>  * reliably detect split-brain situations on both nodes
>  * improve error reporting for failures during attach
>  * implement 'blockdev --setro' in DRBD
>  * follow upstream changes to DRBD up to Linux 5.10 and ensure
>    compatibility with Linux 5.8, 5.9, and 5.10
> https://www.linbit.com/downloads/drbd/9.0/drbd-9.0.26-1.tar.gz
> https://github.com/LINBIT/drbd/commit/8e0c552326815d9d2bfd1cfd93b23f5692d7109c

Thanks for this release! We've just updated and will report back if we
have any issues.



Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
