[DRBD-user] DRBD 9.x sync corruption when using resync-after

Maunder, Shaun Shaun.Maunder at ncr.com
Tue Jul 26 13:17:05 CEST 2022


We have narrowed the problem down to writes being issued into the DRBD device whilst the peer is in
the Outdated state.

There is an updated reproducer script which removes the blktrace/fio replay requirements.

Script link: https://gist.github.com/ShMaunder/25c6483a8cf29312d5ca409f99466090

The script performs the following steps:

1. Down the volume-1 and volume-2 DRBD resources.

2. Recreate backing & metadata volumes for volume-1 and volume-2. 
   Create new metadata, bring up the resources and make sure they connect.
   Set a new UUID to skip a full resync and promote the resources to Primary on one of the nodes.

3. Disconnect the resources.
   Inject ~250MB of random data into volume-1. ~250MB prevents an instant resync of volume-1.
   Inject a sector of random data into volume-2. Only a single sector to trigger the resync.

4. Connect the volume-1 and volume-2 DRBD resources.
   Explicitly pause sync on volume-2.
   Volume-2 will stay in the Outdated state even after volume-1 has resync'd.
   
5. Inject another sector of random data into volume-2.
   This sector won't resync, and neither will any other changed sectors not already marked for resync.

6. The secondary side waits for the primary side to reconnect its resources.

7. Resume sync of volume-2 and wait till the sync has completed.
   Check for DRBD's resync warning message in the journal.
   
8. Verify whether volume-1 and volume-2 backing devices are in sync via checksum & drbdadm verify.
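
For readers who do not want to open the gist, the primary-side part of steps 3 to 7 can be sketched roughly as follows. This is not the linked script. The device paths /dev/drbd1 and /dev/drbd2, the write offset in step 5 and the DRY_RUN wrapper are assumptions made for illustration, and the resync-after dependency is assumed to be configured in the resource files already. With DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of steps 3-7 on the primary node. Prints the commands unless DRY_RUN=0.
# Assumed (not from the original post): device paths and the step 5 offset.
DRY_RUN="${DRY_RUN:-1}"
DEV1="${DEV1:-/dev/drbd1}"   # assumed DRBD device of volume-1
DEV2="${DEV2:-/dev/drbd2}"   # assumed DRBD device of volume-2

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reproduce_primary() {
    # Step 3: disconnect, then dirty both volumes while disconnected.
    run drbdadm disconnect volume-1
    run drbdadm disconnect volume-2
    run dd if=/dev/urandom of="$DEV1" bs=1M count=250 oflag=direct
    run dd if=/dev/urandom of="$DEV2" bs=512 count=1 oflag=direct

    # Step 4: reconnect and explicitly pause the resync of volume-2.
    # (A real run has to wait for the connections to establish first.)
    run drbdadm connect volume-1
    run drbdadm connect volume-2
    run drbdadm pause-sync volume-2

    # Step 5: write one more sector to volume-2 while its resync is paused.
    run dd if=/dev/urandom of="$DEV2" bs=512 count=1 seek=8 oflag=direct

    # Step 7: resume the resync of volume-2.
    # (Waiting for completion and checking the journal are omitted here.)
    run drbdadm resume-sync volume-2
}

reproduce_primary
```

The waits between the steps and the secondary-side handling from step 6 are left out on purpose; the gist remains the reference for those.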


Some analysis shows metric inconsistencies between the nodes while volume-2 is in the Outdated
state. For example, the rs_total counter seems to be misaligned between the nodes and increments
only on the primary side as more data is changed.
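
One way to watch for this kind of divergence is to sample per-peer statistics on both nodes and compare them by hand. The sketch below pulls the out-of-sync counter from the output of `drbdsetup status --verbose --statistics`. Note that out-of-sync is a different counter from rs_total; it is used here only because it is readily available in the status output. The helper name extract_oos is hypothetical.

```shell
# Hypothetical helper: read `drbdsetup status <res> --verbose --statistics`
# output on stdin and print each out-of-sync value, one per line.
extract_oos() {
    grep -o 'out-of-sync:[0-9]*' | cut -d: -f2
}

# Intended use on each node (needs a running DRBD, so left commented out):
# drbdsetup status volume-2 --verbose --statistics | extract_oos
```

Running this on both nodes while volume-2 is paused, and again after more data has been written, should make it visible whether only one side's counter moves.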

I have found that the data changes from step 5 which were not sync'd can be quickly sync'd as follows:

a. Disconnect volume-2.

b. Inject another random sector into volume-2.

c. Connect volume-2.

d. Both the random sector and all the changes from step 5 are quickly sync'd.
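
Steps a to c can be sketched as below. As before, the device path /dev/drbd2 and the DRY_RUN wrapper are assumptions for illustration, and with DRY_RUN=1 (the default here) the commands are only printed. The dd write overwrites a sector with random data, so this is only meant for a throwaway test volume.

```shell
#!/bin/sh
# Sketch of workaround steps a-c. Prints the commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
DEV2="${DEV2:-/dev/drbd2}"   # assumed DRBD device of volume-2

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

force_quick_sync() {
    run drbdadm disconnect volume-2                                 # step a
    run dd if=/dev/urandom of="$DEV2" bs=512 count=1 oflag=direct   # step b
    run drbdadm connect volume-2                                    # step c
}

force_quick_sync
```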

