[DRBD-user] verify/disconnect/connect doesn't resync?

Wed Oct 6 13:03:48 CEST 2021

On 29/09/2021 01:10, Chris Pacejo wrote:
> Hi, I have a three-node active/passive DRBD cluster, operating with default configuration.  I had to replace disks on one of the nodes (call it node A) and resync the cluster.
> 
> Somehow, after doing this, A was not in sync with the primary (node C); I only discovered this because I couldn't even mount the filesystem on it after (temporarily) making A primary.  I don't fully understand how I got into this situation but that's a tangent for now.
> 
> Following instructions in the documentation, I enabled a verification algorithm, and instructed A to verify (`drbdadm verify <my volume>`).  It correctly found many discrepancies (gigabytes worth!) and emitted the ranges to dmesg.
> 
> I then attempted to resynchronize A with C (the primary) by running `drbdadm disconnect <my volume>` and then `drbdadm connect <my volume>`, again, following documentation.  This did not appear to do anything, despite verify having just found nearly the entire disk to be out of sync.  Indeed, running verify a second time produced the exact same results.
> 
> Instead I forced a full resync by bringing A down, invalidating it, and bringing it back up again.  Now verification showed A and C to be in sync.

What I usually do in this situation, to avoid the drastic step of
completely invalidating a secondary node, is: disconnect the
secondary, force a tiny change on the primary (e.g. touch and delete an
empty file on the filesystem, or run a filesystem check that updates the
fs metadata), then reconnect. I believe a plain disconnect/connect does
nothing because no writes have hit the primary while disconnected. The
tiny change forces a resync and, in my experience and from what I can
tell by the number of KB resynced, that resync includes the blocks
verify found out of sync.
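
A minimal sketch of those three steps as shell functions, to be run on
the node named in each comment. The resource name and mount point are
passed in as arguments and are placeholders, not values from this
thread; the `run` wrapper and `DRY_RUN` switch are hypothetical
conveniences for previewing the commands, not part of DRBD.

```shell
#!/bin/sh
# Sketch of the disconnect / small write / reconnect workaround.
# With DRY_RUN=1 the commands are only printed, not executed.

run() {
    # Print the command; execute it unless DRY_RUN=1.
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Step 1, on the out-of-sync secondary: drop the replication link.
# $1 is the DRBD resource name (placeholder).
secondary_disconnect() {
    run drbdadm disconnect "$1"
}

# Step 2, on the primary: make a tiny write so there is something
# to resync. $1 is the mount point of the DRBD-backed filesystem.
primary_small_write() {
    f="$1/.drbd-resync-nudge"   # hypothetical scratch file name
    run touch "$f"
    run rm -f "$f"
    run sync
}

# Step 3, on the secondary again: reconnect so the resync can start.
secondary_reconnect() {
    run drbdadm connect "$1"
}
```

For example, `DRY_RUN=1 secondary_disconnect r0` (with `r0` standing in
for the real resource name) only prints `+ drbdadm disconnect r0`.
Whether the resulting resync really covers every block flagged by
verify should be checked with a second verify run.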

> 
> However A was still showing a small number (thousands) of discrepancies with node B (the other secondary node).  So I repeated the above steps on B -- verify/disconnect/connect/verify -- and again, nothing changed.  B still shows discrepancies between it and both A and C.
> 
> Running the same steps on node C (the primary) again found discrepancies with B, and again failed to resynchronize.
> 
> What am I missing?  Is there an additional step needed to convince DRBD to resynchronize blocks found to mismatch during verify?
> 
> Further questions --
> 
> Why does `drbdadm status` not show whether out-of-sync blocks were found by `drbdadm verify`?  Instead it shows UpToDate like nothing is wrong.
> 
> Why is resynchronization only triggered on reconnect?  Is there a downside to simply starting resynchronization when out-of-sync blocks are discovered?

I believe this has simply been left to the user, who can take whatever
action is desired via the out-of-sync handler. I suppose some people
might not want any automatic action taken and would rather have a
helper script send them a notification so they can intervene manually.
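
A minimal sketch of such a notification script, assuming it is
registered as the `out-of-sync` entry in the resource's `handlers`
section and that `DRBD_RESOURCE` is present in the handler's
environment. The script path, function names, and syslog tag are
hypothetical, and it only logs; it takes no corrective action.

```shell
#!/bin/sh
# Hypothetical notification script for verify mismatches.
# Assumed to be wired up in the resource configuration roughly as:
#   handlers { out-of-sync "/usr/local/sbin/drbd-oos-notify.sh"; }
# (the path and script name are made up for this example).

oos_message() {
    # DRBD_RESOURCE is assumed to be set when the handler is invoked;
    # fall back to "unknown" if it is not.
    echo "DRBD verify found out-of-sync blocks on resource ${DRBD_RESOURCE:-unknown}; manual resync needed"
}

oos_notify() {
    # Log to syslog if logger is available; never fail the handler.
    if command -v logger >/dev/null 2>&1; then
        logger -t drbd-oos "$(oos_message)" || true
    fi
}

oos_notify
```

Replacing the `logger` call with a mail command would turn this into an
e-mail notification instead of a syslog entry.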

Eddie

> 
> Version info:
> DRBDADM_BUILDTAG=GIT-hash:\ 5acfd06032d4c511c75c92e58662eeeb18bd47db\ build\ by\ ec2-user at test-cluster-c.cpacejo.test\,\ 2021-07-06\ 20:48:54
> DRBDADM_API_VERSION=2
> DRBD_KERNEL_VERSION_CODE=0x090102
> DRBD_KERNEL_VERSION=9.1.2
> DRBDADM_VERSION_CODE=0x091200
> DRBDADM_VERSION=9.18.0
> 
> dmesg logs below.
> 
> Thanks,
> Chris

<snip>