Unsynced blocks if replication is interrupted during initial sync

Tim Westbrook Tim_Westbrook at selinc.com
Fri Mar 22 01:08:11 CET 2024



Thank you


So if seeing "Copying bitmap of peer node_id=0" on reconnect after an interruption indicates the issue, then the issue still exists for me.

I am able to dump the metadata, but I am not sure it is very useful at this point.

I have not tried invalidating it after a mount/unmount, nor have I tried invalidating it after adding a node, but we were trying to avoid unmounting once configured. 

Would you recommend against going back to a release version prior to this change?

Is there any other information I can provide that would help? Could I dump the metadata at some point to show the expected/unexpected state?

The latest flow is below.

Thank you so much for your assistance,
Tim

1. /dev/vg/persist mounted directly without drbd
2. Enable DRBD by creating a single node configuration file
3. Reboot
4. Create metadata on separate disk (--max-peers=5)
5. drbdadm up persist
6. drbdadm invalidate persist
7. drbdadm primary --force persist
8. drbdadm down persist
9. drbdadm up persist
10. drbdadm invalidate persist*
11. drbdadm primary --force persist
12. mount /dev/drbd0 to /persist
13. start using that mount point
14. some time later
15. Modify configuration to add new target backup node 
16. Copy config to remote node and reboot, it will restart in secondary
17. drbdadm adjust persist (on primary)
18. secondary comes up and initial sync starts
19. stop at 50% by disabling network interface
20. re-enable network interface
21. sync completes right away - node-id 0 message here
22. drbdadm verify persist - reports many out-of-sync blocks
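For reference, the two-node configuration that steps 15-16 arrive at would look roughly like the resource file below. This is a sketch only: the hostnames, addresses, and device paths are placeholders I have invented, not values taken from this thread.

```
resource persist {
    device    /dev/drbd0;
    disk      /dev/vg/persist;
    meta-disk /dev/vg/persist-meta;   # separate metadata disk, per step 4

    on node-a {                       # the original primary
        node-id   1;
        address   192.168.1.1:7789;
    }
    on node-b {                       # backup node added in step 15
        node-id   2;
        address   192.168.1.2:7789;
    }

    connection {
        host node-a;
        host node-b;
    }
}
```

The single-node configuration from step 2 would be the same file with only the node-a section and no connection block; drbdadm adjust in step 17 then picks up the added peer.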




From: Joel Colledge <joel.colledge at linbit.com>
Sent: Wednesday, March 20, 2024 12:02 AM
To: Tim Westbrook <Tim_Westbrook at selinc.com>
Cc: drbd-user at lists.linbit.com <drbd-user at lists.linbit.com>
Subject: Re: Unsynced blocks if replication is interrupted during initial sync
 

> We are still seeing the issue as described but perhaps I am not putting the invalidate
> at the right spot
>
> Note - I've added it at step 6 below, but I'm wondering if it should be after
> the additional node is configured and adjusted (in which case I would need to
> unmount as apparently you can't invalidate a disk in use)
>
> So do I need to invalidate after every node is added?

With my reproducer, the workaround at step 6 works.

> Also note, the node-id in the logs from the kernel is 0, but the peers are configured with 1 and 2;
> is this an issue, or are they separate IDs?

I presume you are referring to the line:
"Copying bitmap of peer node_id=0"
The reason that node ID 0 appears here is that DRBD stores a bitmap of
the blocks that have changed since it was first brought up. This is
the "day0" bitmap. This is stored in all unused bitmap slots. All
unused node IDs point to one of these bitmaps. In this case, node ID 0
is unused. So this line means that it is using the day0 bitmap here.
This is unexpected, as mentioned in my previous reply.
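The slot-sharing behaviour described above can be illustrated with a small conceptual model. This is not DRBD source code, just a sketch of the idea that every unused bitmap slot aliases one shared day0 bitmap, which records all blocks changed since the metadata was created:

```python
# Conceptual model (NOT the DRBD implementation): metadata created with
# --max-peers=5 has 5 bitmap slots. Slots belonging to configured peers
# track changes relative to that peer; every unused slot points at the
# single shared "day0" bitmap.

class Metadata:
    def __init__(self, max_peers, configured_peers):
        self.day0 = set()  # blocks changed since the device was brought up
        self.peer_bitmaps = {p: set() for p in configured_peers}
        self.slots = {}
        for node_id in range(max_peers):
            if node_id in configured_peers:
                self.slots[node_id] = self.peer_bitmaps[node_id]
            else:
                # unused node IDs all alias the day0 bitmap
                self.slots[node_id] = self.day0

    def write(self, block):
        # a write marks the block dirty toward every peer and in day0
        self.day0.add(block)
        for bm in self.peer_bitmaps.values():
            bm.add(block)

md = Metadata(max_peers=5, configured_peers={1, 2})
md.write(7)
# Node ID 0 is unused, so its slot is the day0 bitmap; that is why the
# log line "Copying bitmap of peer node_id=0" means the day0 bitmap
# was used as the sync source.
assert md.slots[0] is md.day0
```

In this model, reading slot 0 for a resync would hand back the full day0 change history rather than the per-peer state, which matches Joel's point that using it on a reconnect mid-sync is unexpected.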

Joel

