Verify consistently fails after rebooting secondary node

Tim Westbrook Tim_Westbrook at selinc.com
Fri Jan 17 19:04:48 CET 2025


Some updates on this issue

To recap: with 2 nodes, primary and secondary, the secondary is rebooted after an
initial sync, and a subsequent verify always detects out-of-sync sectors.

It seems to occur on the 9.2.4 version of the driver as well as on all of the kernel
versions we have been using, so it appears to be unrelated to any changes we have
made in system startup.

We are working around this problem by invalidating the disk and doing a full
resync after a reboot, which is fairly onerous for large disks.

We have not been able to confirm corruption when no connection is made back to
another node after the reboot, but this is harder to validate, as the system may
boot with corruption.

What expectations should we have for integrity on a shutdown? Reboot? Power loss? 

Where could we look closer at trying to understand this issue? 



________________________________________
From: Tim Westbrook <Tim_Westbrook at selinc.com>
Sent: Tuesday, December 24, 2024 11:01 AM
To: drbd-user at lists.linbit.com <drbd-user at lists.linbit.com>
Subject: Verify consistently fails after rebooting secondary node
 
Hello

We are observing the following issue with resync after reboot.

After rebooting a secondary node (in a 2 or 3 node cluster), the
secondary successfully connects to primary and reports UpToDate, but
when a verify is launched on the secondary node that was rebooted, it reports
out of sync blocks.

If an "invalidate --reset-bitmap=no" is issued on the resource on the secondary
node, the invalidate sync happens quickly and the next verify succeeds with
no out of sync blocks.
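
For anyone trying to reproduce this, here is a minimal sketch of how the
out-of-sync check could be scripted. It assumes the counters are read from
`drbdsetup status --verbose --statistics`, which prints an `out-of-sync:<n>`
field per peer device. The helper names `sum_out_of_sync` and `check_resource`
are hypothetical and not part of our setup; "persist" is our resource name.

```shell
#!/bin/sh
# Sum every "out-of-sync:<n>" counter found in the text read from stdin.
# Intended input: output of `drbdsetup status --verbose --statistics <res>`.
# Prints 0 when no counter is present.
sum_out_of_sync() {
    grep -o 'out-of-sync:[0-9][0-9]*' \
        | cut -d: -f2 \
        | awk '{ total += $1 } END { print total + 0 }'
}

# Hypothetical wrapper: report whether a resource currently shows
# out-of-sync data towards any peer. Returns 1 if it does.
check_resource() {
    res="$1"
    oos=$(drbdsetup status --verbose --statistics "$res" | sum_out_of_sync)
    if [ "$oos" -gt 0 ]; then
        echo "$res: out-of-sync counter total is $oos"
        return 1
    fi
    echo "$res: no out-of-sync blocks reported"
}

# Sequence on the rebooted secondary (left commented out; run by hand):
#   drbdadm verify persist
#   ... wait for the verify to finish (watch the kernel log) ...
#   check_resource persist
#   drbdadm invalidate --reset-bitmap=no persist
#   ... wait for the resync, then verify and check again ...
```

The parsing is kept separate from the `drbdsetup` call so it can be run
against saved status output as well as a live node.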

This was initially detected when we promoted a backup node and it came up with
disk corruption. We traced this to the reboot occurring before the promotion.

 Versions

The logs attached are using the 9.2.12 version of the driver on the 5.15.173 kernel,
but we have also observed this issue on the 9.2.4 driver with the 5.15.166 kernel

We have not seen the problem on 5.15.151 and version 9.2.4 of the driver.


 Attachments

initsyncandverify_noreboot.txt - drbd logs from system prior to reboot , includes
verify before reboot

verify_after_invalidate_no_reset.txt - drbd logs after reboot, showing the initial
failed verify, then the invalidate, then a successful verify

dynamic.res - drbd conf file - note use of separate metadata disk - we also


 Secondary Bring Up

Secondary nodes enable drbd "persist" resource as follows
 
 """
  da up all || true
  da secondary persist || true
  da disconnect persist || true
  da -- --discard-my-data connect persist || true
"""

