[DRBD-user] DRBD corruption with kmod-drbd90-9.1.8-1

Tue Aug 16 20:30:13 CEST 2022

I just had my second DRBD cluster fail after updating to 
kmod-drbd90-9.1.8-1 and then upgrading the kernel. I'm not sure whether 
the kernel update itself broke things or whether the problem only 
surfaced because of the reboot. About 2 weeks ago an update 
(kmod-drbd90-9.1.8-1) from elrepo was applied. Then, after a kernel 
update, the DRBD metadata was corrupt. Here's the gist of the error:

This is from the AlmaLinux 8 cluster:

Aug  7 16:41:13 nfs6 kernel: drbd r0: Starting worker thread (from 
drbdsetup [3515])
Aug  7 16:41:13 nfs6 kernel: drbd r0 nfs5: Starting sender thread (from 
drbdsetup [3519])
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
Aug  7 16:41:13 nfs6 kernel: attempt to access beyond end of 
device#012sdb1: rw=6144, want=31250710528, limit=31250706432
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: 
drbd_md_sync_page_io(,31250710520s,READ) failed with error -5
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: Error while reading metadata.

This is from a CentOS 7 cluster:
Aug 16 11:04:57 v4 kernel: drbd r0 v3: Starting sender thread (from 
drbdsetup [9486])
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
Aug 16 11:04:57 v4 kernel: attempt to access beyond end of device
Aug 16 11:04:57 v4 kernel: sdb1: rw=1072, want=3905945600, limit=3905943552
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: 
drbd_md_sync_page_io(,3905945592s,READ) failed with error -5
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: Error while reading metadata.
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Called drbdadm -c 
/etc/drbd.conf -v adjust r0
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Exit code 1
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command output: 
drbdsetup new-peer r0 0 --_name=v3 --fencing=resource-only 
--protocol=C#012drbdsetup new-path r0 0 ipv4:10.1.4.82:7788 
ipv4:10.1.4.81:7788#012drbdmeta 0 v09 /dev/sdb1 internal 
apply-al#012drbdsetup attach 0 /dev/sdb1 /dev/sdb1 internal
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command stderr: 0: 
Failure: (118) IO error(s) occurred during initial access to 
meta-data.#012#012additional info from kernel:#012Error while reading 
metadata.#012#012Command 'drbdsetup attach 0 /dev/sdb1 /dev/sdb1 
internal' terminated with exit code 10
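
To put numbers on the "attempt to access beyond end of device" lines, 
here is a small Python sketch that works out how far past the end of 
sdb1 each request went. The want/limit values are copied from the logs 
above; the 512-byte sector size and the helper name overshoot() are my 
own assumptions for illustration, not anything DRBD provides.

```python
import re

# Assumption: the kernel reports want/limit in 512-byte sectors.
SECTOR_BYTES = 512

# Kernel messages copied from the two log excerpts above.
LOG_LINES = {
    "nfs6": "sdb1: rw=6144, want=31250710528, limit=31250706432",
    "v4": "sdb1: rw=1072, want=3905945600, limit=3905943552",
}

PATTERN = re.compile(r"want=(\d+), limit=(\d+)")


def overshoot(line):
    """Return (sectors, bytes) by which the request ran past the device end."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"no want/limit pair in: {line!r}")
    want, limit = (int(group) for group in match.groups())
    sectors = want - limit
    return sectors, sectors * SECTOR_BYTES


for host, line in LOG_LINES.items():
    sectors, nbytes = overshoot(line)
    print(f"{host}: {sectors} sectors ({nbytes // 1024} KiB) past the end of sdb1")
```

Under that assumption the request ends 4096 sectors (2 MiB) past the 
end of sdb1 on nfs6 and 2048 sectors (1 MiB) past it on v4. In both 
logs the failed read starts exactly 8 sectors (4 KiB) before "want", 
which might mean DRBD is looking for its internal metadata at the end 
of a device it believes is larger than what the kernel reports. That 
reading is only a guess from the logs.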

Both clusters have been running flawlessly for ~2 years. I was in the 
process of building a new DRBD cluster to offload the first one when the 
2nd production cluster had a kernel update and ran into the exact same 
issue. On the first cluster (rhel8/alma) I deleted the metadata and 
tried to resync the data over; however, it failed with the same issue. 
I'm in the process of building a new one to replace that broken DRBD 
cluster. In the last 15 years of using DRBD I have never run into any 
corruption issues. I'm at a loss; I thought the first one was a fluke; 
now I know it's not!