[DRBD-user] DRBD corruption with kmod-drbd90-9.1.8-1

Christoph Böhmwalder christoph.boehmwalder at linbit.com
Fri Aug 19 10:14:44 CEST 2022


Am 16.08.22 um 20:30 schrieb Brent Jensen:
> I just had my second DRBD cluster fail after updating to
> kmod-drbd90-9.1.8-1 and then upgrading the kernel. I'm not sure if the
> kernel update itself broke things or if the problem only surfaced after
> the reboot. About 2 weeks ago an update (kmod-drbd90-9.1.8-1) from
> elrepo got applied. Then, after a kernel update, the DRBD metadata was
> corrupt. Here's the gist of the error:
> 
> This is using alma-linux 8:
> 
> Aug  7 16:41:13 nfs6 kernel: drbd r0: Starting worker thread (from
> drbdsetup [3515])
> Aug  7 16:41:13 nfs6 kernel: drbd r0 nfs5: Starting sender thread (from
> drbdsetup [3519])
> Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
> Aug  7 16:41:13 nfs6 kernel: attempt to access beyond end of
> device#012sdb1: rw=6144, want=31250710528, limit=31250706432
> Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0:
> drbd_md_sync_page_io(,31250710520s,READ) failed with error -5
> Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: Error while reading metadata.
> 
> This is from a centos 7 cluster:
> Aug 16 11:04:57 v4 kernel: drbd r0 v3: Starting sender thread (from
> drbdsetup [9486])
> Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
> Aug 16 11:04:57 v4 kernel: attempt to access beyond end of device
> Aug 16 11:04:57 v4 kernel: sdb1: rw=1072, want=3905945600, limit=3905943552
> Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0:
> drbd_md_sync_page_io(,3905945592s,READ) failed with error -5
> Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: Error while reading metadata.
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Called drbdadm -c
> /etc/drbd.conf -v adjust r0
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Exit code 1
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command output:
> drbdsetup new-peer r0 0 --_name=v3 --fencing=resource-only
> --protocol=C#012drbdsetup new-path r0 0 ipv4:10.1.4.82:7788
> ipv4:10.1.4.81:7788#012drbdmeta 0 v09 /dev/sdb1 internal
> apply-al#012drbdsetup attach 0 /dev/sdb1 /dev/sdb1 internal
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command stderr: 0:
> Failure: (118) IO error(s) occurred during initial access to
> meta-data.#012#012additional info from kernel:#012Error while reading
> metadata.#012#012Command 'drbdsetup attach 0 /dev/sdb1 /dev/sdb1
> internal' terminated with exit code 10
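
(A quick aside on the numbers above, assuming -- as is usual for this kernel
message -- that "want" and "limit" are counted in 512-byte sectors: the failed
metadata read lands a little past the end of the backing partition. This is
just the arithmetic from the logs, nothing more:)

    # difference between the requested sector and the device size, per the logs
    echo $(( 31250710528 - 31250706432 ))   # 4096 sectors = 2 MiB (alma 8 node)
    echo $(( 3905945600 - 3905943552 ))     # 2048 sectors = 1 MiB (centos 7 node)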
> 
> Both clusters have been running flawlessly for ~2 years. I was in the
> process of building a new DRBD cluster to offload the first one when
> the 2nd production cluster had a kernel update and ran into the exact
> same issue. On the first cluster (rhel8/alma) I deleted the metadata
> and tried to resync the data over; however, it failed with the same
> issue. I'm now in the process of building a new one to fix that broken
> DRBD cluster. In the last 15 years of using DRBD I have never run into
> any corruption issues. I'm at a loss; I thought the first one was a
> fluke; now I know it's not!
> 

Hello,

thank you for the report.

We have implemented a fix for this[0] which will be released soon (i.e.
very likely within the next week).

If you can easily do so (and if this is a non-production system), it
would be great if you could build DRBD from that commit and verify that
the fix resolves the issue for you.
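
In case it is useful, here is a rough sketch of how such a test build
could look. This is not a verified recipe: it assumes a build toolchain
and the headers for your running kernel are installed, and that the
usual make targets of the drbd repository apply.

    # sketch only -- adjust to your environment
    git clone --recursive https://github.com/LINBIT/drbd.git
    cd drbd
    git checkout d7d76aad2b95dee098d6052567aa15d1342b1bc4
    git submodule update --init
    make                 # build the out-of-tree module against the running kernel
    sudo make install
    sudo depmod -a
    # after a module reload (or reboot), confirm the loaded version:
    cat /proc/drbd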

If not, the obvious workaround is to stay on 9.1.7 for now (or downgrade).
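
For the downgrade route, something along these lines should work with
the elrepo packages; whether the older version is still available in
your repositories is an assumption on my side, so please check before
running anything.

    # AlmaLinux 8 / RHEL 8 -- sketch only
    dnf downgrade kmod-drbd90
    # CentOS 7
    yum downgrade kmod-drbd90
    # optionally exclude the package so the next update does not pull 9.1.8 back in:
    echo "exclude=kmod-drbd90*" >> /etc/yum.conf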

[0]
https://github.com/LINBIT/drbd/commit/d7d76aad2b95dee098d6052567aa15d1342b1bc4

-- 
Christoph Böhmwalder
LINBIT | Keeping the Digital World Running
DRBD HA —  Disaster Recovery — Software defined Storage

