[DRBD-user] DRBD corruption with kmod-drbd90-9.1.8-1

Brent Jensen jeneral9 at gmail.com
Tue Aug 16 21:05:32 CEST 2022


Issue at elrepo already reported:
https://elrepo.org/bugs/view.php?id=1250
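
For reference, the size of the over-read can be checked with a few lines of
arithmetic. This is just a quick sketch; the sector numbers are copied
verbatim from the two kernel logs quoted below (DRBD keeps internal metadata
near the end of the backing device, so a read past the device's sector limit
means the computed metadata offset is wrong, or the device shrank):

```python
# Sector numbers (512-byte sectors) copied from the kernel logs below.
clusters = {
    "nfs6 (AlmaLinux 8)": {"want": 31250710528, "limit": 31250706432},
    "v4 (CentOS 7)":      {"want": 3905945600,  "limit": 3905943552},
}

for name, s in clusters.items():
    overshoot_sectors = s["want"] - s["limit"]
    overshoot_mib = overshoot_sectors * 512 / (1024 * 1024)
    print(f"{name}: read ends {overshoot_sectors} sectors "
          f"({overshoot_mib:.0f} MiB) past end of device")
```

So the alma cluster tried to read 2 MiB past the end of sdb1 and the centos
cluster 1 MiB past it, in both cases right where the internal metadata
should sit.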

Brent

On 8/16/2022 11:30 AM, Brent Jensen wrote:
> I just had my second DRBD cluster fail after updating to 
> kmod-drbd90-9.1.8-1 and then upgrading the kernel. I'm not sure 
> whether the kernel update itself broke things or whether the problem 
> only surfaced because of the reboot. About 2 weeks ago an update 
> (kmod-drbd90-9.1.8-1) from elrepo got applied, and then after a 
> kernel update the DRBD metadata was corrupt. Here's the gist of the 
> error:
>
> This is on AlmaLinux 8:
>
> Aug  7 16:41:13 nfs6 kernel: drbd r0: Starting worker thread (from 
> drbdsetup [3515])
> Aug  7 16:41:13 nfs6 kernel: drbd r0 nfs5: Starting sender thread 
> (from drbdsetup [3519])
> Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
> Aug  7 16:41:13 nfs6 kernel: attempt to access beyond end of 
> device#012sdb1: rw=6144, want=31250710528, limit=31250706432
> Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: 
> drbd_md_sync_page_io(,31250710520s,READ) failed with error -5
> Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: Error while reading 
> metadata.
>
> This is from a CentOS 7 cluster:
> Aug 16 11:04:57 v4 kernel: drbd r0 v3: Starting sender thread (from 
> drbdsetup [9486])
> Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
> Aug 16 11:04:57 v4 kernel: attempt to access beyond end of device
> Aug 16 11:04:57 v4 kernel: sdb1: rw=1072, want=3905945600, 
> limit=3905943552
> Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: 
> drbd_md_sync_page_io(,3905945592s,READ) failed with error -5
> Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: Error while reading metadata.
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Called drbdadm -c 
> /etc/drbd.conf -v adjust r0
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Exit code 1
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command output: 
> drbdsetup new-peer r0 0 --_name=v3 --fencing=resource-only 
> --protocol=C#012drbdsetup new-path r0 0 ipv4:10.1.4.82:7788 
> ipv4:10.1.4.81:7788#012drbdmeta 0 v09 /dev/sdb1 internal 
> apply-al#012drbdsetup attach 0 /dev/sdb1 /dev/sdb1 internal
> Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command stderr: 0: 
> Failure: (118) IO error(s) occurred during initial access to 
> meta-data.#012#012additional info from kernel:#012Error while reading 
> metadata.#012#012Command 'drbdsetup attach 0 /dev/sdb1 /dev/sdb1 
> internal' terminated with exit code 10
>
> Both clusters had been running flawlessly for ~2 years. I was in the 
> process of building a new DRBD cluster to offload the first one when 
> the 2nd production cluster got a kernel update and ran into the exact 
> same issue. On the first cluster (rhel8/alma) I deleted the metadata 
> and tried to resync the data over; however, it failed with the same 
> issue. I'm now in the process of building a new one to replace that 
> broken DRBD cluster. In 15 years of using DRBD I have never run into 
> any corruption issues. I'm at a loss; I thought the first one was a 
> fluke; now I know it's not!
>
> _______________________________________________
> Star us on GITHUB: https://github.com/LINBIT
> drbd-user mailing list
> drbd-user at lists.linbit.com
> https://lists.linbit.com/mailman/listinfo/drbd-user

