[DRBD-user] drbd9.2 resync stuck with drbd_set_in_sync: sector=<...>s size=<...> nonsense!
Nils Juergens
nils at muon.de
Tue Oct 25 15:49:02 CEST 2022
Dear DRBD-users,
We are currently performing an upgrade from Proxmox VE 6 to VE 7 on a
three-node LINSTOR/DRBD cluster. (Only two nodes are storage+compute
nodes / satellites; the third is the linstor-controller + quorum node.)
This is a testing environment that we built in preparation for the
upgrade of the live cluster.
Before starting the upgrade we were on LINSTOR 1.11, drbd-dkms 9.0.27
and PVE 6.3. Our upgrade route was to first upgrade LINSTOR to 1.20,
then upgrade all nodes to PVE 6.4 and DRBD 9.2 (9.0.27-1 -> 9.2.0-1).
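Per node this boiled down to roughly the following (repository changes and
pinning omitted, so take this as a sketch rather than the exact transcript):

  apt update
  apt install drbd-dkms linstor-satellite linstor-proxmox   # linstor-controller on the controller node
  apt full-upgrade                                          # brings the node to pve 6.4
  reboot                                                    # so the new drbd 9.2 module gets loaded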
After a fresh boot of all nodes we were in a good state: healthy cluster,
pve6to7 happy, DRBD in sync and all packages up to date.
We then upgraded the first node to PVE 7, which seemed to go well, and
rebooted it into pve-7.2-11. As we have three active VMs with three disk
resources, this triggered a DRBD resync.
Two resources came out fine:
drbd1000 Testserver1: Resync done (total 2 sec; paused 0 sec; 104448 K/sec)
drbd1002 Testserver1: Resync done (total 55 sec; paused 0 sec; 92120 K/sec)
The third resource, however, synced about 65% of the outdated data and
then stalled (no more sync traffic, no progress in drbdmon).
The kernel message that seems to be relevant here is this:
drbd vm-101-disk-1/0 drbd1001: drbd_set_in_sync: sector=73703424s
size=134479872 nonsense!
More kernel logs from the PVE 7 node can be found here:
https://pastebin.com/aGjy7Sgp
So far we have tried rebooting the PVE 7 node, but it always gets
stuck in Inconsistent/SyncTarget (no sync progress percentage shown) and
prints the kernel error message "drbd_set_in_sync: sector=73703424s
size=134479872 nonsense".
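If it helps, we can capture more state from the SyncTarget node; these are
the kind of commands we would run (resource name as above):

  drbdsetup status vm-101-disk-1 --verbose --statistics
  drbdsetup events2 --now vm-101-disk-1
  dmesg | grep drbd1001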
The LINSTOR resources are backed by lvm_thin, which in turn sits on a
MegaRAID RAID1 of SSD drives.
I don't know if this is relevant, but the VM in question has at some
point in its lifetime been rolled back to a snapshot. (All snapshots
were removed prior to the upgrades.)
At the time, the rollback worked fine, but we noticed a huge increase in
allocated space on the backing device (IIRC it was equal to the
virtual disk size). We have set "discard=on" in Proxmox and ran fstrim
inside the VM, which cut down the space usage, though it is still not
equal on both nodes (rough commands below, after the table):
root@Testserver3:~# linstor resource list-volumes
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node        ┊ Resource      ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊ State        ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ Testserver1 ┊ vm-100-disk-1 ┊ ssd_thin    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  2.28 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-100-disk-1 ┊ ssd_thin    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  2.50 GiB ┊ Unused ┊ UpToDate     ┊
┊ Testserver1 ┊ vm-101-disk-1 ┊ ssd_thin    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 35.38 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-101-disk-1 ┊ ssd_thin    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 31.05 GiB ┊ Unused ┊ Inconsistent ┊
┊ Testserver1 ┊ vm-102-disk-1 ┊ ssd_thin    ┊     0 ┊    1002 ┊ /dev/drbd1002 ┊  7.04 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-102-disk-1 ┊ ssd_thin    ┊     0 ┊    1002 ┊ /dev/drbd1002 ┊  7.04 GiB ┊ Unused ┊ UpToDate     ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
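Coming back to the discard/fstrim point above, this is roughly what we did
to reclaim space and how we check thin allocation on the nodes (device
names inside the VM and exact options may differ, so treat this as a sketch):

  # inside the VM, after setting discard=on on the disk in Proxmox
  fstrim -av

  # on the storage nodes, to see thin LV usage
  lvs -o lv_name,lv_size,data_percent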
The linstor-created resource looks like this:
https://pastebin.com/syLADBdC
Relevant version numbers:
drbd-dkms: 9.2.0-1
linstor-(controller|satellite): 1.20.0-1
linstor-proxmox: 6.1.0-1
proxmox-ve versions: 6.4-1 (two nodes) and 7.2-1 (one node)
kernel: 5.4.203-1-pve (two nodes) and 5.15.64-1-pve (one node)
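For completeness, the DRBD module version actually loaded on each node can
be double-checked with, e.g.:

  cat /proc/drbd
  modinfo drbd | grep ^version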
Any insight on this would be most welcome. I'll provide more details if
you feel something is missing.
thanks and kind regards,
Nils