[DRBD-user] drbd9.2 resync stuck with drbd_set_in_sync: sector=<...>s size=<...> nonsense!
Nils Juergens
nils at muon.de
Tue Oct 25 15:49:02 CEST 2022
Dear DRBD-users,
We are currently performing an upgrade from Proxmox VE 6 to VE 7 on a
three-node LINSTOR/DRBD cluster. (Only two nodes are storage+compute
nodes / satellites; the third is the linstor-controller + quorum node.)
This is a testing environment that we built in preparation for the
upgrade of the live cluster.
Before starting the upgrade we were on LINSTOR 1.11, drbd-dkms 9.0.27
and PVE 6.3. Our upgrade route was to first upgrade LINSTOR to 1.20,
then upgrade all nodes to PVE 6.4 and DRBD 9.2 (9.0.27-1 -> 9.2.0-1).
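Per node this boiled down to roughly the following (repository changes and
pinning omitted, so take this as a sketch rather than the exact transcript):

  apt update
  apt install drbd-dkms linstor-satellite linstor-proxmox   # linstor-controller on the controller node
  apt full-upgrade                                          # brings the node to pve 6.4
  reboot                                                    # so the new drbd 9.2 module gets loaded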
After a fresh boot of all nodes we were in a good state: healthy cluster,
pve6to7 happy, DRBD in sync and all packages up to date.
We then upgraded the first node to PVE 7, which seemed to go well, and
rebooted it into pve-7.2-11. As we have three active VMs with three disk
resources, this triggered a DRBD resync.
Two resources came out fine:
drbd1000 Testserver1: Resync done (total 2 sec; paused 0 sec; 104448 K/sec)
drbd1002 Testserver1: Resync done (total 55 sec; paused 0 sec; 92120 K/sec)
The third resource, however, synced about 65% of the outdated data and
then stalled (no more sync traffic, no progress in drbdmon).
The kernel message that seems to be relevant here is this:
drbd vm-101-disk-1/0 drbd1001: drbd_set_in_sync: sector=73703424s
size=134479872 nonsense!
More kernel logs from the PVE 7 node can be found here:
https://pastebin.com/aGjy7Sgp
So far we have tried rebooting the PVE 7 node, but it always gets
stuck in Inconsistent/SyncTarget (no sync progress percentage shown) and
prints the kernel error message "drbd_set_in_sync: sector=73703424s
size=134479872 nonsense".
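If it helps, we can capture more state from the SyncTarget node; these are
the kind of commands we would run (resource name as above):

  drbdsetup status vm-101-disk-1 --verbose --statistics
  drbdsetup events2 --now vm-101-disk-1
  dmesg | grep drbd1001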
The LINSTOR resources are backed by lvm_thin, which in turn sits on a
MegaRAID RAID1 of SSD drives.
I don't know if this is relevant, but the VM in question has at some
point in its lifetime been rolled back to a snapshot. (All snapshots
were removed prior to the upgrades.)
At the time, the rollback worked fine, but we noticed a huge increase in
allocated space on the backing device (IIRC it was equal to the
virtual disk size). We have set "discard=on" in Proxmox and ran fstrim
inside the VM, which cut down the space usage, though it is still not
equal on both nodes (rough commands below, after the table):
root@Testserver3:~# linstor resource list-volumes
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node        ┊ Resource      ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊ State        ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ Testserver1 ┊ vm-100-disk-1 ┊ ssd_thin    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  2.28 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-100-disk-1 ┊ ssd_thin    ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  2.50 GiB ┊ Unused ┊ UpToDate     ┊
┊ Testserver1 ┊ vm-101-disk-1 ┊ ssd_thin    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 35.38 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-101-disk-1 ┊ ssd_thin    ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 31.05 GiB ┊ Unused ┊ Inconsistent ┊
┊ Testserver1 ┊ vm-102-disk-1 ┊ ssd_thin    ┊     0 ┊    1002 ┊ /dev/drbd1002 ┊  7.04 GiB ┊ InUse  ┊ UpToDate     ┊
┊ Testserver2 ┊ vm-102-disk-1 ┊ ssd_thin    ┊     0 ┊    1002 ┊ /dev/drbd1002 ┊  7.04 GiB ┊ Unused ┊ UpToDate     ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
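Coming back to the discard/fstrim point above, this is roughly what we did
to reclaim space and how we check thin allocation on the nodes (device
names inside the VM and exact options may differ, so treat this as a sketch):

  # inside the VM, after setting discard=on on the disk in Proxmox
  fstrim -av

  # on the storage nodes, to see thin LV usage
  lvs -o lv_name,lv_size,data_percent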
The linstor-created resource looks like this:
https://pastebin.com/syLADBdC
Relevant version numbers:
drbd-dkms: 9.2.0-1
linstor-(controller|satellite): 1.20.0-1
linstor-proxmox: 6.1.0-1
proxmox-ve versions: 6.4-1 (two nodes) and 7.2-1 (one node)
kernel: 5.4.203-1-pve (two nodes) and 5.15.64-1-pve (one node)
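For completeness, the DRBD module version actually loaded on each node can
be double-checked with, e.g.:

  cat /proc/drbd
  modinfo drbd | grep ^version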
Any insight on this would be most welcome. I'll provide more details if
you feel something is missing.
thanks and kind regards,
Nils