[DRBD-user] Proxmox with Linstor: Online migration / disk move problem

Łukasz Wąsikowski lukasz at wasikowski.net
Tue Nov 23 13:11:55 CET 2021


Hi Roland,

On 2021-11-23 at 08:41, Roland Kammerer wrote:

> I have heard of that once before, but never experienced it myself and so
> far no customers complained so I did not dive into it.
> 
> If you can reproduce it, that would be highly appreciated. To me it
> looks like the plugin and LINSTOR basically did their job, but then
> something else happens. These are just random thoughts that might be
> complete nonsense:
> 
> - maybe some size rounding error and the resulting DRBD device is just a
>    tiny bit too small. If you can reproduce it, I would check the sizes
>    of source/destination. If the failure only happens at the end, it
>    should start writing data first. So does it take some time until it
>    fails? Do you see that data written at the beginning of the DRBD
>    block device matches the source? But maybe there is already a size
>    check at the beginning and it fails fast, who knows. Maybe try with a
>    VM that has exactly the same size as the failing one in production.

It fails at the start of the migration. I have two VMs with disks of 
identical size - both have a 32 GiB disk. The one that was not migrated 
to Linstor looks like this:

scsi0: local-lvm:vm-131-disk-0,cache=writeback,size=32G

The one that was migrated to Linstor (offline, via NFS), and which 
originally had size=32G, now looks like this on Linstor:

scsi0: linstor-local:vm-125-disk-1,cache=writeback,size=33555416K

33555416 KiB is 32.0009384 GiB, slightly larger than 32 GiB.
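
For what it's worth, this is how I compare the sizes; the arithmetic is 
from the values above, and the device paths are just examples from my 
setup:

# 32 GiB expressed in KiB, vs. what Proxmox now reports
echo $(( 32 * 1024 * 1024 ))      # 33554432
echo $(( 33555416 - 33554432 ))   # 984 -> LINSTOR volume is 984 KiB larger

# actual device sizes in bytes on source and target
blockdev --getsize64 /dev/pve/vm-131-disk-0
blockdev --getsize64 /dev/drbd/by-res/vm-125-disk-1/0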

> - some race and the DRBD device isn't actually ready before the
>    migration wants to write data. Maybe there is more time before a
>    disk gets used when a VM is created, compared to a migration that
>    writes existing data to a freshly created device.

I don't think so (but I may be wrong). "linstor volume list" when I 
start the live migration looks like this: https://pastebin.com/FWqbq6uK

The volume shows "InUse" at some point during the migration.
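
In case it helps with reproducing: this is roughly how I watch the 
state while the migration runs (the resource name is just an example):

watch -n1 linstor volume list
drbdadm status vm-125-disk-1   # role and disk state on this node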

> - check dmesg to see what happened on DRBD level

Here is the dmesg from a failed migration of a VM between nodes, from 
thin LVM to Linstor; it was captured on the target node: 
https://pastebin.com/rN5ZQ8vN
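
For the record, I collected it right after the failure, roughly like this:

dmesg -T | grep -i drbd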

This VM has two disks:

scsi0: local-lvm:vm-132-disk-0,cache=writeback,size=16M
scsi1: local-lvm:vm-132-disk-1,cache=writeback,size=10244M

This VM was on Linstor at some point, because its size is 10244M 
instead of the original 10G.
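
The difference is again a few MiB of rounding up, the same pattern as 
with the 32 GiB disk above (quick check, assuming Proxmox's M means MiB):

echo $(( 10244 - 10 * 1024 ))   # 4 -> 4 MiB larger than the original 10 GiB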

-- 
Best regards,
Łukasz Wąsikowski



