[DRBD-user] Proxmox with Linstor: Online migration / disk move problem

Tue Nov 23 08:41:05 CET 2021

On Mon, Nov 22, 2021 at 03:28:10PM +0100, Łukasz Wąsikowski wrote:
> Hi,
> 
> I'm trying to migrate VM storage to Linstor SDS and have some odd troubles.
> All nodes are running Proxmox VE 7.1:
> 
> pve-manager/7.1-5/6fe299a0 (running kernel: 5.13.19-1-pve)
> 
> Linstor storage is, for now, on one host. When I create a new VM on Linstor
> it works. When I try to migrate a VM from another host (and another storage)
> to Linstor it fails:
> 
> 2021-11-22 13:06:53 starting migration of VM 116 to node 'proxmox-ve3'
> (192.168.8.203)
> 2021-11-22 13:06:53 found local disk 'local-lvm:vm-116-disk-0' (in current
> VM config)
> 2021-11-22 13:06:53 starting VM 116 on remote node 'proxmox-ve3'
> 2021-11-22 13:07:01 volume 'local-lvm:vm-116-disk-0' is
> 'linstor-local:vm-116-disk-1' on the target
> 2021-11-22 13:07:01 start remote tunnel
> 2021-11-22 13:07:03 ssh tunnel ver 1
> 2021-11-22 13:07:03 starting storage migration
> 2021-11-22 13:07:03 scsi1: start migration to
> nbd:unix:/run/qemu-server/116_nbd.migrate:exportname=drive-scsi1
> drive mirror is starting for drive-scsi1 with bandwidth limit: 51200 KB/s
> drive-scsi1: Cancelling block job
> drive-scsi1: Done.
> 2021-11-22 13:07:03 ERROR: online migrate failure - block job (mirror)
> error: drive-scsi1: 'mirror' has been cancelled
> 2021-11-22 13:07:03 aborting phase 2 - cleanup resources
> 2021-11-22 13:07:03 migrate_cancel
> 2021-11-22 13:07:08 ERROR: migration finished with problems (duration
> 00:00:16)
> TASK ERROR: migration problems
> 
> Linstor volumes are created during migration, no errors in its logs. I
> don't know why Proxmox is cancelling this job.
> 
> When I try to move disk from NFS to Linstor (online) it fails:
> 
> create full clone of drive scsi0 (nfs-backup:129/vm-129-disk-0.qcow2)
> 
> NOTICE
> Trying to create diskful resource (vm-129-disk-1) on (proxmox-ve3).
> drive mirror is starting for drive-scsi0 with bandwidth limit: 51200 KB/s
> drive-scsi0: Cancelling block job
> drive-scsi0: Done.
> TASK ERROR: storage migration failed: block job (mirror) error: drive-scsi0:
> 'mirror' has been cancelled
> 
> 
> To move storage to Linstor I first have to move it to NFS (online), turn off
> the VM and move the VM storage offline to Linstor. The bizarre thing is that
> once I do that, I can move this particular VM's storage from Linstor to NFS
> online and from NFS to Linstor online. I can also migrate the VM online, from
> Linstor, directly to another node and another storage without problems.
> 
> I've set up a test cluster to reproduce this problem and couldn't - online
> migration to Linstor storage just worked. I don't know why it's not working
> on the main cluster - any hints on how to debug it?

Hi Łukasz,

I have heard of that once before, but never experienced it myself, and so
far no customers have complained, so I did not dive into it.

If you can reproduce it, that would be highly appreciated. To me it
looks like the plugin and LINSTOR basically did their job, but then
something else happens. These are just random thoughts that might be
complete nonsense:

- maybe some size rounding error and the resulting DRBD device is just a
  tiny bit too small. If you can reproduce it, I would check the sizes of
  source/destination. If the mirror only fails once it hits the end of a
  too-small device, it should have written data before that. So does it
  take some time until it fails? Do you see that some data was written
  at the beginning of the DRBD block device that matches the source? But
  maybe there is already a size check at the beginning and it fails
  fast, who knows. Maybe try with a VM that has exactly the same size as
  the failing one in production.
- some race and the DRBD device isn't actually ready before the
  migration wants to write data. Maybe there is more time before a
  disk gets used when a VM is created vs. when existing data is written
  to a freshly created device + migration.
- check dmesg to see what happened on DRBD level
- start grepping for the error msgs in pve/pve-storage to see when and
  why these errors happen. Find out what tool/function gets called, and
  then manually call that tool several times in some "linstor spawn &&
  $magic_tool" to trigger a race (if there is one).
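
For the size idea above, a minimal sketch of how one could compare the two
devices, assuming both are raw block devices (so the local-lvm case, not
the qcow2 on NFS) and that you run it as root on a node that sees both.
The function names are made up for illustration and are not part of
Proxmox, the plugin or LINSTOR:

```python
import os


def device_size(path):
    """Return the size in bytes of a block device or a regular file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size for block devices too.
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)


def report(src, dst):
    """Compare sizes; a shortfall > 0 means the target is too small."""
    s, d = device_size(src), device_size(dst)
    return {"source": s, "target": d, "shortfall": max(0, s - d)}


def first_mismatch(src, dst, length, chunk=1 << 20):
    """Return the offset of the first differing byte within the first
    `length` bytes of src and dst, or None if they are identical there.
    If one side ends early, the offset where it ends is returned."""
    with open(src, "rb") as a, open(dst, "rb") as b:
        offset = 0
        while offset < length:
            want = min(chunk, length - offset)
            da, db = a.read(want), b.read(want)
            if da != db:
                for i, (x, y) in enumerate(zip(da, db)):
                    if x != y:
                        return offset + i
                # Common prefix is equal, so one side is shorter.
                return offset + min(len(da), len(db))
            if not da:
                break  # both sides ended before `length`
            offset += len(da)
    return None
```

Call report() with the source LV and the DRBD device of the freshly
created resource (the actual paths depend on your setup). A non-zero
"shortfall" would point at the rounding theory. After a failed attempt,
first_mismatch() on the first few MiB gives a rough idea whether the
mirror wrote anything before it got cancelled.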

HTH, rck