[DRBD-user] Proxmox with Linstor: Online migration / disk move problem
Roland Kammerer
roland.kammerer at linbit.com
Tue Nov 23 08:41:05 CET 2021
On Mon, Nov 22, 2021 at 03:28:10PM +0100, Łukasz Wąsikowski wrote:
> Hi,
>
> I'm trying to migrate VM storage to Linstor SDS and am having some odd trouble.
> All nodes are running Proxmox VE 7.1:
>
> pve-manager/7.1-5/6fe299a0 (running kernel: 5.13.19-1-pve)
>
> Linstor storage is, for now, on one host. When I create a new VM on Linstor it
> works. When I try to migrate a VM from another host (and another storage) to
> Linstor, it fails:
>
> 2021-11-22 13:06:53 starting migration of VM 116 to node 'proxmox-ve3'
> (192.168.8.203)
> 2021-11-22 13:06:53 found local disk 'local-lvm:vm-116-disk-0' (in current
> VM config)
> 2021-11-22 13:06:53 starting VM 116 on remote node 'proxmox-ve3'
> 2021-11-22 13:07:01 volume 'local-lvm:vm-116-disk-0' is
> 'linstor-local:vm-116-disk-1' on the target
> 2021-11-22 13:07:01 start remote tunnel
> 2021-11-22 13:07:03 ssh tunnel ver 1
> 2021-11-22 13:07:03 starting storage migration
> 2021-11-22 13:07:03 scsi1: start migration to
> nbd:unix:/run/qemu-server/116_nbd.migrate:exportname=drive-scsi1
> drive mirror is starting for drive-scsi1 with bandwidth limit: 51200 KB/s
> drive-scsi1: Cancelling block job
> drive-scsi1: Done.
> 2021-11-22 13:07:03 ERROR: online migrate failure - block job (mirror)
> error: drive-scsi1: 'mirror' has been cancelled
> 2021-11-22 13:07:03 aborting phase 2 - cleanup resources
> 2021-11-22 13:07:03 migrate_cancel
> 2021-11-22 13:07:08 ERROR: migration finished with problems (duration
> 00:00:16)
> TASK ERROR: migration problems
>
> Linstor volumes are created during the migration, and there are no errors in
> its logs. I don't know why Proxmox is cancelling this job.
>
> When I try to move a disk from NFS to Linstor (online), it fails:
>
> create full clone of drive scsi0 (nfs-backup:129/vm-129-disk-0.qcow2)
>
> NOTICE
> Trying to create diskful resource (vm-129-disk-1) on (proxmox-ve3).
> drive mirror is starting for drive-scsi0 with bandwidth limit: 51200 KB/s
> drive-scsi0: Cancelling block job
> drive-scsi0: Done.
> TASK ERROR: storage migration failed: block job (mirror) error: drive-scsi0:
> 'mirror' has been cancelled
>
>
> To move storage to Linstor I first have to move it to NFS (online), turn off
> the VM, and move the VM storage offline to Linstor. The bizarre thing is that
> once I do that, I can move this particular VM's storage from Linstor to NFS
> online and from NFS to Linstor online. I can also migrate the VM online, from
> Linstor, directly to another node and another storage without problems.
>
> I've set up a test cluster to reproduce this problem and couldn't - online
> migration to Linstor storage just worked. I don't know why it's not working
> on the main cluster - any hints on how to debug it?
Hi Łukasz,
I have heard of that once before, but never experienced it myself, and so far
no customers have complained, so I did not dive into it.
If you can reproduce it, that would be highly appreciated. To me it looks
like the plugin and LINSTOR basically did their job, but then something else
goes wrong. These are just random thoughts that might be complete nonsense:
- maybe some size rounding error and the resulting DRBD device is just a
  tiny bit too small. If you can reproduce it, I would check the sizes of
  source and destination (a rough size-check sketch follows this list). If
  the mirror only fails at the end, some data should have been written, so:
  does it take some time until it fails? Do you see data at the beginning of
  the DRBD block device that matches the source? But maybe there is already
  a size check at the beginning and it fails fast, who knows. Maybe try with
  a VM that has exactly the same size as the failing one in production.
- some race and the DRBD device isn't actually ready before the migration
  wants to write data. Maybe there is more time before the disk gets used
  when a VM is freshly created, compared to a migration where existing data
  is written to the freshly created device right away.
- check dmesg to see what happened at the DRBD level
- start grepping for the error messages in pve/pve-storage to see when and
  why these errors happen, and which tool/function gets called. Then manually
  call that tool several times in some "linstor spawn && $magic_tool" loop to
  try to trigger a race, if there is one (a rough loop sketch follows this
  list).
HTH, rck