When I copy a CT from source to destination with the --refresh and -d diskid,pool=destpoolname options, the copy never finishes (a copy normally takes about 3 minutes; I waited ~8 hours).
I can't stop the operation (even by pressing Ctrl-C 3 times); the only way I found to stop it cleanly is to add FW rules that cut the connection between source and destination.
Only after cutting the connection did I receive an error message:
destination:~# lxc copy source:ramuh ramuh --refresh -d www-data,pool=default
*** add FW rule to stop, and wait for timeout ***
Error: Failed instance creation:
- https://source:8443: Error transferring instance data: Failed creating instance snapshot record "ramuh/snap0": Failed initialising instance: Invalid devices: Device validation failed for "www-data": The "data" storage pool doesn't exist
- https://sourceip:8443: Error transferring instance data: Got error reading source
source:~# lxc storage volume list data
+--------+----------+-------------+--------------+---------+
|  TYPE  |   NAME   | DESCRIPTION | CONTENT TYPE | USED BY |
+--------+----------+-------------+--------------+---------+
| custom | www-data |             | filesystem   | 2       |
+--------+----------+-------------+--------------+---------+
destination:~# lxc storage volume list default
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+
|         TYPE         |                               NAME                               | DESCRIPTION | CONTENT TYPE | USED BY |
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+
| custom               | www-data                                                         |             | filesystem   | 2       |
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+
lxc copy source:ramuh ramuh --refresh --instance-only -d www-data,pool=default
But I lost a lot of time working out what was happening, and because this is in production, I sweated a lot :sweat_smile:
- Two servers (source, destination)
- A volume with a different pool name on the two servers
- A CT on source, with the volume added
- A snapshot of the CT
- Copy from source to destination, with the -d option to change the pool name

Please ask me if you need more information.
Hmm, I thought @tomponline fixed this a week or so ago.
@stgraber @olivier-lz yes it does sound very similar to https://github.com/lxc/lxd/pull/8161 I'll take a look.
Starting on this now.
So the difference compared to #8161 is that this instance's config is invalid rather than its snapshot. I believe I'm going to need to update the instance creation code to differentiate between user requested and implicit instance creation so we can ignore device validation issues during creation in the latter scenario.
Ah actually there are 2 problems. The specific issue regarding invalid snapshots is that this is a different check that was failing and being run as part of the profile validation before the instance was initialised. PR incoming for that shortly.
@stgraber for instances (not snapshots) with attached disks that fail validation, should we prevent the migration?
@tomponline I think it's fine to fail instances since they can get overridden easily enough.
I'm not sure I understand what is invalid in the configuration; the device is the same in the snapshot and in the CT.
source:~# lxc info ramuh
Name: ramuh
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/11/21 23:21 UTC
Status: Running
Type: container
Profiles: default
[...]
Snapshots:
snap0 (taken at 2020/11/24 10:39 UTC) (stateless)
source:~# lxc config show ramuh
architecture: x86_64
config:
image.architecture: amd64
image.description: Debian buster amd64 (20201113_05:24)
image.os: Debian
image.release: buster
image.serial: "20201113_05:24"
image.type: squashfs
image.variant: default
[...]
devices:
www-data:
path: /srv/www-data
pool: data
source: www-data
type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
source:~# lxc config show ramuh/snap0
architecture: x86_64
config:
image.architecture: amd64
image.description: Debian buster amd64 (20201113_05:24)
image.os: Debian
image.release: buster
image.serial: "20201113_05:24"
image.type: squashfs
image.variant: default
[...]
devices:
www-data:
path: /srv/www-data
pool: data
source: www-data
type: disk
ephemeral: false
profiles:
- default
expires_at: 0001-01-01T00:00:00Z
On the destination, the pool name is not data but default.
Yes, but overriding devices does not override snapshots as they are readonly. So the device in the snapshot keeps using the source pool, which is missing on the destination.
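To illustrate, here is a toy model in Python of why the snapshot still fails validation. It is only a sketch and not LXD's actual code: the helper names `apply_device_override` and `missing_pools` are made up for this example, and it assumes the destination only has a pool called `default`. The device values are the ones shown in the `lxc config show` output above.

```python
# Toy model of the behaviour described in this thread; NOT LXD's real code.
# apply_device_override and missing_pools are hypothetical helper names.
from copy import deepcopy


def apply_device_override(devices, name, **overrides):
    """Return a copy of `devices` with `overrides` merged into device `name`."""
    updated = deepcopy(devices)
    updated[name].update(overrides)
    return updated


def missing_pools(devices, destination_pools):
    """List (device, pool) pairs whose pool does not exist on the destination."""
    return [
        (name, dev["pool"])
        for name, dev in devices.items()
        if dev.get("type") == "disk"
        and "pool" in dev
        and dev["pool"] not in destination_pools
    ]


# Device config as shown by `lxc config show ramuh` and `ramuh/snap0`.
www_data = {
    "path": "/srv/www-data",
    "pool": "data",
    "source": "www-data",
    "type": "disk",
}
instance_devices = {"www-data": dict(www_data)}
snapshot_devices = {"www-data": dict(www_data)}

# Assumption: the destination has no pool named "data", only "default".
destination_pools = {"default"}

# `-d www-data,pool=default` rewrites the instance's device only;
# the snapshot is read-only, so its device config is left untouched.
instance_devices = apply_device_override(instance_devices, "www-data", pool="default")

print(missing_pools(instance_devices, destination_pools))  # []
print(missing_pools(snapshot_devices, destination_pools))  # [('www-data', 'data')]
```

The second result corresponds to the `The "data" storage pool doesn't exist` error reported for `ramuh/snap0`, and it is why copying with `--instance-only` (no snapshots) avoids the problem.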
readonly is the key of my misunderstanding :smile: