LXD: lxc copy with -d option does not work with snapshots

Created on 25 Nov 2020 · 9 Comments · Source: lxc/lxd

Distribution: Debian
Distribution versions: 9.13 => 10.6
Debian 9.13 (source) :
- Kernel version: 4.9.0-9-amd64
- LXC version: 4.8
- LXD version: 4.8
- Storage backend in use: dir
Debian 10.6 (destination) :
- Kernel version: 4.19.0-12-amd64
- LXC version: 4.8
- LXD version: 4.8
- Storage backend in use: btrfs

Issue description

When I copy a CT from source to destination with the --refresh and -d diskid,pool=destpoolname options, the copy never finishes (a copy normally takes about 3 minutes; I waited ~8 hours).
I can't stop the operation (even by pressing Ctrl-C 3 times); the only way I found to stop it cleanly is to add firewall rules that cut the connection between source and destination.
Only after cutting the connection did I receive an error message:

destination:~# lxc copy source:ramuh ramuh --refresh -d www-data,pool=default
*** add FW rule to stop, and wait for timeout ***
Error: Failed instance creation:
 - https://source:8443: Error transferring instance data: Failed creating instance snapshot record "ramuh/snap0": Failed initialising instance: Invalid devices: Device validation failed for "www-data": The "data" storage pool doesn't exist
 - https://sourceip:8443: Error transferring instance data: Got error reading source

source:~# lxc storage volume list data
+--------+----------+-------------+--------------+---------+
|  TYPE  |   NAME   | DESCRIPTION | CONTENT TYPE | USED BY |
+--------+----------+-------------+--------------+---------+
| custom | www-data |             | filesystem   | 2       |
+--------+----------+-------------+--------------+---------+




destination:~# lxc storage volume list default
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+
|         TYPE         |                               NAME                               | DESCRIPTION | CONTENT TYPE | USED BY |
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+
| custom               | www-data                                                         |             | filesystem   | 2       |
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+




lxc copy source:ramuh ramuh --refresh --instance-only -d www-data,pool=default

But I lost a lot of time working out what was happening, and because this is in production, it made me sweat a lot :sweat_smile:

Steps to reproduce

Initialize two LXD servers (source, destination).
Create a volume on each server, using a different pool name on each.
Create a CT on the source server and attach the volume to it.
Create a snapshot of the CT.
Copy the CT from source to destination, using the -d option to change the pool name.
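
The steps above can be sketched as a small shell script. This is only a sketch under assumptions: the image alias `images:debian/10`, the use of a `dir` pool on the source, an existing `default` pool on the destination, and a remote already registered under the name `source` are not stated in the report. The `run` helper is hypothetical and prints each command instead of executing it unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the commands against LXD.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# On the source server: pool "data", custom volume, container, snapshot.
repro_source() {
  run lxc storage create data dir
  run lxc storage volume create data www-data
  run lxc launch images:debian/10 ramuh   # image alias is an assumption
  run lxc config device add ramuh www-data disk pool=data source=www-data path=/srv/www-data
  run lxc snapshot ramuh snap0
}

# On the destination server: same volume name, but in a pool called "default".
# Assumes the "default" pool exists and the remote "source" is already added.
repro_destination() {
  run lxc storage volume create default www-data
  run lxc copy source:ramuh ramuh --refresh -d www-data,pool=default
}
```

Calling `repro_source` on the source and `repro_destination` on the destination prints the command sequence, so it can be reviewed before running it for real with `DRY_RUN=0`.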

Please ask me if you need more information.

Source

olivier-lz

All 9 comments

Hmm, I thought @tomponline fixed this a week or so ago.

stgraber on 25 Nov 2020

@stgraber @olivier-lz yes it does sound very similar to https://github.com/lxc/lxd/pull/8161 I'll take a look.

tomponline on 25 Nov 2020

Starting on this now.

tomponline on 26 Nov 2020

👍1

So the difference compared to #8161 is that this instance's config is invalid rather than its snapshot. I believe I'm going to need to update the instance creation code to differentiate between user requested and implicit instance creation so we can ignore device validation issues during creation in the latter scenario.

tomponline on 26 Nov 2020

Ah actually there are 2 problems. The specific issue regarding invalid snapshots is that this is a different check that was failing and being run as part of the profile validation before the instance was initialised. PR incoming for that shortly.

@stgraber for instances (not snapshots) with attached disks that fail validation should we prevent the migration?

tomponline on 26 Nov 2020

@tomponline I think it's fine to fail instances since they can get overridden easily enough.

stgraber on 26 Nov 2020

👍2

I'm not sure I understand what is invalid in the configuration; the device is the same in the snapshot and in the CT.

source:~# lxc info ramuh
Name: ramuh
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/11/21 23:21 UTC
Status: Running
Type: container
Profiles: default
[...]
Snapshots:
  snap0 (taken at 2020/11/24 10:39 UTC) (stateless)

source:~# lxc config show ramuh
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian buster amd64 (20201113_05:24)
  image.os: Debian
  image.release: buster
  image.serial: "20201113_05:24"
  image.type: squashfs
  image.variant: default
  [...]
devices:
  www-data:
    path: /srv/www-data
    pool: data
    source: www-data
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

source:~# lxc config show ramuh/snap0
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian buster amd64 (20201113_05:24)
  image.os: Debian
  image.release: buster
  image.serial: "20201113_05:24"
  image.type: squashfs
  image.variant: default
  [...]
devices:
  www-data:
    path: /srv/www-data
    pool: data
    source: www-data
    type: disk
ephemeral: false
profiles:
- default
expires_at: 0001-01-01T00:00:00Z

On the destination, the pool name is not data but default.

olivier-lz on 26 Nov 2020

Yes, but overriding devices does not override snapshots as they are readonly. So the device in the snapshot remains using the source pool which is missing.
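
One way to see this before copying is to list the pool each device references in the instance and in each of its snapshots. The helper below is a hypothetical sketch, not an LXD feature: `device_pools` is an illustrative name, and it assumes the two-space YAML indentation that `lxc config show` prints in the output above.

```shell
#!/bin/sh
# Read an `lxc config show` YAML dump on stdin and print "<device> <pool>"
# for every device that has a "pool" key.
device_pools() {
  awk '
    /^devices:/              { in_dev = 1; next }   # enter the devices section
    /^[^ ]/                  { in_dev = 0 }         # any other top-level key ends it
    in_dev && /^  [^ ].*:$/  { name = $1; sub(/:$/, "", name) }
    in_dev && /^    pool:/   { print name, $2 }
  '
}
```

For example, `lxc config show ramuh/snap0 | device_pools` can be compared with the pool names shown by `lxc storage list` on the destination. Any pool that appears only in the snapshot output is one that the -d override will not replace.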

tomponline on 26 Nov 2020

👍1

Read-only was the key to my misunderstanding :smile:

olivier-lz on 26 Nov 2020
