LXD: lxc copy with the -d option does not work with snapshots

Created on 25 Nov 2020 · 9 Comments · Source: lxc/lxd

  • Distribution: Debian
  • Distribution versions: 9.13 => 10.6
  • Debian 9.13 (source) :

    • Kernel version: 4.9.0-9-amd64

    • LXC version: 4.8

    • LXD version: 4.8

    • Storage backend in use: dir

  • Debian 10.6 (destination) :

    • Kernel version: 4.19.0-12-amd64

    • LXC version: 4.8

    • LXD version: 4.8

    • Storage backend in use: btrfs

Issue description

When I copy a CT from source to destination with the --refresh and -d diskid,pool=destpoolname options, the copy never finishes (a copy normally takes about 3 minutes; I waited ~8 hours).
I can't stop the operation (even by pressing Ctrl-C three times); the only way I found to stop it cleanly is to add firewall rules that cut the connection between source and destination.
Only after cutting the connection did I receive an error message:

destination:~# lxc copy source:ramuh ramuh --refresh -d www-data,pool=default
*** add FW rule to stop, and wait for timeout ***
Error: Failed instance creation:
 - https://source:8443: Error transferring instance data: Failed creating instance snapshot record "ramuh/snap0": Failed initialising instance: Invalid devices: Device validation failed for "www-data": The "data" storage pool doesn't exist
 - https://sourceip:8443: Error transferring instance data: Got error reading source
source:~# lxc storage volume list data
+--------+----------+-------------+--------------+---------+
|  TYPE  |   NAME   | DESCRIPTION | CONTENT TYPE | USED BY |
+--------+----------+-------------+--------------+---------+
| custom | www-data |             | filesystem   | 2       |
+--------+----------+-------------+--------------+---------+






destination:~# lxc storage volume list default
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+
|         TYPE         |                               NAME                               | DESCRIPTION | CONTENT TYPE | USED BY |
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+
| custom               | www-data                                                         |             | filesystem   | 2       |
+----------------------+------------------------------------------------------------------+-------------+--------------+---------+



Copying without the snapshots, using --instance-only, works:



lxc copy source:ramuh ramuh --refresh --instance-only -d www-data,pool=default

But I lost a lot of time working out what was happening, and because this is in production, I sweated a lot :sweat_smile:

Steps to reproduce

  1. Initialize two LXD servers (source, destination)
  2. Create a volume on each server, in pools with different names
  3. Create a CT on the source server and attach the volume
  4. Create a snapshot of the CT
  5. Copy the CT from source to destination, with the -d option to change the pool name

Please ask me if you need more information.
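
The steps above can be sketched as a dry-run script. This is only an illustration under assumptions that are not in the report: a remote named "source" has already been added with lxc remote add, the pools "data" (source) and "default" (destination) already exist, and the image alias images:debian/10 and the instance name c1 are placeholders. The commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps. Nothing is executed unless RUN=1.
# Assumed, not from the report: remote "source", pools "data" and "default",
# image alias images:debian/10, instance name "c1".

run() {
    # Print the command; execute it only when RUN=1 is set.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

reproduce() {
    # Step 2: a volume on each server, in pools with different names.
    run lxc storage volume create source:data www-data
    run lxc storage volume create default www-data
    # Step 3: a container on the source with the volume attached.
    run lxc launch images:debian/10 source:c1
    run lxc config device add source:c1 www-data disk pool=data source=www-data path=/srv/www-data
    # Step 4: a snapshot of the container.
    run lxc snapshot source:c1 snap0
    # Step 5: copy to the destination, overriding the pool of the disk device.
    run lxc copy source:c1 c1 --refresh -d www-data,pool=default
}

reproduce
```

Run as-is, the script prints the six commands prefixed with "+" so they can be reviewed first; on affected versions the last command is the one expected to hang.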

All 9 comments

Hmm, I thought @tomponline fixed this a week or so ago.

@stgraber @olivier-lz yes, it does sound very similar to https://github.com/lxc/lxd/pull/8161. I'll take a look.

Starting on this now.

So the difference compared to #8161 is that this instance's config is invalid rather than its snapshot. I believe I'm going to need to update the instance creation code to differentiate between user requested and implicit instance creation so we can ignore device validation issues during creation in the latter scenario.

Ah actually there are 2 problems. The specific issue regarding invalid snapshots is that this is a different check that was failing and being run as part of the profile validation before the instance was initialised. PR incoming for that shortly.

@stgraber for instances (not snapshots) with attached disks that fail validation, should we prevent the migration?

@tomponline I think it's fine to fail instances since they can get overridden easily enough.

I'm not sure I understand what is invalid in the configuration; the device is the same in the snapshot and in the CT.

source:~# lxc info ramuh
Name: ramuh
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/11/21 23:21 UTC
Status: Running
Type: container
Profiles: default
[...]
Snapshots:
  snap0 (taken at 2020/11/24 10:39 UTC) (stateless)
source:~# lxc config show ramuh
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian buster amd64 (20201113_05:24)
  image.os: Debian
  image.release: buster
  image.serial: "20201113_05:24"
  image.type: squashfs
  image.variant: default
  [...]
devices:
  www-data:
    path: /srv/www-data
    pool: data
    source: www-data
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
source:~# lxc config show ramuh/snap0
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Debian buster amd64 (20201113_05:24)
  image.os: Debian
  image.release: buster
  image.serial: "20201113_05:24"
  image.type: squashfs
  image.variant: default
  [...]
devices:
  www-data:
    path: /srv/www-data
    pool: data
    source: www-data
    type: disk
ephemeral: false
profiles:
- default
expires_at: 0001-01-01T00:00:00Z

On the destination, the pool name is not data but default.

Yes, but overriding devices does not override snapshots as they are readonly. So the device in the snapshot remains using the source pool which is missing.
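
To illustrate the point, here is a minimal sketch of a pre-copy check. It reads the YAML printed by lxc config show on stdin and reports device pools that are not among the destination pools passed as arguments. The helper name missing_pools is hypothetical (it is not part of LXD), and the parsing is deliberately naive: it only looks at "pool:" keys under the top-level "devices:" section.

```shell
#!/bin/sh
# missing_pools POOL...
# Hypothetical helper: read `lxc config show` YAML on stdin and print each
# device pool name that is not in the list of destination pools given as
# arguments (one name per line, no duplicates).
missing_pools() {
    awk -v have="$*" '
        BEGIN { n = split(have, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
        /^devices:/ { in_dev = 1; next }   # entering the devices section
        /^[^ ]/     { in_dev = 0 }         # any other top-level key ends it
        in_dev && $1 == "pool:" && !($2 in ok) && !seen[$2]++ { print $2 }
    '
}

# Example: the device section of ramuh/snap0 from this issue, checked against
# a destination that only has the "default" pool. Prints: data
missing_pools default <<'EOF'
devices:
  www-data:
    path: /srv/www-data
    pool: data
    source: www-data
    type: disk
ephemeral: false
EOF
```

Against a live server, the same check could be run as lxc config show source:ramuh/snap0 | missing_pools default. Any pool it prints is one that the snapshot still references after -d has overridden the instance's own device.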

readonly was the key to my misunderstanding :smile:
