Lxd: Use migration sink/source mechanism for local copies/moves

Created on 17 May 2019 · 20 Comments · Source: lxc/lxd

Required information

  • Distribution: CentOS 7
  • lxc-info
...
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  driver: lxc
  driver_version: 3.1.0
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.0.11-1.el7.elrepo.x86_64
  lxc_features:
    mount_injection_file: "true"
    network_gateway_device_route: "false"
    network_ipvlan: "false"
    network_l2proxy: "false"
    seccomp_notify: "false"
  project: default
  server: lxd
  server_clustered: false
  server_name: killer-queen
  server_pid: 31367
  server_version: "3.13"
  storage: btrfs
  storage_version: "4.4"
...

I have two storage volumes, both using btrfs. I moved an existing container from one to the other using lxc move container_name --target volume_2 container_2, and after that Docker stopped working on that machine.

On closer inspection I found that subvolumes inside the container were treated as regular directories during the move.

Steps to reproduce

Initialize two btrfs pools, e.g. C1 and C2.

  1. lxc launch -s C1 ubuntu box1
  2. In box1: btrfs su cr /testVol. Make sure subvolume is created via btrfs su sh /testVol
  3. lxc move box1 -s C2 box2
  4. In box2: btrfs su sh /testVol. No subvolume will be found.
Bug

All 20 comments

Hmm, that suggests that the move didn't use the btrfs migration codepath, I don't really remember how that logic works.

@brauner do you?

@stgraber https://github.com/lxc/lxd/blob/master/lxd/storage_btrfs.go#L2593
This looks very much like the culprit. I can check (tomorrow) whether returning an error in place of the fallback to rsync triggers it.

If it does, then there should probably be more complex decision logic, because rsync is clearly destructive to subvolumes.

> tomorrow

I have been failing to build it for half an hour now, so probably not today.

I'm not very familiar with this. Does it mean that if the source and target storage pools are both btrfs we should always use the specialized btrfs migration code instead of rsync, regardless of whether the container is running in a user namespace or not?

One thing that puzzles me a bit is that the storage.MigrationType() interface method only takes into account one storage pool and not the combination of two storage pools (source and destination). But perhaps it's just me not understanding the whole mechanics.

@freeekanayaka So the check for user namespace should stay, as I don't believe btrfs send/receive works at all if either side is running in a user namespace.

The way the migration code normally works is that one side connects to the other and sends a migration struct describing what's to be migrated and what its ideal migration handler is (in this case btrfs). The target then considers that information and assembles its own migration struct: if the handler matches, it puts that one in; if it doesn't match, it puts rsync as a generic fallback.
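
The negotiation described above could be sketched roughly as follows. This is a minimal illustration, not LXD's actual code: the names MigrationType, Offer, preferredType and negotiate are all hypothetical, and the real implementation exchanges protobuf messages rather than plain Go structs.

```go
package main

import "fmt"

// MigrationType is a hypothetical enum for this sketch; LXD's real types
// are generated from protobuf definitions and are named differently.
type MigrationType string

const (
	TypeRsync MigrationType = "rsync"
	TypeBtrfs MigrationType = "btrfs"
)

// Offer is a hypothetical stand-in for the migration struct that each side
// assembles; it only carries the handler that side would ideally use.
type Offer struct {
	Preferred MigrationType
}

// preferredType returns the handler a side would advertise. The user
// namespace check stays: btrfs send/receive is assumed not to work there,
// so that case advertises rsync.
func preferredType(driver string, inUserNS bool) MigrationType {
	if driver == "btrfs" && !inUserNS {
		return TypeBtrfs
	}
	return TypeRsync
}

// negotiate applies the rule from the comment above: the target keeps the
// source's handler only if it matches its own, otherwise it answers with
// rsync as the generic fallback.
func negotiate(source, target Offer) MigrationType {
	if source.Preferred == target.Preferred {
		return source.Preferred
	}
	return TypeRsync
}

func main() {
	btrfsHost := Offer{Preferred: preferredType("btrfs", false)}
	btrfsInUserNS := Offer{Preferred: preferredType("btrfs", true)}
	dirHost := Offer{Preferred: preferredType("dir", false)}

	fmt.Println("btrfs to btrfs:", negotiate(btrfsHost, btrfsHost))
	fmt.Println("btrfs to btrfs in userns:", negotiate(btrfsHost, btrfsInUserNS))
	fmt.Println("btrfs to dir:", negotiate(btrfsHost, dirHost))
}
```

Under these assumptions, only the case where both sides advertise btrfs keeps the btrfs handler; every mismatch ends up on rsync, which is the path that flattens nested subvolumes.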

To be fair, this entire logic is rather complex, mixes some protobuf stuff and isn't the easiest to read and wrap your head around. Let me know if something is unclear and I'll see if I remember it somehow :)

@cab404 can you confirm that both LXD servers are physical servers or virtual machines and aren't themselves containers?

@stgraber thanks for the explanation, it's pretty clear and doesn't strike me as particularly complex logic (at a high level). I assume that the actual implementation is somewhat convoluted though, and perhaps mixes some concerns/layers (since you mention protobuf). But I'd need to take a look at it.

If btrfs send/receive doesn't work at all within user namespaces, it's not obvious to me what we can do to solve this.

> can you confirm that both LXD servers are physical servers

@stgraber I am using one physical server with two btrfs pools

@freeekanayaka so based on the above, neither LXD instance is running inside a userns, so send/receive should work fine.

@stgraber and also it is installed via snap)

@stgraber hm then I'm confused. If the issue is not that LXD is running in a userns, what's going on?

@freeekanayaka so what we need to do here is:

  • Reproduce the issue ourselves: add two btrfs storage pools to an LXD instance, create a subvolume inside a container, copy the container between the two pools, and confirm the subvolume is turned into a plain old directory on the copied container
  • Check which migration sink/source was used, so we know whether the issue is that we're using rsync when we shouldn't, or that we're using btrfs send/receive but the nested subvols aren't being copied properly

@stgraber the issue is reproducible using the steps indicated by @cab404 and by you. The culprit seems to be that as part of the move we first copy the container, and when copying a btrfs container between two pools we use rsync. See:

https://github.com/lxc/lxd/blob/master/lxd/storage_btrfs.go#L1163

and

https://github.com/lxc/lxd/blob/master/lxd/storage_btrfs.go#L1134

What do you think would be the best way to fix this? It feels like it's mainly a matter of refactoring, so we could swap rsync for some equivalent btrfs-based logic that we already have in place. But I'm not familiar with the code base.

Indeed, this looks like the source of the issue. Annoyingly, it also means that moving a container between two hosts would do the right thing but doing it locally won't.

I think what we need to do is make our cross-pool copy logic use the same migration sink/source code as network migration so we can pick the best transfer mechanism based on source and target.
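
One way to picture that idea is to connect a migration source directly to a migration sink through an in-memory pipe instead of a network connection. The sketch below is only an assumption about the general shape: the Source and Sink interfaces, memSource, memSink and localCopy are hypothetical names, and the in-memory types merely stand in for something like a btrfs send stream and its receiver.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Source and Sink are hypothetical interfaces for this sketch; they are not
// LXD's real migration types.
type Source interface {
	Send(w io.Writer) error
}

type Sink interface {
	Receive(r io.Reader) error
}

// memSource stands in for a driver-specific sender (e.g. a send stream).
type memSource struct{ data []byte }

func (s memSource) Send(w io.Writer) error {
	_, err := w.Write(s.data)
	return err
}

// memSink stands in for a driver-specific receiver.
type memSink struct{ buf bytes.Buffer }

func (s *memSink) Receive(r io.Reader) error {
	_, err := io.Copy(&s.buf, r)
	return err
}

// localCopy runs the same source/sink pair that a network migration would
// use, but wires them together with io.Pipe for a copy on a single host.
func localCopy(src Source, dst Sink) error {
	pr, pw := io.Pipe()
	errCh := make(chan error, 1)

	go func() {
		err := src.Send(pw)
		// Closing the writer (with the send error, or nil on success)
		// lets the receiver see EOF or the failure.
		pw.CloseWithError(err)
		errCh <- err
	}()

	recvErr := dst.Receive(pr)
	// Closing the reader unblocks the sender if the receiver stopped early.
	pr.CloseWithError(recvErr)

	if sendErr := <-errCh; sendErr != nil {
		return sendErr
	}
	return recvErr
}

func main() {
	src := memSource{data: []byte("stand-in for a subvolume stream")}
	dst := &memSink{}
	if err := localCopy(src, dst); err != nil {
		fmt.Println("copy failed:", err)
		return
	}
	fmt.Printf("received %d bytes\n", dst.buf.Len())
}
```

The point of this shape is that the choice of transfer mechanism lives in the source/sink pair, so a local cross-pool copy and a network migration could share it.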

Ok thanks, I'll check that. I hope it won't require too much re-plumbing/refactoring.

So yeah, as @stgraber expected, this requires a bit more work than flipping a few logical switches and rewiring code paths, since the logic to run btrfs send/receive is coupled with (hard-coded into) the migration logic that we use over the network. It doesn't look too terrible, and from what I understand the protobuf part is only relevant for metadata, not for the raw btrfs data. Still, it requires a certain amount of refactoring that we might want to defer, since we're so close to release and we plan to introduce the new storage and storage driver interfaces.

Moved to 3.15

ooof

@cab404 ?

Note that this is currently being worked on with the storage rework that @tomponline is doing. We already have the dir backend using the new migration logic there which will solve this issue.

This has been done now for all drivers except CEPH (which doesn't really have that concept anyway but will be ported in 3.20).
