LXD: Unable to delete btrfs container with deleted storage

Created on 29 Sep 2018 · 4 Comments · Source: lxc/lxd

(Moved from https://github.com/lxc/lxc/issues/2654)

Required information

  • Distribution: Ubuntu
  • Distribution version: 18.04.1
  • The output of "lxc info":
config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    [OUTPUT TRUNCATED]
    -----END CERTIFICATE-----
  certificate_fingerprint: [OUTPUT TRUNCATED]
  driver: lxc
  driver_version: 3.0.2
  kernel: Linux
  kernel_architecture: x86_64
  kernel_version: 4.15.0-34-generic
  server: lxd
  server_pid: 3818
  server_version: "3.5"
  storage: ""
  storage_version: ""
  server_clustered: false
  server_name: Notebook

Issue description

There was a btrfs storage pool named local with source /var/lib/lxd/disks/local.img. It worked perfectly fine until the file /var/lib/lxd/disks/local.img was deleted outside of LXC (it cannot be recovered). The pool was used by 5 containers, and both the storage pool and the containers still appear in the output of lxc storage list and lxc list, respectively. Now neither the storage pool local nor any of its containers can be deleted in LXC.

Executing lxc delete container1 (container1 is a container in the missing local storage pool) outputs Error: no such file or directory. Using the --force flag does not change the output. There doesn't seem to be any way to remove container1 from LXC. This, I believe, is a bug or is at least undesirable behavior. Perhaps the only real negative effect is that it limits the namespace for new containers and storage pools.
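To confirm that a pool is in the state described above, one can check whether its backing image file still exists. The following is a minimal sketch; the helper name pool_image_status is hypothetical, and it assumes a loop-backed pool whose source is a regular file.

```shell
#!/bin/sh
# Hypothetical helper: report whether a loop-backed pool's image file exists.
pool_image_status() {
    image="$1"
    if [ -f "$image" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Path taken from the issue description; on a live system it could also be
# read with: lxc storage get local source
pool_image_status /var/lib/lxd/disks/local.img
```

If this prints "missing" while lxc storage list still shows the pool, the database and the disk have diverged in the way this issue describes.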

Desired behaviour

I think that, perhaps through a --delete_from_db flag, LXC could offer an option to remove any references to a container/storage pool from the LXC database.

Steps to reproduce

  1. Create a btrfs storage pool with an image file as a source.
  2. Create a container in that storage pool.
  3. Delete the source image file outside of LXC, for example using bash.
  4. Attempt to delete the container or storage pool in LXC.
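
The steps above can be sketched as a script. This is an illustration under assumptions, not a verified reproducer: the image ubuntu:18.04 and the helper names run and reproduce are hypothetical, and the pool is assumed to be created as a loop file at /var/lib/lxd/disks/<pool>.img. Because a deleted loop file stays usable while it is still attached, the failure may only become visible after a reboot or once the loop device is released. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    pool="$1"
    container="$2"
    image="$3"
    # 1. Create a loop-backed btrfs storage pool.
    run lxc storage create "$pool" btrfs
    # 2. Create a container in that pool.
    run lxc launch "$image" "$container" -s "$pool"
    # 3. Delete the backing image file behind LXD's back.
    run rm "/var/lib/lxd/disks/$pool.img"
    # 4. Try to delete the container, then the pool.
    run lxc delete "$container" --force
    run lxc storage delete "$pool"
}

reproduce local container1 ubuntu:18.04
```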

Information to attach

Executing lxc storage delete local outputs: Error: storage pool "local" has volumes attached to it.

Executing lxc storage volume list local outputs:

+-----------+------------+-------------+---------+
|   TYPE    |    NAME    | DESCRIPTION | USED BY |
+-----------+------------+-------------+---------+
| container | container1 |             | 1       |
+-----------+------------+-------------+---------+
| container | container2 |             | 1       |
+-----------+------------+-------------+---------+
| container | container3 |             | 1       |
+-----------+------------+-------------+---------+
| container | container4 |             | 1       |
+-----------+------------+-------------+---------+
| container | container5 |             | 1       |
+-----------+------------+-------------+---------+

The container storage volumes cannot be deleted outside of lxc delete: executing lxc storage volume delete local container/container1 outputs Error: storage volumes of type "container" cannot be deleted with the storage api.

Bug

All 4 comments

Hmm, so normally I'd expect lxc delete container1 to succeed; we have code that should let LXD move on if the storage volume is gone. I guess there's a bug in the btrfs driver somehow. I will have to reproduce the issue and fix it.

I couldn't reproduce the exact behavior above but I could reproduce some issues which I've now fixed in my branch.

Note that this will still not allow you to just do lxc delete container1 when you're missing the btrfs storage pool. At a minimum, you will need to slot in an empty pool so that LXD moves on:

  • truncate -s 1G /var/lib/lxd/disks/local.img
  • mkfs.btrfs /var/lib/lxd/disks/local.img

I considered changing our checks to not require this part, but the problem with doing so is that this exact same behavior may happen during boot on systems that use encrypted or external storage, and we wouldn't want a user to think they've deleted something only to have it show up on disk a minute later, causing all kinds of annoying conflicts.

I believe our current behavior of allowing an empty storage pool to be put in place of the existing one is sufficient and will generally match what would happen should a drive go bad and be replaced, or a loop file be corrupted. It lets you repair (potentially losing some data) or just outright replace the partition/file with another one, then try to use whatever was on there; anything which doesn't behave can be deleted from LXD just fine.

The code from @stgraber's last post, with some minor changes, solved my problem. My full code for deleting container1 from local was:

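lxd_dir=/var/lib/lxd
pool=local
container=container1

# Recreate an empty btrfs image where the deleted one used to be.
truncate -s 1G "$lxd_dir/disks/$pool.img"
mkfs.btrfs "$lxd_dir/disks/$pool.img"

# Mount it at the pool's mount point and add a placeholder directory for the container.
mkdir "$lxd_dir/storage-pools/$pool"
mount "$lxd_dir/disks/$pool.img" "$lxd_dir/storage-pools/$pool"
mkdir "$lxd_dir/storage-pools/$pool/$container"

lxc delete "$container"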

After deleting all containers from local, the pool can be deleted with

lxc storage delete "$pool"

# Unmount the replacement image and remove the now-empty mount point.
umount "$lxd_dir/storage-pools/$pool"
rmdir "$lxd_dir/storage-pools/$pool"
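
Since the pool held five containers, the same workaround can be wrapped in a loop. The following is a sketch only: it assumes the replacement image has already been created and mounted as in the script above, the helper names run and purge_pool are hypothetical, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# purge_pool LXD_DIR POOL CONTAINER...
purge_pool() {
    lxd_dir="$1"
    pool="$2"
    shift 2
    for container in "$@"; do
        # Placeholder directory so that the delete has something to act on.
        run mkdir "$lxd_dir/storage-pools/$pool/$container"
        run lxc delete "$container"
    done
    # With no volumes left, the pool itself can go.
    run lxc storage delete "$pool"
    run umount "$lxd_dir/storage-pools/$pool"
    run rmdir "$lxd_dir/storage-pools/$pool"
}

purge_pool /var/lib/lxd local container1 container2 container3 container4 container5
```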

I also hit another minor bug. While container1 and local were still in the database, I ran lxc launch container1 -s different_pool, which resulted in an Already exists error, yet container1 would mistakenly appear in the output of lxc storage volume list different_pool even though it wasn't created. Running lxc launch container1 -s different_pool after removing container1 from local fixes the problem: container1 no longer appears in the output of lxc storage volume list different_pool, and the name container1 is reusable again.

@mlaradji Ah, I think the extra steps above likely wouldn't have been needed if you had restarted LXD after creating the new btrfs loop file.
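
Put together, the shorter path suggested here might look like the sketch below. It is untested and rests on assumptions: the service name lxd under systemd is a guess for a package install (a snap install would need a different restart command), the helper names run and replace_pool_and_delete are hypothetical, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# replace_pool_and_delete IMAGE CONTAINER
replace_pool_and_delete() {
    image="$1"
    container="$2"
    # Slot in an empty btrfs loop file where the old one was.
    run truncate -s 1G "$image"
    run mkfs.btrfs "$image"
    # Restart LXD so it picks up the new file (service name is an assumption).
    run systemctl restart lxd
    run lxc delete "$container"
}

replace_pool_and_delete /var/lib/lxd/disks/local.img container1
```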

As for the already exists and leftover volume issue, I believe I fixed that in master a few days ago; we were basically missing a bunch of volume deletion code on error.
