Lxd: lxc delete command fails "directory not empty"

Created on 6 Dec 2019 · 4 Comments · Source: lxc/lxd

Required information

Distribution: Ubuntu
Distribution version: Bionic 18.04
The output of "lxc info":

```
config:
  core.https_address: REDACTED
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - REDACTED:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    REDACTED
    -----END CERTIFICATE-----
  certificate_fingerprint: REDACTED
  driver: lxc
  driver_version: 3.2.1
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.0.0-1021-gcp
  lxc_features:
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    seccomp_notify: "true"
  project: default
  server: lxd
  server_clustered: false
  server_name: REDACTED
  server_pid: 34441
  server_version: "3.18"
  storage: zfs
  storage_version: 0.8.2-2~18.04.york0
```

Issue description

I think this is probably two bugs. I have no idea how to reproduce the first, but I'll include it because it's important to the setup:

Occasionally, an lxc delete <container> seems to fail partway. The ZFS dataset is destroyed and the only thing left is an empty dataset under snapshots, but the container remains in LXD's database in the "STOPPED" state. In most cases a subsequent lxc delete <container> cleans things up without issues.

However, lately we've hit a further issue (the one this report is about) where the subsequent lxc delete <container> fails as well. I think this is because the dataset is destroyed and unmounted, but LXD has dropped a backup.yaml file in the container's directory. I suspect (I have not checked the code) that LXD doesn't check whether this directory is empty; it only checks that the dataset is unmounted and then tries to remove the directory, which fails because it isn't empty.
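
The suspected failure mode can be illustrated without LXD or ZFS. The shell sketch below only models the hypothesis: it builds a scratch directory with mktemp (the demo_root variable and its layout are made up for the demo) and shows that removing a directory fails while a leftover backup.yaml sits inside it, then succeeds once the file is gone. It says nothing about what LXD's code actually does.

```shell
# Simulate the hypothesised state in a scratch directory (no LXD or ZFS involved).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/containers/aaa-container"
touch "$demo_root/containers/aaa-container/backup.yaml"

# rmdir refuses to remove a directory that still has entries ...
if rmdir "$demo_root/containers/aaa-container" 2>/dev/null; then
    first_attempt="removed"
else
    first_attempt="failed"
fi

# ... but succeeds once the leftover file has been deleted.
rm "$demo_root/containers/aaa-container/backup.yaml"
if rmdir "$demo_root/containers/aaa-container" 2>/dev/null; then
    second_attempt="removed"
else
    second_attempt="failed"
fi

echo "first attempt: $first_attempt, second attempt: $second_attempt"
rm -rf "$demo_root"
```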

It'd be great if, until the former issue is tracked down (working on it), LXD handled this situation gracefully. At the moment there's no way LXD can recover on its own: someone has to shell in, check that everything is correct (the container really doesn't exist any more), remove the file, and re-issue the delete command.

Any ideas on how to track down the first issue would be appreciated too, but I'll keep trying to figure it out.

Steps to reproduce

I don't really have good steps to reproduce (I can't work out how to get into the first situation, or I'd file a bug for that too), but here's the flow on an affected server:

```
root@lxd:~# zfs list | grep aaa-container
lxd/snapshots/aaa-container                                96K  1.27T       96K  none
root@lxd:~# lxc delete aaa-container
Error: remove /var/snap/lxd/common/lxd/storage-pools/default/containers/aaa-container: directory not empty
root@lxd:~# ls /var/snap/lxd/common/lxd/storage-pools/default/containers/aaa-container/
backup.yaml
root@lxd:~# rm /var/snap/lxd/common/lxd/storage-pools/default/containers/aaa-container/backup.yaml
root@lxd:~# lxc delete aaa-container
root@lxd:~#
```
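
For anyone repeating this workaround by hand, a guarded version of the rm step may be safer. The sketch below is not part of LXD: remove_stale_backup is a hypothetical helper name, and it assumes the only safe case is a directory whose sole entry is backup.yaml. If anything else is present, or the directory is already empty or missing, it leaves everything untouched and returns non-zero.

```shell
# Hypothetical helper (not an LXD command): delete backup.yaml from a container
# directory only when it is the sole entry there.
# Returns 0 if the file was removed, 1 if nothing was changed.
remove_stale_backup() {
    dir="$1"
    # Refuse to act on anything that is not an existing directory.
    [ -d "$dir" ] || return 1
    # List every entry, including dotfiles, but not "." and "..".
    entries="$(ls -A "$dir")"
    if [ "$entries" = "backup.yaml" ]; then
        rm -- "$dir/backup.yaml"
        return 0
    fi
    # Empty directory or unexpected extra content: leave it alone.
    return 1
}

# Example use on an affected host (path taken from the error message above):
#   dir=/var/snap/lxd/common/lxd/storage-pools/default/containers/aaa-container
#   remove_stale_backup "$dir" && lxc delete aaa-container
```

The helper deliberately does not check ZFS state. Confirming that the container's dataset is really gone, as in the zfs list step of the session, is still a manual check to do first.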

Information to attach

I don't think any of this information is relevant: there are no container logs or anything else because the container is deleted. Let me know if that assumption is incorrect.

All 4 comments

https://github.com/lxc/lxd/pull/6560/commits/7199afba981ece28b40d5230e832307f3b3e0823 in https://github.com/lxc/lxd/pull/6560 handles this type of race. So we've literally written a fix for this accidentally earlier today :)

3.19 will have a completely rewritten storage layer so any existing storage bug will most likely be gone, possibly replaced by new, different bugs (as tends to happen when replacing such a large piece of code).

ACK, so should I leave this open, or close it and see if the behaviour shows up again in 3.19?

I'll close it when I merge 6560
