LXD: lxc delete command fails with "directory not empty"

Created on 6 Dec 2019 · 4 Comments · Source: lxc/lxd

Required information

Distribution: Ubuntu
Distribution version: Bionic 18.04
The output of "lxc info":

```
config:
  core.https_address: REDACTED
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - REDACTED:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    REDACTED
    -----END CERTIFICATE-----
  certificate_fingerprint: REDACTED
  driver: lxc
  driver_version: 3.2.1
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.0.0-1021-gcp
  lxc_features:
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    seccomp_notify: "true"
  project: default
  server: lxd
  server_clustered: false
  server_name: REDACTED
  server_pid: 34441
  server_version: "3.18"
  storage: zfs
  storage_version: 0.8.2-2~18.04.york0
```

Issue description

I think this is probably two bugs. I don't have any idea how to reproduce the first, so I'll just describe it, as it's important to the setup:

Occasionally, it seems an lxc delete <container> can fail. The ZFS dataset is destroyed, the only thing left is an empty dataset under snapshots, but the container remains present in LXD's database in the "STOPPED" state. In most cases a subsequent lxc delete <container> cleans things up without issues.

However, lately we've had a further issue (the one this report is about) where the subsequent lxc delete <container> fails as well. I think this is because the dataset is destroyed and unmounted, but LXD is dropping a backup.yaml file in the container's directory. I suspect (I have not checked the code) that LXD doesn't check whether this directory is empty; it only checks that the dataset is unmounted and then tries to unlink the directory, which fails because it's not empty.
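
The suspected failure mode can be reproduced outside LXD. The following is a minimal Python sketch, not LXD's actual code (LXD is written in Go, and the code path has not been checked here). `remove_mountpoint` and `make_stale_dir` are hypothetical helpers: the first attempts a plain directory removal, which the operating system rejects with ENOTEMPTY when a stray file is present, and only falls back to a recursive removal if the caller opts in.

```python
import errno
import os
import shutil
import tempfile


def remove_mountpoint(path, tolerate_leftovers=False):
    """Remove a directory that is expected to be an empty mountpoint.

    Hypothetical helper for illustration only; it is not part of LXD.
    """
    try:
        # Plain removal: succeeds only if the directory is empty.
        os.rmdir(path)
    except OSError as exc:
        # Re-raise anything other than "directory not empty",
        # and also when the caller did not opt in to cleanup.
        if exc.errno != errno.ENOTEMPTY or not tolerate_leftovers:
            raise
        # Recursive removal: deletes leftover files with the directory.
        shutil.rmtree(path)


def make_stale_dir():
    """Create a temporary directory holding only a stray backup.yaml."""
    path = tempfile.mkdtemp()
    with open(os.path.join(path, "backup.yaml"), "w") as handle:
        handle.write("# leftover file\n")
    return path


if __name__ == "__main__":
    stale = make_stale_dir()
    try:
        remove_mountpoint(stale)
    except OSError as exc:
        print("strict removal failed:", exc.strerror)
    remove_mountpoint(stale, tolerate_leftovers=True)
    print("still exists after tolerant removal:", os.path.exists(stale))
```

Whether a recursive removal is the right behaviour for LXD itself is a separate question; the sketch only shows why a strict removal cannot succeed while backup.yaml is still in the directory.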

It'd be great if, until the former issue is tracked down (working on it), LXD gracefully handled this situation... because at the moment with this issue there's no way LXD can recover on its own and someone has to shell in, check everything is correct (the container really doesn't exist any more), then remove the file and re-issue the delete command.

Any ideas on how to track down the first issue would be appreciated too, but I'll keep trying to figure it out.

Steps to reproduce

I don't really have good steps to reproduce (can't work out how to get into the first situation or I'd file a bug for that too), but here's the flow on an affected server:

```
root@lxd:~# zfs list | grep aaa-container
lxd/snapshots/aaa-container                                96K  1.27T       96K  none
root@lxd:~# lxc delete aaa-container
Error: remove /var/snap/lxd/common/lxd/storage-pools/default/containers/aaa-container: directory not empty
root@lxd:~# ls /var/snap/lxd/common/lxd/storage-pools/default/containers/aaa-container/
backup.yaml
root@lxd:~# rm /var/snap/lxd/common/lxd/storage-pools/default/containers/aaa-container/backup.yaml
root@lxd:~# lxc delete aaa-container
root@lxd:~#
```
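
The manual recovery in the transcript can be guarded by a small check before the stray file is deleted. This is a hedged sketch in Python, assuming the only acceptable leftover is a single backup.yaml in a directory that is no longer a mountpoint. `only_stale_backup_left` and `recover` are hypothetical names, and confirming that the ZFS dataset is really gone (as done with zfs list in the transcript) is still left to the operator.

```python
import os
import subprocess
import tempfile


def only_stale_backup_left(container_dir):
    """Return True if the directory holds nothing but a stray backup.yaml.

    Assumption: a directory that is still a mountpoint must not be touched.
    """
    if not os.path.isdir(container_dir):
        return False
    if os.path.ismount(container_dir):
        return False
    return os.listdir(container_dir) == ["backup.yaml"]


def recover(container_dir, name, run=subprocess.run):
    """Remove the stray backup.yaml, then re-issue `lxc delete <name>`.

    Returns False without changing anything if the directory does not
    look like the stale state described in this report.
    """
    if not only_stale_backup_left(container_dir):
        return False
    os.remove(os.path.join(container_dir, "backup.yaml"))
    run(["lxc", "delete", name], check=True)
    return True


if __name__ == "__main__":
    # Demo on a throwaway directory; the runner only prints,
    # so the lxc command is not actually invoked.
    demo_dir = tempfile.mkdtemp()
    open(os.path.join(demo_dir, "backup.yaml"), "w").close()
    recover(demo_dir, "aaa-container",
            run=lambda cmd, check: print("would run:", cmd))
    os.rmdir(demo_dir)
```

On an affected server, `recover` would be called with the container directory from the error message and the container name; the `run` parameter exists only so the command can be stubbed out when trying the sketch.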

Information to attach

I don't think any of this information is relevant: there are no container logs or anything, because the container is deleted. Let me know if that assumption is incorrect.

Source

fwaggle

Most helpful comment

https://github.com/lxc/lxd/pull/6560/commits/7199afba981ece28b40d5230e832307f3b3e0823 in https://github.com/lxc/lxd/pull/6560 handles this type of race. So we've literally written a fix for this accidentally earlier today :)

stgraber on 6 Dec 2019

🎉2

All 4 comments

stgraber on 6 Dec 2019

🎉2

3.19 will have a completely rewritten storage layer so any existing storage bug will most likely be gone, possibly replaced by new, different bugs (as tends to happen when replacing such a large piece of code).

stgraber on 6 Dec 2019

ACK, so should I leave this open, or close it and see if the behaviour shows up again in 3.19?

fwaggle on 6 Dec 2019

I'll close it when I merge 6560

stgraber on 6 Dec 2019

❤1
