Lxd: Container won't start, monitor is hung.

Created on 17 Apr 2018 · 45Comments · Source: lxc/lxd

Required information

Distribution: Ubuntu
Distribution version: 16.04.04
The output of "lxc info" or if that fails:
```
$ lxc info
config:
core.https_address: '[::]:8443'
core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
  
  api_status: stable
  
  api_version: "1.0"
  
  auth: trusted
  
  public: false
  
  auth_methods:
- tls
  
  environment:
  
  addresses:
- 192.168.1.1:8443
- '[2604:6000:6342:::::]:8443'
- '[2604:6000:6342:::::]:8443'
- '[2604:6000:6342:::::]:8443'
- 192.168.122.1:8443
  
  architectures:
- x86_64
- i686
  
  certificate: |
  
  -----BEGIN CERTIFICATE-----
  
  MIIFRDCCAyygAwIBAgIRAJOZkpRrAKHaoGlOrtgqFL4wDQYJKoZIhvcNAQELBQAw
  
  NTEcMBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBO
  
  ZXBoZWxlMB4XDTE3MTAwNDIwMzU0OVoXDTI3MTAwMjIwMzU0OVowNTEcMBoGA1UE
  
  ChMTbGludXhjb250YWluZXJzLm9yZzEVMBMGA1UEAwwMcm9vdEBOZXBoZWxlMIIC
  
  IjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAt75i6L7oq8xyuD3sdhc/NXAN
  
  d+aeRbbxThGWURJUWxRY9ZDz/gPZr5hCfSPUlnvdL6ieWIqs2otdN3di84yKik3o
  
  EGS5QUJAa3yofa6HkBCUpyUXEFNGG4k8aiuLBv1R46Xgj/Zv8sqxPuKcrU5Y/Ckc
  
  KhHTq0o7xosU/XStq4SpL0Wh0prxEzj6P9luqCLm2+80L5Ri/uRVvOBtyjJqXbPz
  
  N3qnYHBxxHtZo4ipH/qWesajglAKdyZnFRnbdhJbRmmAkHJ3W7LG5L80WcGPKY1i
  
  zq3BfqRbKpwZ8wB2VFs6QnSu5wQKiIrUHOTutvHORFnGvymWuNL9eg+XkEc59RBK
  
  QCu6X+h2CT1BDxzGbUSdaUwQ2NFg/wllNbJDomt6mVxCAoEmZU+vnINqDzCXzmpR
  
  HeqLnu1Nq4IscXn9cJ+Vlvm5Kqd69XTsYIkt0R1k83CtMN6FxCGCZg1Jm9hKWFdr
  
  2ib/g6xmUleT/uI4ElQzYpuss6s959/eamA1REFoj26v8MWIQp2xnWQ34imk/QNV
  
  /HdHvGXpTIS06AOfuT9Jc0k3zjtbzJ/psjeCAiETKozm5vxJ4VB396OJb2eOJhsw
  
  4ULKIunb10M2AJ2UOo7JVv3YZrRyLJPKRUUew0CY1uTHEjZhyw4eXtxH2K0+OEY0
  
  FgaX/wEs23MIGex5VMMCAwEAAaNPME0wDgYDVR0PAQH/BAQDAgWgMBMGA1UdJQQM
  
  MAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwGAYDVR0RBBEwD4IHTmVwaGVsZYcE
  
  wKgBAjANBgkqhkiG9w0BAQsFAAOCAgEAhQpGVhgiBVbs3bS4lpyThPrjMBoLvCh2
  
  6lCXO4YkHO9bcJ3G358LTgLALjs3j1bzk5qHk+Z63Tdsgr0oDCcDLX9HPHkkYQUf
  
  mihO6o1mINOcQ6jjz0y7KGfxD/GexcN7/WH2htbO1VF2p6oBdNsxi0G/lur5ASwt
  
  dJrDRQi7ifP00/DUWKNgZoLQOG5z/hfsjvocLZwmwGMqgY+xW5Gh7vCfljiCPL+x
  
  tKZOpKaZ6FxwY7OlsW802P27sCftRrwn0rvw6jsyY2KHo+LwQxlVHmJtxnT2diZn
  
  NQXFhBT921gzX1kkJdP26JHDlJhv0iL7A3JHt8QzAdxXNfBIQ/Dwu5fILjFbncyS
  
  8y1ulaLQd/fybi1dSaJgdHMnojMQV3c7mydUjB1j21+VlaW14rRmSwCnr3cZtg3m
  
  OndA7luHKvvthekvTyiI+ioDsDsO1DFc6YcljI5nVZkjeBTBJF1ornRnb+TP+XQl
  
  PlG3uSNG0UGoDm7y/wv8WZUwGU+BFiTIsSPyEeYWY92ePYgoLHy1TkaUp17GwBJI
  
  XdARg7L/kwy0pwgLPSiSkSoH1lXDQmSf7RFzrpaUkEcKG7n/tXPR0yCYRy5L1WPM
  
  ESsFJryCN1MW1ftb2Xu0+K15AHvYaAlQbXhTPKR4vPwiWFN9fZiIFXOhhhQoVMOc
  
  wUPmTgDyfNA=
  
  -----END CERTIFICATE-----
  
  certificate_fingerprint: 203defdb53eb4eb5d574fb581e619836f89525e6fed7b29689106fb0cac44010
  
  driver: lxc
  
  driver_version: 3.0.0
  
  kernel: Linux
  
  kernel_architecture: x86_64
  
  kernel_version: 4.16.0-041600-generic
  
  server: lxd
  
  server_pid: 2635
  
  server_version: 3.0.0
  
  storage: btrfs
  
  storage_version: "4.4"
  
  server_clustered: false
  
  server_name: Nephele

# Issue description

Certain container fails to start
Full system reboot fixes it, but restarting lxd alone does not seem to
# Steps to reproduce

 ```
$lxc start container
 --hangs, ctrl-c
$lxc list container
 --container|error|etc
$lxc info container --show-log
 --error: monitor is hung
$sudo kill -9 [[[relevant lxc monitor]]]
$lxc start container
 --hangs, ctrl-c
$lxc info container --show-log
 --error:monitor is hung
$sudo kill -9 [[[relevant lxc monitor]]]
$snap restart lxd

---rinse and repeat, even after $snap restart lxd

I don't need this container, so I won't reset my system until we can figure this out
Let me know if you need any more logs etc

Source

19wolf

Most helpful comment

The problem is that net_mutex is exclusive, and it's taken for both copy_net_ns() and cleanup_net(). The second function makes many heavy sleepable actions under net_mutex such as taking rtnl_lock() and rcu synchronization. So, it's impossible to create new net before cleanup_net() is finished.

I've submitted patches to change the locking of these functions, solving this problem. They will be in 4.17, while the kernel in the issue is 4.16.

tkhai on 18 Apr 2018

👍4

All 45 comments

Can you strace the monitor process, please?

brauner on 17 Apr 2018

Can you run lxc monitor --pretty --type logging in a separate terminal as you start the container? that may give us some pointer as to what LXD's doing.

stgraber on 17 Apr 2018

$ lxc monitor --pretty --type logging
DBUG[04-17|21:22:54] New event listener: ae5db3fc-cf15-4fd6-a326-bcd5c5a36138
DBUG[04-17|21:22:59] handling                                 ip=@ method=GET url=/1.0
DBUG[04-17|21:23:00] handling                                 ip=@ method=GET url=/1.0/containers/Downloader
DBUG[04-17|21:23:00] handling                                 ip=@ method=GET url=/1.0/events
DBUG[04-17|21:23:00] New event listener: e2cbadaa-d449-4e31-b018-0e6f252fdca5
DBUG[04-17|21:23:00] handling                                 ip=@ method=PUT url=/1.0/containers/Downloader/state
DBUG[04-17|21:23:00] New task operation: 83f17b29-da58-4bfc-a562-a3dbe8c19d6a
DBUG[04-17|21:23:00] Started task operation: 83f17b29-da58-4bfc-a562-a3dbe8c19d6a
DBUG[04-17|21:23:00] handling                                 ip=@ method=GET url=/1.0/operations/83f17b29-da58-4bfc-a562-a3dbe8c19d6a
DBUG[04-17|21:23:00] Initializing a BTRFS driver.
DBUG[04-17|21:23:00] Mounting BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:00] Mounting BTRFS storage pool "default".
DBUG[04-17|21:23:00] Mounted BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:00] Mounting BTRFS storage pool "default".
DBUG[04-17|21:23:00] Mounting BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:00] Mounted BTRFS storage volume for container "Downloader" on storage pool "default".
INFO[04-17|21:23:00] Starting container                       action=start created="2017-10-05 02:21:09 +0000 UTC" ephemeral=false name=Downloader stateful=false used="2018-04-17 20:04:06 +0000 UTC"
DBUG[04-17|21:23:01] handling                                 ip=@ method=GET url=/1.0
DBUG[04-17|21:23:01] handling                                 ip=@ method=GET url=/internal/containers/9/onstart
DBUG[04-17|21:23:01] Mounting BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:01] Initializing a BTRFS driver.
DBUG[04-17|21:23:01] Mounting BTRFS storage pool "default".
DBUG[04-17|21:23:01] Mounted BTRFS storage volume for container "Downloader" on storage pool "default".
DBUG[04-17|21:23:01] Scheduler: container Downloader started: re-balancing

after lxc info/monitor is hung the Dbug gives

DBUG[04-17|21:25:24] handling                                 ip=@ method=GET url=/1.0
DBUG[04-17|21:25:24] Disconnected event listener: e2cbadaa-d449-4e31-b018-0e6f252fdca5
DBUG[04-17|21:25:24] handling                                 ip=@ method=GET url=/1.0/containers/Downloader

Can you strace the monitor process, please?

Not really, the specific monitor process doesn't exist until I attempt to start the container and then it gives no output.

19wolf on 17 Apr 2018

After I kill the lxc-monitor the dbug shows

EROR[04-17|21:32:28] balance: Unable to set cpuset            err="Failed to set cgroup cpuset.cpus=\"0,1\": setting cgroup item for the container failed" name=Downloader value=0,1
EROR[04-17|21:32:28] Failed starting container                action=start created="2017-10-05 02:21:09 +0000 UTC" ephemeral=false name=Downloader stateful=false used="2018-04-17 20:04:06 +0000 UTC"
DBUG[04-17|21:32:28] Failure for task operation: 83f17b29-da58-4bfc-a562-a3dbe8c19d6a: Failed to run: /snap/lxd/current/bin/lxd forkstart Downloader /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/Downloader/lxc.conf:
EROR[04-17|21:32:28] Failed to apply network priority         container=Downloader err="Failed to set network device priority: Can't set cgroups on a stopped container"

19wolf on 17 Apr 2018

Oh, I thought we were talking about [lxc monitor]. Nm, can you show the container's config located under:

/var/log/lxd/Downloader/lxc.conf

please?

brauner on 17 Apr 2018

$ cat /var/log/lxd/Downloader/lxc.conf
cat: /var/log/lxd/Downloader/lxc.conf: No such file or directory

$ lxc config show Downloader
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu xenial amd64 (20171004_03:49)
  image.os: Ubuntu
  image.release: xenial
  image.serial: "20171004_03:49"
  limits.cpu.allowance: 10%
  limits.cpu.priority: "1"
  limits.network.priority: "1"
  linux.kernel_modules: ip_tables, ip6_tables
  security.privileged: "true"
  volatile.base_image: 369f84766c53e842cb0239d51c4b78743bd9d4b48dbbfde2020c1e6d9604760f
  volatile.eth0.hwaddr: 00:16:3e:8c:92:4c
  volatile.idmap.base: "0"
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
devices:
  hello:
    path: /data/hello
    source: /data/hello
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

19wolf on 17 Apr 2018

Hm, you're using the snap. What's the correct path to show the container's config for the snap, @stgraber?

Also, please show:

cat /proc/1/cgroup
findmnt | grep cgroup

brauner on 17 Apr 2018

Not sure if timing matters for this- while the container is trying to start or whenever -- the container is not trying to start during this output

$ cat /proc/1/cgroup
12:memory:/init.scope
11:freezer:/
10:blkio:/init.scope
9:hugetlb:/
8:cpuset:/
7:devices:/init.scope
6:rdma:/
5:cpu,cpuacct:/init.scope
4:pids:/init.scope
3:perf_event:/
2:net_cls,net_prio:/
1:name=systemd:/init.scope

$ findmnt | grep cgroup
│ ├─/sys/fs/cgroup                    tmpfs                  tmpfs       rw,mode=755
│ │ ├─/sys/fs/cgroup/systemd          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
│ │ ├─/sys/fs/cgroup/net_cls,net_prio cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
│ │ ├─/sys/fs/cgroup/perf_event       cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,perf_event,release_agent=/run/cgmanager/agents/cgm-release-agent.perf_event
│ │ ├─/sys/fs/cgroup/pids             cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,pids,release_agent=/run/cgmanager/agents/cgm-release-agent.pids
│ │ ├─/sys/fs/cgroup/cpu,cpuacct      cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
│ │ ├─/sys/fs/cgroup/rdma             cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,rdma,release_agent=/run/cgmanager/agents/cgm-release-agent.rdma
│ │ ├─/sys/fs/cgroup/devices          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,devices
│ │ ├─/sys/fs/cgroup/cpuset           cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,cpuset,clone_children
│ │ ├─/sys/fs/cgroup/hugetlb          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,hugetlb,release_agent=/run/cgmanager/agents/cgm-release-agent.hugetlb
│ │ ├─/sys/fs/cgroup/blkio            cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,blkio
│ │ ├─/sys/fs/cgroup/freezer          cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,freezer
│ │ └─/sys/fs/cgroup/memory           cgroup                 cgroup      rw,nosuid,nodev,noexec,relatime,memory

19wolf on 17 Apr 2018

/var/snap/lxd/common/lxd/logs/Downloader/lxc.conf and /var/snap/lxd/common/lxd/logs/Downloader/lxc.log may be useful

stgraber on 18 Apr 2018

$ cat /var/snap/lxd/common/lxd/logs/Downloader/lxc.conf
lxc.log.file = /var/snap/lxd/common/lxd/logs/Downloader/lxc.log
lxc.log.level = warn
lxc.console.buffer.size = auto
lxc.console.size = auto
lxc.console.logfile = /var/snap/lxd/common/lxd/logs/Downloader/console.log
lxc.cap.drop = sys_time sys_module sys_rawio
lxc.mount.auto = proc:mixed sys:mixed
lxc.autodev = 1
lxc.pty.max = 1024
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir,optional
lxc.mount.entry = /dev/fuse dev/fuse none bind,create=file,optional
lxc.mount.entry = /dev/net/tun dev/net/tun none bind,create=file,optional
lxc.mount.entry = /proc/sys/fs/binfmt_misc proc/sys/fs/binfmt_misc none rbind,create=dir,optional
lxc.mount.entry = /sys/fs/fuse/connections sys/fs/fuse/connections none rbind,create=dir,optional
lxc.mount.entry = /sys/fs/pstore sys/fs/pstore none rbind,create=dir,optional
lxc.mount.entry = /sys/kernel/debug sys/kernel/debug none rbind,create=dir,optional
lxc.mount.entry = /sys/kernel/security sys/kernel/security none rbind,create=dir,optional
lxc.include = /snap/lxd/current/lxc/config//common.conf.d/
lxc.cgroup.devices.deny = a
lxc.cgroup.devices.allow = b *:* m
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = c 136:* rwm
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
lxc.cgroup.devices.allow = c 1:7 rwm
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
lxc.cgroup.devices.allow = c 5:0 rwm
lxc.cgroup.devices.allow = c 5:1 rwm
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 10:229 rwm
lxc.cgroup.devices.allow = c 10:200 rwm
lxc.arch = linux64
lxc.hook.pre-start = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd 9 start
lxc.hook.post-stop = /snap/lxd/current/bin/lxd callhook /var/snap/lxd/common/lxd 9 stop
lxc.tty.max = 0
lxc.uts.name = Downloader
lxc.mount.entry = /var/snap/lxd/common/lxd/devlxd dev/lxd none bind,create=dir 0 0
lxc.apparmor.profile = lxd-Downloader_</var/snap/lxd/common/lxd>//&:lxd-Downloader_<var-snap-lxd-common-lxd>:
lxc.seccomp.profile = /var/snap/lxd/common/lxd/security/seccomp/Downloader
lxc.environment = http_proxy=
lxc.cgroup.cpu.shares = 115
lxc.cgroup.cpu.cfs_period_us = 100000
lxc.rootfs.path = dir:/var/snap/lxd/common/lxd/containers/Downloader/rootfs
lxc.mount.entry = /var/snap/lxd/common/lxd/devices/Downloader/disk.hello.data-hello data/Downloads none bind,create=dir
lxc.net.0.type = veth
lxc.net.0.flags = up
lxc.net.0.link = br0
lxc.net.0.hwaddr = 00:16:3e:8c:92:4c
lxc.net.0.name = eth0
lxc.mount.entry = /var/snap/lxd/common/lxd/shmounts/Downloader dev/.lxd-mounts none bind,create=dir 0 0

$ cat /var/snap/lxd/common/lxd/logs/Downloader/lxc.log
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-1" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-1"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-2" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-2"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-3" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-3"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-4" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-4"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-5" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-5"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-6" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-6"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-7" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-7"
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:create_path_for_hierarchy:1739 - The cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-8" already existed
lxc 20180417213257.375 ERROR    lxc_cgfsng - cgroups/cgfsng.c:cgfsng_create:1826 - Failed to create cgroup "/sys/fs/cgroup/systemd//lxc/Downloader-8"
lxc 20180417213415.851 ERROR    lxc_container - lxccontainer.c:wait_on_daemonized_start:815 - No such file or directory - Failed to receive the container state
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.851 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"
lxc 20180417213415.852 WARN     lxc_commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_cgroup"

19wolf on 18 Apr 2018

snap info lxd?

stgraber on 18 Apr 2018

$ snap info lxd
name:      lxd
summary:   System container manager and API
publisher: canonical
contact:   https://github.com/lxc/lxd/issues
license:   unknown
description: |
  LXD is a container manager for system containers.

  It offers a REST API to remotely manage containers over the network, using an image based workflow
  and with support for live migration.

  Images are available for all Ubuntu releases and architectures as well as for a wide number of other
  Linux distributions.

  LXD containers are lightweight, secure by default and a great alternative to virtual machines.
commands:
  - lxd.benchmark
  - lxd.check-kernel
  - lxd.database
  - lxd.lxc
  - lxd
  - lxd.migrate
services:
  lxd.daemon: simple, enabled, inactive
snap-id:   J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:  stable
refreshed: 2018-04-09T19:13:08-04:00
installed:       3.0.0       (6643) 56MB -
channels:                           
  stable:        3.0.0       (6643) 56MB -
  candidate:     3.0.0       (6729) 56MB -
  beta:          ↑                       
  edge:          git-7c74528 (6747) 56MB -
  2.0/stable:    2.0.11      (6627) 27MB -
  2.0/candidate: 2.0.11      (6627) 27MB -
  2.0/beta:      ↑                       
  2.0/edge:      git-d71807e (6630) 25MB -
  3.0/stable:    3.0.0       (6649) 56MB -
  3.0/candidate: 3.0.0       (6730) 56MB -
  3.0/beta:      ↑                       
  3.0/edge:      git-bd47a49 (6716) 56MB -

19wolf on 18 Apr 2018

Anything suspicious looking in dmesg?

stgraber on 18 Apr 2018

So one unlikely thing could be that something inside the container might create subcgroups and sets cpuset values. Then LXD won't be able to change them anymore because in cgroup v1 they don't apply recursively unless a specific flag is set that has been added to the kernel just recently.

brauner on 18 Apr 2018

@brauner we're not seeing any cgroup related failure though. The failure is:

lxc 20180417213415.851 ERROR    lxc_container - lxccontainer.c:wait_on_daemonized_start:815 - No such file or directory - Failed to receive the container state

stgraber on 18 Apr 2018

@19wolf can you post ps fauxww when lxc start is hanging too?

stgraber on 18 Apr 2018

@stgraber that's what I was going by

EROR[04-17|21:32:28] balance: Unable to set cpuset            err="Failed to set cgroup cpuset.cpus=\"0,1\": setting cgroup item for the container failed" name=Downloader value=0,1

brauner on 18 Apr 2018

lxc 20180417213415.851 ERROR lxc_container - lxccontainer.c:wait_on_daemonized_start:815 - No such file or directory - Failed to receive the container state

Doesn't really tell you anything. It just means that the socket was closed because the container failed to start.

brauner on 18 Apr 2018

dmesg output from only after attempting to launch

245357.391787] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180105/exfield-427)
[245357.394708] No Local Variables are initialized for Method [_PMM]
[245357.394709] No Arguments are initialized for method [_PMM]
[245357.394716] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180105/psparse-550)
[245357.396685] ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180105/power_meter-338)
[245415.545209] audit: type=1400 audit(1524004327.881:284): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="lxd-Downloader_</var/snap/lxd/common/lxd>" pid=49828 comm="apparmor_parser"
[245415.590049] br0: port 20(vethX87JPV) entered blocking state
[245415.590052] br0: port 20(vethX87JPV) entered disabled state
[245415.590222] device vethX87JPV entered promiscuous mode
[245415.590411] IPv6: ADDRCONF(NETDEV_UP): vethX87JPV: link is not ready

19wolf on 18 Apr 2018

I think this is the relevant output? The only things since 12pm today

$ps fauxww
nephele  50472  0.0  0.0  27332  2932 ?        Ss   18:35   0:00 SCREEN lxc start Downloader
nephele  50473  1.0  0.0  64620 11252 pts/1    Ssl+ 18:35   0:00  \_ lxc start Downloader
root     50502  0.0  0.0 122500  6808 ?        Ds   18:35   0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers Downloader

19wolf on 18 Apr 2018

stuck in I/O wait, that's no good.

Can you post cat /proc/50502/stack?

stgraber on 18 Apr 2018

@brauner The failed rebalance is expected, it was hanging on the container monitor process, once the monitor died, it continued but the container isn't running so the cgroups can't be updated. The question is why isn't the container running :)

stgraber on 18 Apr 2018

👍1

That's why I was asking for the strace of the [lxc monitor] before. :)

brauner on 18 Apr 2018

@brauner strace won't help you much on a process stuck in I/O wait though, it'll just hang strace too :)

stgraber on 18 Apr 2018

I couldn't have known that it was stuck in Ds then though. :)

brauner on 18 Apr 2018

Exciting :)

brauner on 18 Apr 2018

I killed the monitor so I ran it again, hence different pid

$sudo cat /proc/52048/stack
[<0>] copy_net_ns+0xa6/0x230
[<0>] create_new_namespaces+0x11f/0x1d0
[<0>] copy_namespaces+0x73/0xa0
[<0>] copy_process.part.34+0x949/0x1ab0
[<0>] _do_fork+0xcf/0x3b0
[<0>] SyS_clone+0x19/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

19wolf on 18 Apr 2018

Ok, this is a bug that @Blub and I have seen multiple times.

brauner on 18 Apr 2018

I think that @Blub has already sent something to lkml but he's currently on vacation, I think.

brauner on 18 Apr 2018

See also similar and in earlier kernels (just for me an @Blub): https://github.com/lxc/lxc/issues/2141

brauner on 18 Apr 2018

So do we have the information we need? If so I'd like to restart my system and get the container back (it's not necessary, but I would like it online).

It seems to fail a few times a month (more specifically, an app running in the container fails and it's easiest to just restart the container, but about every 3rd time I do, it doesn't come back up)

19wolf on 18 Apr 2018

Oh yeah, it's that kernel bug... though what's weird is that @19wolf is reporting a 4.16 kernel here so that should have the fix.

stgraber on 18 Apr 2018

The kernel networking code is currently pretty heavily refactored. A bunch of operations have been marked as async that weren't before so the regression potential is there.

brauner on 18 Apr 2018

Anything exciting about br0?
Does it only happen with privileged containers?

brauner on 18 Apr 2018

I think the last time someone touched that code significantly it was Kirill from the Virtuozzo guys. I'll ping him tomorrow and see what he thinks.

brauner on 18 Apr 2018

https://patchwork.kernel.org/patch/10066971/

brauner on 18 Apr 2018

br0 has a lot of veths, but I think that's normal for me.
I just tested it, seems anything I start/restart will hang, privileged or not.

19wolf on 18 Apr 2018

Hi @tkhai, I'm just gonna rope you in here for a second. Seems we're running into each other quite often recently. :) We're seeing an issue with copy_net_ns() that we've seen before and I think you're one of the last persons to touch that code. Any idea what's going on?:

$sudo cat /proc/52048/stack
[<0>] copy_net_ns+0xa6/0x230
[<0>] create_new_namespaces+0x11f/0x1d0
[<0>] copy_namespaces+0x73/0xa0
[<0>] copy_process.part.34+0x949/0x1ab0
[<0>] _do_fork+0xcf/0x3b0
[<0>] SyS_clone+0x19/0x20
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

brauner on 18 Apr 2018

I've submitted patches to change the locking of these functions, solving this problem. They will be in 4.17, while the kernel in the issue is 4.16.

tkhai on 18 Apr 2018

👍4

@tkhai, cool! Thanks for stopping by! :)

brauner on 18 Apr 2018

Closing as this is a kernel issue which has already been resolved upstream.
A temporary workaround may be to try another kernel version while waiting for the fix to make it to your current kernel. Or if it's your own kernel, then cherry-pick the fix.

stgraber on 18 Apr 2018

I'm updating to 4.17-rc1, hopefully the issue doesn't recur

19wolf on 18 Apr 2018

Hiya, sorry to dig up old thread but this looks like the same issue I'm having with a nested container...

Difference is I'm trying 4.15.0-42-generic kernel

Is this something still an issue on the latest kernels?

Cheers,
Jon

root 22999 0.0 0.0 297888 7628 ? Ds Nov27 0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers telegraf1 root@gns3vm:/home/jclayton# cat /proc/22999/stack [<0>] copy_net_ns+0xab/0x230 [<0>] create_new_namespaces+0x11b/0x1e0 [<0>] copy_namespaces+0x6d/0xa0 [<0>] copy_process.part.35+0x941/0x1ad0 [<0>] _do_fork+0xdf/0x3f0 [<0>] SyS_clone+0x19/0x20 [<0>] do_syscall_64+0x73/0x130 [<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [<0>] 0xffffffffffffffff root@gns3vm:/home/jclayton# uname -r 4.15.0-42-generic