Nomad: Nomad exec driver leaks cgroups, causing the host system to run out of memory

Created on 9 Dec 2019  ·  3 Comments  ·  Source: hashicorp/nomad

Nomad version

Reproduced with:

  • Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)
  • Nomad v0.10.2 (0d2d6e3dc5a171c21f8f31fa117c8a765eb4fc02)

Operating system and Environment details

Reproduced with Linux kernels:

  • Ubuntu 4.15.0-1050-gcp
  • ArchLinux 4.19.87-1-lts

Issue

Nomad does not remove the cgroups of terminated exec tasks.
As a result, the kernfs_node_cache and task_struct SLAB caches consume more and more memory on the host system.
Eventually the host becomes unstable: it runs out of memory, starts to swap, and then page allocation failures occur.

Reproduction steps

1.) Start a batch job via Nomad that:

  • runs a command that is available in the exec chroot and finishes quickly, e.g. /bin/ls
  • runs periodically every 1 second (optionally with prohibit_overlap = true)

2.) Monitor the host system:

  • Monitor the number of cgroups created by Nomad,
    e.g. via watch -n 1 'find $(ls /sys/fs/cgroup/*/nomad -d) -type d| wc -l'; the number grows continuously
  • Monitor the slab caches via slabtop -s c -d1; the kernfs_node_cache and task_struct caches grow continuously

Eventually the system runs out of available memory, starts swapping, and page allocation failures occur.
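
The two monitoring commands above can also be combined into a small script that is easier to log over time. The following is only a minimal sketch in Python, assuming a cgroup v1 layout under /sys/fs/cgroup and the usual /proc/slabinfo column order (name, active objects, total objects, object size); the function names count_nomad_cgroups and parse_slabinfo are hypothetical helpers, not part of Nomad.

```python
import glob
import os

# Slab caches named in this report; adjust if other caches are of interest.
WATCHED_CACHES = ("kernfs_node_cache", "task_struct")


def count_nomad_cgroups(root="/sys/fs/cgroup"):
    """Count directories under every <root>/<controller>/nomad tree.

    Roughly mirrors: find $(ls <root>/*/nomad -d) -type d | wc -l
    (the nomad directory itself is counted, as find does).
    """
    total = 0
    for parent in glob.glob(os.path.join(root, "*", "nomad")):
        for _dirpath, _dirnames, _filenames in os.walk(parent):
            total += 1
    return total


def parse_slabinfo(text, names=WATCHED_CACHES):
    """Extract (active_objs, num_objs, objsize) for the watched caches.

    Assumes the first four whitespace-separated columns of each
    /proc/slabinfo line are: name, active_objs, num_objs, objsize.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] in names:
            stats[fields[0]] = (int(fields[1]), int(fields[2]), int(fields[3]))
    return stats
```

To use it, call count_nomad_cgroups() and parse_slabinfo(open("/proc/slabinfo").read()) once per second and print both results; reading /proc/slabinfo usually requires root. Multiplying num_objs by objsize gives a rough estimate of the bytes held by each cache.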

Fix: Remove cgroups when an exec task terminates
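
Until a fixed Nomad version is deployed, leftover cgroups could in principle be cleaned up from outside Nomad. The sketch below is not Nomad's actual fix, only an illustration of the idea in Python. It assumes a cgroup v1 layout under /sys/fs/cgroup, treats a cgroup as leaked when it has no child cgroups and its cgroup.procs file is empty, and uses hypothetical helper names. Be aware that a freshly created cgroup that has not yet received its process also looks empty, so this can race with task startup.

```python
import glob
import os


def find_empty_leaf_cgroups(root="/sys/fs/cgroup"):
    """Return Nomad cgroup dirs with no child cgroups and no processes."""
    empty = []
    for parent in glob.glob(os.path.join(root, "*", "nomad")):
        for dirpath, dirnames, _filenames in os.walk(parent):
            if dirpath == parent or dirnames:
                continue  # keep the nomad parent; skip non-leaf cgroups
            try:
                with open(os.path.join(dirpath, "cgroup.procs")) as fh:
                    if fh.read().strip() == "":
                        empty.append(dirpath)
            except OSError:
                continue  # cgroup vanished or is unreadable; ignore it
    return sorted(empty)


def remove_cgroups(paths, remove=os.rmdir):
    """Try to remove each cgroup directory; return those actually removed.

    On cgroupfs, rmdir succeeds only for a cgroup without processes
    and without child cgroups, so busy cgroups are left untouched.
    """
    removed = []
    for path in paths:
        try:
            remove(path)
            removed.append(path)
        except OSError:
            pass  # still in use or already gone
    return removed
```

Calling remove_cgroups(find_empty_leaf_cgroups()) as root performs one cleanup pass; passing a different remove callable (for example a list's append method) turns it into a dry run that only reports candidates.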

Job file (if appropriate)

job "example" {
  periodic {
    cron = "*/1 * * * * * *"
    prohibit_overlap = true
  }
  datacenters = ["sandbox"]
  type = "batch"
  group "cache" {
    count = 1

    task "cgroupleak" {
      driver = "exec"
      config {
        command = "/bin/ls"
      }
      resources {
        cpu    = 20 # MHz
        memory = 10 # MB
      }
      service {
        name = "cgroupleak"
      }
    }
  }
}
theme/driver/exec  type/bug

All 3 comments

Thanks @fho. I'll investigate this and update you very soon!

Thanks a lot for the fast response and fix!

@fho anytime! It'll go out in 0.10.3. Thank you so much for reporting it.

For context, Nomad has leaked cgroups since a regression in 0.9.0 :(. If an exec task exited with a zero exit code, Nomad 0.9 didn't clean up its cgroups. Nomad 0.10.2 fixed this issue in https://github.com/hashicorp/nomad/pull/6722 . But the systemd cgroup was special, and we didn't properly clean it up; we addressed that in #6839 .

Let us know if you have any questions or further observations!
