Reproduced with:
Reproduced with Linux kernels:
Nomad does not remove cgroups for terminated exec tasks.
As a result, the kernfs_node_cache and task_struct SLAB caches use more and more memory on the host system.
The host system eventually becomes unstable: it runs out of memory, starts to swap, and then page allocation failures happen.
1.) Start a batch job via nomad that:
  - runs /bin/ls
  - is periodic (prohibit_overlap = true)
2.) Run watch -n 1 'find $(ls /sys/fs/cgroup/*/nomad -d) -type d| wc -l'; the number is continuously growing
3.) Run slabtop -s c -d1; the kernfs_node_cache and task_struct caches are continuously growing
Eventually the system runs out of available memory, swaps, and page allocation failures happen.
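For logging the same two measurements over time instead of watching them interactively, a minimal sketch follows. The helper names count_nomad_cgroups and slab_cache_bytes are hypothetical and not part of Nomad. It assumes the cgroup v1 layout used in the report (/sys/fs/cgroup/<controller>/nomad) and the usual /proc/slabinfo column order (name, active_objs, num_objs, objsize). The byte figure is only an approximation (num_objs * objsize), and it may differ from what slabtop displays.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Nomad.

# Count directories under every <controller>/nomad hierarchy (cgroup v1 layout).
# $1: cgroup mount point, defaults to /sys/fs/cgroup.
count_nomad_cgroups() {
    root="${1:-/sys/fs/cgroup}"
    # A missing hierarchy makes find fail quietly, so the count is 0.
    find "$root"/*/nomad -type d 2>/dev/null | wc -l | tr -d ' '
}

# Print "<cache> <approx bytes>" for the two caches named in the report.
# $1: slabinfo file, defaults to /proc/slabinfo (usually readable by root only).
# Columns used: $1 name, $3 num_objs, $4 objsize.
slab_cache_bytes() {
    awk '$1 == "kernfs_node_cache" || $1 == "task_struct" {
             printf "%s %.0f\n", $1, $3 * $4
         }' "${1:-/proc/slabinfo}"
}
```

Calling both functions from a loop with a sleep between iterations and appending the output to a file gives a record that can be compared before and after upgrading Nomad.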
Fix: Remove cgroups when an exec task terminates
job "example" {
  periodic {
    cron             = "*/1 * * * * * *"
    prohibit_overlap = true
  }
  datacenters = ["sandbox"]
  type        = "batch"
  group "cache" {
    count = 1
    task "cgroupleak" {
      driver = "exec"
      config {
        command = "/bin/ls"
      }
      resources {
        cpu    = 20 # MHz
        memory = 10 # MB
      }
      service {
        name = "cgroupleak"
      }
    }
  }
}
Thanks @fho . I'll investigate this and update you very soon!
Thanks a lot for the fast response and fix!
@fho anytime! It'll go out in 0.10.3. Thank you so much for reporting it.
For context, Nomad has leaked cgroups since a regression in 0.9.0 :(. If an exec task exited with a zero exit code, Nomad 0.9 didn't clean up its cgroups. Nomad 0.10.2 fixed this issue in https://github.com/hashicorp/nomad/pull/6722 . But the systemd cgroup was special, and we didn't properly clean it up; we addressed that in #6839 .
Let us know if you have any questions or further observations!