Nomad: Docker Driver Fails With Upper Limit of 262144 CPU Shares

Created on 15 Apr 2020 · 4 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.11.0 (5f8fe0afc894d254e4d3baaeaee1c7a70a44fdc6)

Operating system and Environment details

Amazon Linux 2 with a fixed head node and an auto-scaling group, with scaling driven by Nomad state using a custom cloud metric.

Issue

This came up in the course of troubleshooting issue #7681, and while my intent isn't to issue-spam you guys, I think this is a separate problem that is actively holding back some of my work, unlike the former.

Anyway, I'm experiencing a Docker Driver failure due to an apparent upper limit on CPU shares. I have tested this on c5.18xlarge instances with the following result.

I have also tested this on m5a.24xlarge instances with the following identical result.

I can't even find process_linux.go in the source, so I'm really at a loss here. Any help is greatly appreciated.

Reproduction steps

Submit a job that has more than 262144 CPU shares allocated on a large enough instance to have the job placed, and the Docker driver should fail in the manner I've described.

Source

herter4171

Most helpful comment

Hey @tgross, thanks for the heads-up. This was the first time that troubleshooting led me to a Torvalds repo, and I knew to abandon all hope without being well-versed in operating systems. Hats off to you and your team for the CPU burst capability. Running with 262144 shares on a c5.24xlarge takes up 90% of the capacity, so we'll be just fine, even if another job allocates the remainder from time to time.

herter4171 on 18 Apr 2020

👍2

All 4 comments

@herter4171 You won't find container_linux.go or process_linux.go in the Nomad or Docker codebase. The error is coming from the container runtime (runc).

https://github.com/opencontainers/runc/blob/master/libcontainer/container_linux.go#L349

shishir-a412ed on 16 Apr 2020

@shishir-a412ed, thanks for pointing me in the right direction. 262144 is all over the place there.

herter4171 on 16 Apr 2020

👍1

@herter4171 just a heads up, that value is the maximum cpu_share parameter value from the Linux kernel. I'm not sure if there's a tunable for that.
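
For anyone landing here later: 262144 is 1 << 18, which matches the MAX_SHARES constant in the kernel scheduler (the minimum, MIN_SHARES, is 1 << 1). Below is a minimal Python sketch of checking a requested CPU value against that range before submitting a job. The helper names are hypothetical and this is not how Nomad or runc handle it internally; it only illustrates the bounds discussed above.

```python
# Bounds on CPU shares as defined by the Linux scheduler
# (MIN_SHARES = 1 << 1, MAX_SHARES = 1 << 18).
MIN_CPU_SHARES = 1 << 1    # 2
MAX_CPU_SHARES = 1 << 18   # 262144


def exceeds_limit(requested: int) -> bool:
    """Return True if the requested shares are above the kernel maximum."""
    return requested > MAX_CPU_SHARES


def clamp_cpu_shares(requested: int) -> int:
    """Hypothetical helper: force a requested value into the accepted range."""
    return max(MIN_CPU_SHARES, min(requested, MAX_CPU_SHARES))


if __name__ == "__main__":
    for requested in (1000, 262144, 300000):
        print(requested, exceeds_limit(requested), clamp_cpu_shares(requested))
```

Capping the job's CPU request at 262144, as described in the most helpful comment, corresponds to the clamped value here.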