Nomad: Docker Driver Fails With Upper Limit of 262144 CPU Shares

Created on 15 Apr 2020 · 4 Comments · Source: hashicorp/nomad

Nomad version

Nomad v0.11.0 (5f8fe0afc894d254e4d3baaeaee1c7a70a44fdc6)

Operating system and Environment details

Amazon Linux 2 with a fixed head node and an auto-scaling group, with scaling driven by Nomad state using a custom cloud metric.

Issue

This came up in the course of troubleshooting issue #7681, and while my intent isn't to issue-spam you guys, I think this is a separate problem that is actively holding back some of my work, unlike the former.

Anyway, I'm experiencing a Docker Driver failure due to an apparent upper limit on CPU shares. I have tested this on c5.18xlarge instances with the following result.

I have also tested this on m5a.24xlarge instances with the following identical result.
[image: screenshot of the Docker driver error, not preserved]

I can't even find process_linux.go in the source, so I'm really at a loss here. Any help is greatly appreciated.

Reproduction steps

Submit a job that has more than 262144 CPU shares allocated on a large enough instance to have the job placed, and the Docker driver should fail in the manner I've described.

All 4 comments

@herter4171 You won't find container_linux.go or process_linux.go in the Nomad or Docker codebase. The error is coming from the container runtime (runc).

https://github.com/opencontainers/runc/blob/master/libcontainer/container_linux.go#L349

@shishir-a412ed, thanks for pointing me in the right direction. 262144 is all over the place there.

@herter4171 just a heads-up, that value is the maximum cpu_share parameter value from the Linux kernel. I'm not sure if there's a tunable for that.
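For anyone hitting the same error, a check before job submission can catch an out-of-range value before it reaches the driver. The sketch below is only an illustration: the constants assume the kernel's cgroup v1 `cpu.shares` bounds (2 and 262144, i.e. `1 << 18`), and the helper name `check_cpu_shares` is hypothetical, not part of Nomad, Docker, or runc.

```python
# Assumed bounds for the cgroup v1 cpu.shares value in the Linux kernel
# scheduler: MIN_SHARES = 2 and MAX_SHARES = 1 << 18.
MIN_CPU_SHARES = 2
MAX_CPU_SHARES = 1 << 18  # 262144


def check_cpu_shares(requested: int) -> int:
    """Return `requested` unchanged if it is within the assumed kernel bounds.

    Hypothetical helper for illustration; raises ValueError otherwise.
    """
    if requested < MIN_CPU_SHARES:
        raise ValueError(
            f"{requested} CPU shares is below the minimum of {MIN_CPU_SHARES}"
        )
    if requested > MAX_CPU_SHARES:
        raise ValueError(
            f"{requested} CPU shares exceeds the maximum of {MAX_CPU_SHARES}"
        )
    return requested


if __name__ == "__main__":
    # A request one share above the limit is rejected before submission.
    try:
        check_cpu_shares(MAX_CPU_SHARES + 1)
    except ValueError as err:
        print(f"rejected: {err}")
```

Raising early like this is a design choice for the sketch; capping the request at the maximum instead would also avoid the failure, at the cost of silently changing what the job asked for.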

Hey @tgross, thanks for the heads-up. This was the first time that troubleshooting led me to a Torvalds repo, and I knew to abandon all hope without being well-versed in operating systems. Hats off to you and your team for the CPU burst capability. Running with 262144 shares on a c5.24xlarge takes up 90% of the capacity, so we'll be just fine, even if another job allocates the remainder from time to time.
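The roughly 90% figure can be sanity-checked with a bit of arithmetic. This sketch rests on assumptions not stated in the thread: that the node's CPU capacity is counted as core count times clock speed in MHz, that one CPU share corresponds to one MHz of that capacity, and that a c5.24xlarge shows up as 96 vCPUs at about 3000 MHz. The function name `share_fraction` is hypothetical, and the real capacity is whatever the Nomad client reports for the node.

```python
# Assumed kernel maximum for cpu.shares (1 << 18).
MAX_CPU_SHARES = 1 << 18  # 262144


def share_fraction(cores: int, mhz_per_core: int,
                   shares: int = MAX_CPU_SHARES) -> float:
    """Fraction of a node's CPU capacity covered by `shares`.

    Assumes capacity = cores * mhz_per_core and 1 share == 1 MHz.
    """
    total_mhz = cores * mhz_per_core
    return shares / total_mhz


if __name__ == "__main__":
    # Assumed figures for a c5.24xlarge: 96 vCPUs at about 3000 MHz.
    print(f"{share_fraction(96, 3000):.1%}")
```

Under those assumptions the maximum share value covers about 91% of the node, which is in line with the estimate above.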
