Nomad v0.11.0 (5f8fe0afc894d254e4d3baaeaee1c7a70a44fdc6)
Amazon Linux 2, with a fixed head node and an auto-scaling group whose scaling is driven by Nomad state via a custom cloud metric.
This came up while troubleshooting issue #7681, and while I don't mean to spam you with issues, I believe this is a separate problem, and unlike the former, it is actively blocking some of my work.
Anyway, I'm experiencing a Docker driver failure due to an apparent upper limit on CPU shares. I have tested this on c5.18xlarge instances with the following result.

I have also tested this on m5a.24xlarge instances with the following identical result.

I can't even find process_linux.go anywhere in the source, so I'm really at a loss here. Any help is greatly appreciated.
To reproduce: submit a job that allocates more than 262144 CPU shares on an instance large enough for the job to be placed, and the Docker driver should fail in the manner I've described.
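For anyone else trying to reproduce this, here is a minimal job spec sketch (names and image are illustrative, not from the original report). Nomad's Docker driver passes the task's `resources.cpu` value (in MHz) to Docker as CPU shares, so any value above 262144 should trip the runc error on a node large enough to place the task:

```hcl
# Hypothetical repro job -- all names here are illustrative.
job "cpu-shares-repro" {
  datacenters = ["dc1"]

  group "repro" {
    task "burn" {
      driver = "docker"

      config {
        image   = "alpine:3.11"
        command = "sleep"
        args    = ["3600"]
      }

      resources {
        # 270000 MHz > 262144, so the Docker driver should fail
        # once the scheduler places this on a big enough instance.
        cpu    = 270000
        memory = 256
      }
    }
  }
}
```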
@herter4171 You won't find container_linux.go or process_linux.go in the Nomad or Docker codebases. The error is coming from the container runtime (runc).
https://github.com/opencontainers/runc/blob/master/libcontainer/container_linux.go#L349
@shishir-a412ed, thanks for pointing me in the right direction. 262144 is all over the place there.
@herter4171 just a heads up, that value is the maximum cpu_share parameter value from the Linux kernel. I'm not sure if there's a tunable for that.
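For reference (my reading of the kernel source, so treat as a sketch rather than authoritative): the scheduler clamps group shares to a fixed power-of-two range in kernel/sched/sched.h, which is where the 262144 comes from:

$$\mathrm{MIN\_SHARES} = 2^{1} = 2, \qquad \mathrm{MAX\_SHARES} = 2^{18} = 262144$$

This also squares with the rough 90% figure quoted below: assuming Nomad fingerprints about 3000 MHz per core on a 96-vCPU c5.24xlarge, the node advertises roughly $96 \times 3000 = 288000$ MHz, and $262144 / 288000 \approx 0.91$.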
Hey @tgross, thanks for the heads-up. This was the first time that troubleshooting led me to a Torvalds repo, and I knew to abandon all hope without being well-versed in operating systems. Hats off to you and your team for the CPU burst capability. Running with 262144 shares on a c5.24xlarge takes up 90% of the capacity, so we'll be just fine, even if another job allocates the remainder from time to time.