Hi,
for our realtime streaming cluster setup we need to set Docker's `--cpuset-cpus` flag.
I found a `HostConfig.CPUSetCPUs` field in go-dockerclient's container.go. Is it possible to set this option via the job config?
Example job spec:
```hcl
....
task "ABC" {
  driver = "docker"

  config {
    image    = "image"
    port_map { http = 8080 }
  }

  resources {
    cpusets = "0-15"
    memory  = 32000

    network {
      mbits = 1000
      port "http" {}
    }
  }
}
```
In driver/docker.go at line 755 I only found the possibility to set `CPUShares: int64(task.Resources.CPU)`.
Would it be possible to add something like this:
```go
hostConfig := &docker.HostConfig{
	// Convert MB to bytes. This is an absolute value.
	Memory:     memLimit,
	MemorySwap: memLimit, // MemorySwap is memory + swap.
	// Binds are used to mount a host volume into the container. We mount a
	// local directory for storage and a shared alloc directory that can be
	// used to share data between different tasks in the same task group.
	Binds: binds,
}

if len(task.Resources.CPUSets) != 0 {
	// Pin the container to an explicit set of cores. CPUSetCPUs takes
	// the cpuset string directly, e.g. "0-15".
	hostConfig.CPUSetCPUs = task.Resources.CPUSets
} else {
	// Convert MHz to shares. This is a relative value.
	hostConfig.CPUShares = int64(task.Resources.CPU)
}
```
and run the Docker container with CPUSetCPUs applied?
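For what it's worth, the field does work when talking to go-dockerclient directly. Here's a minimal standalone sketch (the image name and cpuset are placeholders, and depending on the go-dockerclient version HostConfig may need to be passed to StartContainer instead) showing the equivalent of `docker run --cpuset-cpus="0-15"`:

```go
package main

import (
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatal(err)
	}
	// Equivalent of `docker run --cpuset-cpus="0-15"`: restrict the
	// container to cores 0-15. "image" is a placeholder image name.
	container, err := client.CreateContainer(docker.CreateContainerOptions{
		Config:     &docker.Config{Image: "image"},
		HostConfig: &docker.HostConfig{CPUSetCPUs: "0-15"},
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := client.StartContainer(container.ID, nil); err != nil {
		log.Fatal(err)
	}
}
```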
Greetz
Hey,
Why do you need to assign the task to individual CPUs? This is likely not a feature we will support/expose to the user, as multiple jobs running on the same node could have detrimental performance effects if they have the same CPU pinning.
Hi,
we don't have that many nodes, but they have a lot of cores (64 each).
We want to deploy several system services on each host, and our measurements show that Linux fair scheduling performs poorly in this setup.
By adjusting the CPU sets we improved performance by roughly 100%.
We work in a high-performance, low-latency area and have to be careful about how we use resources.
There is a lot of HPC software that is CPU-aware and works better with CPU affinity enabled. This is one area where we often don't want the sharing of cores that comes with Nomad's MHz-based CPU allocation. It would be nice if we could specify the number of cores (physical or logical) required by a task and let Nomad manage the `--cpuset-*` parameters to the Docker driver.
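To illustrate what "let Nomad manage it" could look like: a minimal sketch (purely hypothetical, not Nomad code) where the scheduler hands a task some free cores and collapses them into the string the Docker driver would pass as `--cpuset-cpus`:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// cpusetString collapses a list of allocated core IDs into the compact
// range syntax Docker expects for --cpuset-cpus, e.g. "0-3,8,10-11".
func cpusetString(cores []int) string {
	sort.Ints(cores)
	var parts []string
	for i := 0; i < len(cores); {
		// Extend j over a run of consecutive core IDs.
		j := i
		for j+1 < len(cores) && cores[j+1] == cores[j]+1 {
			j++
		}
		if i == j {
			parts = append(parts, fmt.Sprintf("%d", cores[i]))
		} else {
			parts = append(parts, fmt.Sprintf("%d-%d", cores[i], cores[j]))
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

func main() {
	// A scheduler tracking free cores per node could hand a task these
	// cores and emit the cpuset for the Docker driver:
	fmt.Println(cpusetString([]int{0, 1, 2, 3, 8, 10, 11})) // "0-3,8,10-11"
}
```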
@Fleischloser @dvusboy Makes sense. Did my point about why this should be an internal decision by Nomad and not set by the individual job make sense though? You can imagine you have two jobs submitted by two different teams pinning on to the same cores. In that case you will have worse performance than allowing Nomad to decide pinning itself.
I get that, and that's why I don't think the job spec should have an explicit cpu-set; rather, let Nomad manage those cpu-set related parameters. It should be part of the scheduling constraints. But it may also mean, down the road, allowing Nomad to relocate tasks (if that's spec'd as OK) in order to accommodate more tasks with CPU affinity on a host.
And here's an example of messages from Gromacs implying that, with cpu/thread affinity, the application will perform significantly better:
```
Using 1 MPI thread
Using 8 OpenMP threads
NOTE: The number of threads is not equal to the number of (logical) cores
and the -pin option is set to auto: will not pin thread to cores.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).
NOTE: Thread affinity setting failed. This can cause performance degradation.
If you think your settings are correct, ask on the gmx-users list.
```
Cool, glad we are on the same page. Going to rename the title.
For most (micro)service applications this isn't necessary and the MHz-based scheduling works just fine. But many MPI/OpenMP scientific applications are more finicky and require affinity for good performance. I would say Nomad should _support pinning_ tasks to CPUs underneath the hood _when a task asks for it_. But most importantly, don't let the user specify the cpu-set explicitly in the job specification. Thanks.
Nomad has managed to fill a fairly poorly serviced niche in the distributed scheduler space with its ability to run both fullvirt and raw exec tasks; however, there is still a large focus on using the application as a tool for scheduling tasks that aren't necessarily latency-sensitive.
Teaching the Nomad scheduler resource (CPU, NUMA, PCI, shm) pinning would make it extremely valuable in the HPC, research and financial space where processing throughput is potentially less valuable than processing latency. I appreciate that the underlying implementation doesn't necessarily factor this in right now, but I cannot +1 this request hard enough.
This can all be achieved with the cgroup integration (to some extent, at least; the PCI and SHM pinning isn't really supported with this mechanism).
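For anyone wanting to experiment outside of Nomad, here's a rough sketch (assuming cgroup v1 and root privileges; the group name, cpuset, and PID are placeholders) of doing the pinning via the cpuset cgroup directly, which is essentially what `--cpuset-cpus` does under the hood:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pinToCpuset creates a cpuset cgroup (v1 hierarchy), configures it, and
// moves the given PID into it. Requires root.
func pinToCpuset(group, cpus, mems string, pid int) error {
	dir := filepath.Join("/sys/fs/cgroup/cpuset", group)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	// Both cpuset.cpus and cpuset.mems must be set before tasks can join.
	if err := os.WriteFile(filepath.Join(dir, "cpuset.cpus"), []byte(cpus), 0o644); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(dir, "cpuset.mems"), []byte(mems), 0o644); err != nil {
		return err
	}
	// Writing the PID into "tasks" applies the pinning to that process.
	return os.WriteFile(filepath.Join(dir, "tasks"), []byte(fmt.Sprint(pid)), 0o644)
}

func main() {
	// Pin this process to cores 0-3 on memory node 0.
	if err := pinToCpuset("nomad-task-abc", "0-3", "0", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```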
Even better to have:
Another use case is being able to deploy apps using Intel's DPDK libraries for fast packet processing, which would likely require CPU pinning for optimal performance.
We have JVM instances that require CPU (read: socket) pinning in order to perform well, because of memory locality ... so it's definitely a thing :)
Being able to set a constraint that only one allocation per CPU core would be a big deal for game server hosting. There are significant performance improvements if a game server can be the only thing running on its assigned core.
Kubernetes has recently announced support for exactly this - https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
With Nomad/raw Docker, specifying cpuset params is the only way to do this right now, and it doesn't scale well, due to potential clashes between tasks.
For bonus points, take into account the kernel "isolcpus" flag, which is another commonly used tweak in this space.
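Reading that flag is straightforward; a small sketch (assuming Linux, and ignoring the extended `isolcpus=domain,...` syntax) of how a scheduler could detect isolated cores and subtract them from its general pool:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isolatedCPUs returns the value of the kernel's isolcpus= boot parameter,
// e.g. "2-7", or "" if the flag is absent.
func isolatedCPUs() (string, error) {
	data, err := os.ReadFile("/proc/cmdline")
	if err != nil {
		return "", err
	}
	for _, field := range strings.Fields(string(data)) {
		if strings.HasPrefix(field, "isolcpus=") {
			return strings.TrimPrefix(field, "isolcpus="), nil
		}
	}
	return "", nil
}

func main() {
	cpus, err := isolatedCPUs()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("isolcpus: %q\n", cpus)
}
```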
This would be a really great feature to have. We'd like to be able to partition out a multicore machine and give tasks on that machine integer-value CPU resources. That is, if a machine has 8 vCPUs, we'd like to be able to run 7 tasks on that machine (reserving 1 CPU for system processes), with each container assigned to 1 vCPU (or 2, or 4, etc.). I don't care _which_ CPU it's assigned to, but I don't want to specify arbitrary MHz and I don't want the tasks to share CPUs with other tasks.
@james-masson how do you set the cpuset parameters using Nomad/raw Docker?
Typo I think, @analytically; they probably meant "Nomad exec/raw docker", both of which allow you to use cgroups/cpuset/taskset etc.
The ability to set "cpuset-cpus" is also really handy on some big.LITTLE ARM systems.
This is a super important feature for almost any networked application, so that there are no cross-core interrupts or memory accesses for incoming packets (i.e. the NIC pushes them onto the RX queue for the corresponding core, they get run up through the network stack on that core, and are then delivered to the waiting socket on that same core).
So it would be amazing if Nomad pinned the process to whichever core it decides to schedule it on, and allowed specifying pinning by either logical core or physical core (one process per logical core gives higher throughput, but one per physical core gives lower latency, as you might want for a database).
It would also be amazing to have a concept of "reserved" cores and "shared" cores: maybe you want to pin application servers to a reserved core each, and then have your 10 periodic cron jobs all share one shared core.
But if we currently request 1 logical core from Nomad for a process, and then pin to it from within the process, won't Nomad avoid scheduling any other processes on that core unless we start requesting more CPU resources than the machine has? So can we already do this?
But if we requested 2 logical cores for a process, is there any guarantee we'd get 1 physical core?
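For reference, "pinning from within the process" on Linux boils down to a sched_setaffinity call; a minimal Go sketch (core IDs are placeholders) using golang.org/x/sys/unix:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// pinSelf restricts the calling thread to an explicit set of logical CPUs.
func pinSelf(cores ...int) error {
	var set unix.CPUSet
	set.Zero()
	for _, c := range cores {
		set.Set(c)
	}
	// PID 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	// Pin to logical cores 0 and 1 (placeholders).
	if err := pinSelf(0, 1); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```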
@notnoop, would this be relevant to exec or raw_exec as well? Does it make any sense to open a new ticket for that?
Thanks @ketzacoatl - re-opening this ticket as PR #8291 only addresses docker and doesn't address the full story of having Nomad manage and pin CPUs.