Ingress-nginx: Worker Processes (auto) doesn't honor cgroup allocations

Created on 16 Aug 2018 · 19 comments · Source: kubernetes/ingress-nginx

Hey,
I've noticed that when worker-processes is left at its default value (https://github.com/kubernetes/ingress-nginx/blob/master/docs/user-guide/nginx-configuration/configmap.md#worker-processes), it defaults to the underlying node's vCPU count rather than the container's CPU allocation.

It might be better to do a calculation based on the cgroup shares, and if that's not set, fall back to the node cpu count.
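A rough way to see the difference from inside the controller pod, and a sketch of the proposed calculation (use the CFS quota when one is set, otherwise fall back to the node's CPU count; paths assume cgroup v1 and the numbers are only illustrative):

# What "worker_processes auto" effectively sees: the node's online CPUs
grep -c ^processor /proc/cpuinfo

# What the container is actually allotted: ceil(cfs_quota / cfs_period);
# a quota of -1 means no limit is set
quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
if [ "$quota" -gt 0 ]; then
  echo $(( (quota + period - 1) / period ))
else
  grep -c ^processor /proc/cpuinfo
fi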

All 19 comments

@Stono I found this https://github.com/nginxinc/docker-nginx/issues/31#issuecomment-128049321
Can you check if $(cat /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu | wc -w) returns the correct value?

I am working on a more idiomatic fix

Hey
Sorry for the delay, it's been a mad few days.

Interestingly:

www-data@ingress-nginx-internal-controller-5b597b4589-95dbl:/etc/nginx$ cat /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu | wc -w
12

That's returning the node value.

I'm setting the limits like this:

          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 1Gi

Strangely the memory cgroup is fine:

www-data@ingress-nginx-internal-controller-5b597b4589-95dbl:/etc/nginx$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
1073741824

I think maybe you're looking for this:

www-data@ingress-nginx-internal-controller-5b597b4589-95dbl:/etc/nginx$ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
50000

@Stono please post the output of cat /sys/fs/cgroup/cpu/cpu.cfs_period_us

pretty useless really:

www-data@ingress-nginx-internal-controller-844fc8dcf5-845k6:/etc/nginx$ cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000

/sys/fs/cgroup/cpu/cpu.cfs_quota_us is the one that reflects the cpu limit (the limit in millicores * 100)

round_up(/sys/fs/cgroup/cpu/cpu.cfs_period_us / /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
100000 / 50000 = 2 (cpus)

This should be used only if /sys/fs/cgroup/cpu/cpu.cfs_period_us > -1

That wouldn't be right though, as I've only given it 0.5?

@Stono actually it's right https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt

docker run -ti --rm --cpu-quota=1000000 --cpu-period=500000 -v $PWD:/app 89c251c1a806 /app/main
Period: 500000
Quota: 1000000
Cores: 2
package main

import (
    "fmt"
    "io/ioutil"
    "math"
    "path/filepath"
    "runtime"
    "strconv"
    "strings"

    libcontainercgroups "github.com/opencontainers/runc/libcontainer/cgroups"
)

func main() {
    fmt.Printf("Cores: %v\n", numCPU())
}

// numCPU returns ceil(cfs_quota / cfs_period) when a CFS quota is set,
// otherwise the node's CPU count.
func numCPU() int {
    cpus := runtime.NumCPU()

    cgroupPath, err := libcontainercgroups.FindCgroupMountpoint("cpu")
    if err != nil {
        return cpus
    }

    cpuQuota := readCgroupFileToInt64(cgroupPath, "cpu.cfs_quota_us")
    cpuPeriod := readCgroupFileToInt64(cgroupPath, "cpu.cfs_period_us")

    fmt.Printf("Period: %v\n", cpuPeriod)
    fmt.Printf("Quota: %v\n", cpuQuota)

    // -1 means no quota is set (or the file could not be read); fall back to the node CPU count.
    if cpuQuota == -1 || cpuPeriod == -1 {
        return cpus
    }

    return int(math.Ceil(float64(cpuQuota) / float64(cpuPeriod)))
}

func readCgroupFileToInt64(cgroupPath, cgroupFile string) int64 {
    contents, err := ioutil.ReadFile(filepath.Join(cgroupPath, cgroupFile))
    if err != nil {
        return -1
    }
    strValue := strings.TrimSpace(string(contents))
    if value, err := strconv.ParseInt(strValue, 10, 64); err == nil {
        return value
    } else {
        return -1
    }
}

Hmm interesting. I'm still not sure we have got this right 😂 but I'm certainly no expert.

I'll see if there is anyone at work who can wrap their head around this

@aledbf so I've been going off https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu:

Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to: 1 GCP Core

So I'm taking 1000m(illicores) to equal 1 core. My limits above are 500m, or 0.5 cores, but by your calculation it's 2 cores?

@Stono first, cgroup cpu limits != cpu cores because cgroups limits are time-based. In this case (NGINX) we just need a more suited value than real number of cores.
I've been trying to find existing calculations related to this, and this one from OpenJDK uses the same logic I posted before:
http://hg.openjdk.java.net/jdk/jdk/file/03f2bfdcb636/src/hotspot/os/linux/osContainer_linux.cpp#l499

Fair enough, like I say, I'm definitely no expert here :) I'm purely going off the Kubernetes docs, which tell me that specifying "1" for my cpu limit translates into "1" core.

I'll continue to hard code worker threads in the config map to match my limit setting in Kubernetes.

I'm purely going off the Kubernetes docs, which tell me that specifying "1" for my cpu limit translates into "1" core.

Just in case, I cannot find any helper in the k8s codebase that converts limits to cores
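
For what it's worth, the 50000 quota seen above does line up with the 500m limit once you work through how the kubelet translates a CPU limit into a CFS quota. A rough sketch of that arithmetic, assuming the kubelet's default 100ms period and cgroup v1:

# cpu: 500m  ->  CFS quota, assuming a 100000us (100ms) period
limit_millicores=500
period=100000
quota=$(( limit_millicores * period / 1000 ))   # 50000, matching cpu.cfs_quota_us above
workers=$(( (quota + period - 1) / period ))    # ceil(50000 / 100000) = 1
echo "quota=${quota}us workers=${workers}"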

Guys, our problem with "auto" being the number of cpus in the system is this:

Imagine a deployment with a 2 CPU and 1 GB RAM limit deployed to a) a kubernetes node with 64 CPUs and b) a node with 16 CPUs.

On the big machine nginx spawns 64 worker processes, which in total exceed the configured limit of 1 GB RAM. On the node with 16 CPUs it works...

I would love "auto" to mean the configured cpu limit, or the node's cpu count if the limit is greater than the number of cpus on the node.

Does this make sense to you guys?

Not sure what you mean @mmack; @aledbf's PR will attempt to work out a sensible value for the number of worker processes based on your cgroup limits, and if those aren't set, it'll default to the node's number of cpus. With the changes he's put in, a 2 cpu / 1 gb instance of nginx on a 64 vcpu host won't spawn 64 workers, as long as you set an appropriate limit (once this is merged).

I still have my reservations about the way it's being calculated, however, so in my case I'm opting to set cpu limits and then use worker-processes: "N" in my configmap to control the number of worker processes.

Either way, once this PR is merged it'll be a lot better.
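
If you want to pin worker-processes the same way, here is a sketch of doing it with kubectl (assuming the controller's ConfigMap is named nginx-configuration in the ingress-nginx namespace; adjust both to your deployment):

kubectl -n ingress-nginx patch configmap nginx-configuration \
  --type merge -p '{"data":{"worker-processes":"1"}}'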

For future readers trying to follow the logic (as I just did): the code change looks correct (to me) and does

return int(math.Ceil(float64(cpuQuota) / float64(cpuPeriod)))

Whereas a comment had a typo and said:

round_up(/sys/fs/cgroup/cpu/cpu.cfs_period_us / /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
100000 / 50000 = 2 (cpus)

Instead it probably should have been

round_up(/sys/fs/cgroup/cpu/cpu.cfs_quota_us / /sys/fs/cgroup/cpu/cpu.cfs_period_us)
round_up(50000 / 100000) = round_up(0.5) = 1 (cpu)

I think the worker processes should be based on the resource requests, not the cgroup limits.
For latency-critical components like nginx-ingress, the cpu limit needs to be set to something that is never reached, otherwise the process gets throttled and some requests can be delayed by seconds.

@aledbf Thoughts? Should I file a new issue for that?
Alternatively you could tweak the worker calculation to account for some headroom.

For latency-critical components like nginx-ingress, the cpu limit needs to be set to something that is never reached,

Agree. That's why you should not set any limits when you start using the ingress controller, at least for a day, so you can see exactly the resource utilization you have with your apps.

resource requests, not the cgroup limits.

What do you mean?

resource requests, not the cgroup limits.

What do you mean?

Right now this runs as many workers as the limit of cpus it can use. If all workers are utilized 100%, every bit of extra cpu it uses (due to the go process etc.) would cause some throttling.

So instead it would be nice if it used the resource requests to calculate the number of workers. That's something you set to a value you ideally want to utilize. Does that make sense?
But I guess it's easy enough to set it yourself using the downward api.
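
To illustrate what a requests-based calculation could look like (only a sketch, not what the controller does today: under cgroup v1 a pod's requests.cpu is reflected in cpu.shares, where 1024 shares correspond to one CPU):

shares=$(cat /sys/fs/cgroup/cpu/cpu.shares)   # e.g. 512 for requests.cpu: 500m
workers=$(( (shares + 1023) / 1024 ))         # round up to whole CPUs
echo "workers=${workers}"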

Sorry for the late pickup, but I agree with @discordianfish on (1) recommending setting requests for ingress (I talked about it here) and (2) using requests.cpu for the worker process calculation. Especially since the community, based on experiences from Google Borg, Zalando et al., is moving towards disabling limits altogether (e.g. globally on the kubelet via --cpu-cfs-quota=false). This could render PR #2990 ineffective.
