Pipeline: All containers in a TaskRun's Pod are combined to calculate required resources

Created on 8 Mar 2019 · 25 comments · Source: tektoncd/pipeline

Expected Behavior

We're only running one container per TaskRun pod at a time, so the required resources for that pod should be the max values needed amongst all the containers we're running. While we were using init containers, that's how things worked.

Actual Behavior

With containers instead of init containers, the pod's required resources are the sum of the required resources for each container. See https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resources for more info on how pod resource requirements are calculated. The result is that a TaskRun with 10 steps produces a pod that thinks it needs enough resources to run all 10 step containers concurrently, so the required resources jump dramatically from what they were with init containers.
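
For illustration, a minimal sketch (hypothetical Pod and step names) of how the scheduler sees a two-step TaskRun Pod under the current behavior:

apiVersion: v1
kind: Pod
metadata:
  name: example-taskrun-pod
spec:
  containers:
    - name: step-build       # steps execute one at a time via the entrypoint wrapper
      image: busybox
      resources:
        requests:
          cpu: "1"
    - name: step-test
      image: busybox
      resources:
        requests:
          cpu: "1"
# The scheduler sums the containers' requests, so this Pod needs a node
# with 2 CPUs free even though only one step actually runs at a time.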

Steps to Reproduce the Problem

  1. Run a TaskRun with 10 steps on a cluster that has enough resources for fewer than 10 of those step containers running concurrently (i.e. enough for any one step, but not for all 10 at once).

Additional Info

Labels: design, help wanted, kind/bug


All 25 comments

fyi @pivotal-nader-ziada @ImJasonH @bobcatfish

Given this (same for both Memory and CPU requests/limits) from https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#specify-a-memory-request-that-is-too-big-for-your-nodes

Memory requests and limits are associated with Containers, but it
is useful to think of a Pod as having a memory request and
limit. The memory request for the Pod is the sum of the memory
requests for all the Containers in the Pod. Likewise, the memory
limit for the Pod is the sum of the limits of all the Containers
in the Pod.

Pod scheduling is based on requests. A Pod is scheduled to run on
a Node only if the Node has enough available memory to satisfy
the Pod's memory request.

I believe we could choose the largest of the memory/cpu requests, stuff that on a single Container, and leave the requests unset on all the other Containers in the Pod. Containers could keep their individual memory/CPU limits as scheduling is not based on limits.

An example:

Container A requests 512MB, Container B 2048MB, Container C 1024MB.

Today:
Pod gets scheduled with a request of 3584MB, as the requests of each Container get summed up.

Proposal:
Pod gets scheduled with a request of 2048MB, as the requests for all containers are set to 0 except the largest (or first? does it matter?) container, Container B.
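
A sketch of what the rewritten Pod spec might look like for that example (container names hypothetical). Note that a later comment in this thread shows the dropped requests must be explicitly zeroed rather than left unset whenever limits are also set:

spec:
  containers:
    - name: container-a
      resources: {}          # 512MB request dropped
    - name: container-b
      resources:
        requests:
          memory: "2Gi"      # largest request, kept as-is
    - name: container-c
      resources: {}          # 1024MB request dropped
# Effective Pod memory request: 2048MB instead of 3584MB.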

Aaaaaah, that sounds reasonable - if I understand correctly, does that mean we'd effectively use the full resources (assuming no limits specified on steps) of the whole pod for each step?

To answer my own question above about leaving the resource request on the largest or first container - you leave it on the largest container. That's the only container guaranteed to have a limit greater than or equal to the largest request.

@abayer If you don't specify limits then yes, you can use as much CPU/memory as the Node has available. But I believe that behavior is unchanged from InitContainers or the current Containers implementation.

You can still set CPU/memory limits per Container and those limits will get enforced. The requests value is just to help the scheduler decide which Node your Pod gets placed on.

@bbrowning Gotcha - that's actually a nice result anyway. =)

@dwnusbaum has a memory of this coming up at the working group meeting on Wednesday, but I wasn't paying enough attention to remember myself. =) But regardless, yeah, I endorse this approach.

@bbrowning I thought the containers without any limits set would get the max or default limit. Would setting a high limit on one container override that for all containers, or do we set this limit on the pod?

If you do not specify a memory limit for a Container, one of the following situations applies:
The Container has no upper bound on the amount of memory it uses. The Container could use all of the memory available on the Node where it is running.
The Container is running in a namespace that has a default memory limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the memory limit.

There isn't anything in the API currently to set limits on a TaskRun, so @abayer doesn't that mean it's not set and you are still getting the problem? Or are you setting them somehow?

I think we're seeing it with no explicit limits set - @jstrachan, can you verify?

@pivotal-nader-ziada The scheduling problem stems from a Container's requests - not its limits. And Task Steps are just Containers - https://github.com/tektoncd/pipeline/blob/bbe504a83f66181d0662e16b596ebeb072b099c2/pkg/apis/pipeline/v1alpha1/task_types.go#L49

Containers can set memory and/or CPU requests and/or limits. Unless you're explicitly forbidding the setting of those somewhere in a webhook or reconcile loop that I missed.

Limits are enforced per Container as far as I know. Requests, on the other hand, are summed together for all Containers in a Pod and used by the scheduler when choosing node placement.

has a memory of this coming up at the working group meeting on Wednesday

Yeah, as I recall the discussion was that this was a known issue that didn't make it into the PR to switch from init container to containers, but that we definitely want to do something to fix it.

Relevant discussion starts around 9:20 in the 3/6/2019 meeting recording.

Containers can set memory and/or CPU requests and/or limits. Unless you're explicitly forbidding the setting of those somewhere in a webhook or reconcile loop that I missed.

@bbrowning you are correct, we don't forbid that anywhere

@pivotal-nader-ziada The scheduling problem stems from a Container's requests - not its limits. And Task Steps are just Containers -

@bbrowning @abayer trying to catch up here - as far as I can tell (and I think you're saying this above), the taskrun controller isn't setting any of these values; this would only come into play if a user set the request and/or limit values themselves in the Container spec for the step, is that right?

In steps to repro right now we've got:

Run a TaskRun with 10 steps on a cluster that has enough resources for fewer than 10 of those step containers running concurrently (i.e. enough for any one step, but not for all 10 at once).

_but_ you'd have to actually set the request values for the step in order for this to be a problem, is that right?

If I'm understanding correctly it seems like we'd have a few options:

  1. @bbrowning's suggestion, which I think is to take the values the user specified for request and actually change them?
  2. Warn the user but not change anything
  3. Document this behavior but not actually change any code (probably not a great solution 😅)

(Not sure I'm understanding the problem entirely tho - maybe we can see some example container specs that would have this problem?)

I don't think you do need to manually set the request values - @jstrachan hit this in Jenkins X, and we don't explicitly set request values anywhere when generating our Tasks.

Anyway, lemme create a concrete reproduction.

got it - https://gist.github.com/abayer/41cc9db17600dfe845f06002fb0492ed - when you get somewhere between 30 and 40 steps in a single Task with no explicit request, with a cluster of 3 n1-standard-4s, autoscaling up to 10, you get

message: 'pod status "PodScheduled":"False"; message: "0/3 nodes are available:
        3 Insufficient cpu."'

And fwiw, https://gist.github.com/abayer/1fe81f3813d777167590b2523a3fff6e is the same Task but with an explicit request on the first step, and zeroed-out requests on the others. This one actually works.

And it also works with no explicit request on the first step and the rest zeroed out.

Awesome, thanks for adding the repro case @abayer !

when you get somewhere between 30 and 40 steps in a single Task with no explicit request

Quick question, are you creating Tasks that have 30+ steps in them?

Apparently? =) More likely, the cases where this showed up were on a cluster running other pipelines etc., so the nodes weren't as empty as they were in my test case.

Whether you have to explicitly define requests in your steps or not depends on how your underlying Kubernetes cluster is configured. Some clusters are configured to provide default request values for all containers if one isn't otherwise specified. See https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/cpu-default-namespace/

In the cluster @abayer is using it sounds like perhaps there is a LimitRange providing default request values here.
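
For reference, a minimal LimitRange of the kind described on that page (hypothetical values); it gives every container in its namespace a default request when none is specified:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-mem-per-container
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: "256Mi"      # used when a container specifies no request
      default:
        memory: "512Mi"      # used when a container specifies no limit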

It was GKE, fwiw

I'd like to pick this up. @bbrowning's proposal to unset all of the requests except for the largest one seems good to me. A few things that I'm not sure about are:

  • What should we do if there are no requests set at all? I guess we check if there is a LimitRange in the current namespace and use its defaultRequest setting?
  • What happens if a container specifies a cpu/memory limit but not a request? This page says that for CPU, the request will be automatically set to the limit. My understanding is that all containers in a Pod for a Task start simultaneously (please correct me if that is wrong!), but their entrypoints wait so that the containers execute in the order specified in the Task. If that is the case, then the summed requests could easily exceed what the node can actually provide, but I'm not sure what would happen in practice (i.e. if every container requested the maximum amount of memory up front, would the node run out of memory, or would it be ok as long as the containers are just waiting in the entrypoint?). Here are the most relevant docs I could find, followed by a sketch of the defaulting behavior:

    If a Container exceeds its memory request, it is likely that its Pod will be evicted whenever the node runs out of memory.
    A Container might or might not be allowed to exceed its CPU limit for extended periods of time. However, it will not be killed for excessive CPU usage.
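
To illustrate the second bullet above, a sketch (hypothetical step name) of how a limit-only container gets defaulted:

# Submitted step spec: a CPU limit but no request.
- name: limit-only-step
  resources:
    limits:
      cpu: "1"
# After API-server defaulting, the persisted spec is equivalent to:
#   resources:
#     limits:
#       cpu: "1"
#     requests:
#       cpu: "1"             # request defaulted to the limit
# so this step contributes a full CPU to the Pod's summed scheduling request.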

/assign @dwnusbaum

Here's an example from some local testing I've done showing that we can't just unset requests other than the largest one. Say we have a cluster whose Nodes all have 5Gi of memory, and we want to run a Task with 3 steps with the following resource settings:

steps:
  - name: step1
    ...
    resources:
      limits:
        memory: "3Gi"
      requests:
        memory: "1Gi"
  - name: step2
    ...
    resources:
      limits:
        memory: "3Gi"
      requests:
        memory: "1Gi"
  - name: step3
    ...
    resources:
      limits:
        memory: "3Gi"
      requests:
        memory: "2Gi"

The resulting Pod requires a Node with 4Gi of memory, and each container has a 3Gi limit. If we delete the requests on step1 and step2 so that we only request 2Gi initially, then the limits for step1 and step2 get used as their requests and we end up requesting 8Gi of memory instead, oops!

However, based on my testing it looks like we can set the requested memory to 0 (or maybe better to use some small value like 10-50Mi for the entrypoint command?) for the containers that aren't requesting the max value. From what I can tell each container is able to use memory as needed up to its limit without subsequent containers being affected.
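
A sketch of the working variant of the spec above, with the smaller requests explicitly zeroed rather than deleted:

steps:
  - name: step1
    resources:
      limits:
        memory: "3Gi"
      requests:
        memory: "0"          # explicit zero; keeps the 3Gi limit from becoming the request
  - name: step2
    resources:
      limits:
        memory: "3Gi"
      requests:
        memory: "0"
  - name: step3
    resources:
      limits:
        memory: "3Gi"
      requests:
        memory: "2Gi"        # only the largest request is kept
# Effective Pod memory request: 2Gi, which fits on a 5Gi node.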

@dwnusbaum Great catch, and you're right we can't just clear most of the requests but instead need to explicitly request something very small or perhaps 0. I don't actually know if 0 has any special meaning compared to a very small request value.

As reported in https://github.com/tektoncd/pipeline/issues/1045: if there's a LimitRange in the namespace, setting the containers' requested resources to 0 makes the deployment not work.

