Autoscaler: [AWS EKS] Not Spinning New Node: "it wouldn't fit if a new node is added"

Created on 10 Sep 2019 · 10 comments · Source: kubernetes/autoscaler

Apologies if I'm missing something already covered in previous issues. I have tried to go through them all, and none of the suggestions has worked.

We are running into an issue on Amazon EKS where the cluster autoscaler refuses to spin up a new node because it concludes that the pending pod wouldn't fit on it:

  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   68s (x3 over 68s)  default-scheduler   0/2 nodes are available: 2 Insufficient cpu.
  Normal   NotTriggerScaleUp  9s (x6 over 60s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)

I have checked the DaemonSets and the autoscaler logs, and everything looks fine. With our existing nodes we can already fit two instances of the application; only further replicas go into Pending and become unschedulable.

The error message must be misleading: when I spun up a new worker node manually, the unschedulable pods were scheduled immediately and everything was fine.

Please let me know of any additional information I can provide to help debug the issue.


bspradling


All 10 comments

Can you post the launch configuration/template of your auto scaling group and your pod/deployment manifest? The details you provided are not enough to dig into this.

devkid on 10 Sep 2019

/sig aws

Jeffwan on 12 Sep 2019

@Jeffwan: The label(s) sig/aws cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/sig aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot on 12 Sep 2019

/platform aws

Jeffwan on 12 Sep 2019

I've seen something similar. Which version of the cluster autoscaler are you running, and which Kubernetes version is your cluster on? I think you can see this behavior when the two versions are misaligned.

exdx on 16 Sep 2019
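One quick way to follow exdx's suggestion: the cluster-autoscaler project publishes releases that track Kubernetes minor versions (for example, 1.14.x for Kubernetes 1.14), so comparing the major.minor of the autoscaler image tag with the server version reported by `kubectl version` is a reasonable first check. The helpers `major_minor` and `same_minor_version` below are hypothetical, and the version strings are illustrative values rather than ones taken from this thread.

```shell
#!/bin/sh
# Hypothetical helpers: compare only major.minor of two version strings,
# ignoring patch numbers and vendor suffixes such as "-eks-...".
major_minor() {
  # Strip a leading "v", then keep the first two dot-separated fields.
  echo "${1#v}" | cut -d. -f1,2
}

same_minor_version() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Illustrative values: autoscaler image tag vs. cluster server version.
if same_minor_version "v1.14.6" "v1.14.8-eks-b8860f"; then
  echo "versions aligned"
else
  echo "version mismatch"
fi
```

Patch versions are deliberately ignored, since only the minor version is expected to line up.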

I managed to figure out the issue. @Denton24646 was correct that it was a version mismatch between the cluster autoscaler and the cluster: there was a breaking change in which auto-discovery tags the autoscaler looks for (k8s.io vs kubernetes.io).

bspradling on 16 Sep 2019

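For anyone hitting the tag mismatch: the autoscaler's AWS documentation configures auto-discovery as `--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>`, so an Auto Scaling group tagged with a different prefix is not picked up. The hypothetical helper below reads tag keys (one per line, for example extracted from `aws autoscaling describe-tags` output) and prints each expected key that is missing. It assumes the `k8s.io/...` keys above, and `my-cluster` and the `kubernetes.io/...` keys in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: read ASG tag keys on stdin (one per line) and print
# every auto-discovery key the autoscaler is assumed to expect but is absent.
missing_autodiscovery_tags() {
  cluster="$1"
  keys="$(cat)"   # buffer stdin so it can be searched more than once
  for expected in "k8s.io/cluster-autoscaler/enabled" \
                  "k8s.io/cluster-autoscaler/$cluster"; do
    # -q: quiet, -x: whole-line match, -F: fixed string (no regex)
    if ! printf '%s\n' "$keys" | grep -qxF "$expected"; then
      echo "$expected"
    fi
  done
}

# Placeholder example: a group tagged with a kubernetes.io prefix instead.
printf '%s\n' "kubernetes.io/cluster-autoscaler/enabled" \
              "kubernetes.io/cluster-autoscaler/my-cluster" |
  missing_autodiscovery_tags my-cluster
```

In the example both expected keys are reported as missing, because exact-match lookup treats the `kubernetes.io` keys as unrelated.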

@bspradling what version was that? I'm currently hitting this but not on EKS.

JoseThen on 21 Oct 2019

I ran into this last night - in case this helps anyone else, in my case it was because I was trying to launch a process with a 2 cpu request onto an m5.large (which has 2 cpus), but I'd forgotten to take into account that my kube-proxy daemonset has a 0.1 cpu request, and my aws-node daemonset has a 0.01 cpu request, and 2.11 > 2. :P

jwalton on 15 Dec 2019
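jwalton's arithmetic is easier to see in millicores (1 CPU = 1000m). Roughly speaking, the pod's request plus the requests of every DaemonSet that will also run on the new node has to fit within that node's allocatable CPU. The sketch below uses the numbers from the comment and, as a simplifying assumption, treats the full 2 vCPUs of an m5.large as allocatable; in practice allocatable is somewhat lower.

```shell
#!/bin/sh
# Sum CPU requests in millicores and compare them with allocatable CPU.
# Values come from the comment above; using the full 2000m of an m5.large
# as allocatable is a simplifying assumption.
pod_request=2000   # the workload's 2 CPU request
kube_proxy=100     # kube-proxy DaemonSet: 0.1 CPU
aws_node=10        # aws-node DaemonSet: 0.01 CPU
allocatable=2000   # m5.large: 2 vCPUs

total=$((pod_request + kube_proxy + aws_node))
if [ "$total" -le "$allocatable" ]; then
  echo "fits: ${total}m <= ${allocatable}m"
else
  echo "does not fit: ${total}m > ${allocatable}m"
fi
```

With these values the script prints `does not fit: 2110m > 2000m`, which matches the 2.11 > 2 conclusion.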

@jwalton nice finding! In this case, we would suggest reserving compute resources for system daemons, so that the node's allocatable resources are more accurate.

Jeffwan on 26 Dec 2019
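According to the Kubernetes documentation on reserving compute resources for system daemons, a node's allocatable CPU is its capacity minus the kubelet's `--kube-reserved` and `--system-reserved` CPU amounts (for memory, the hard eviction threshold is subtracted as well). The hypothetical `allocatable_cpu_m` helper below just applies that formula. The 100m reservations are illustrative assumptions, not recommended EKS values.

```shell
#!/bin/sh
# Allocatable CPU (millicores) = capacity - kube-reserved - system-reserved.
# Hypothetical helper; reservation values below are illustrative only.
allocatable_cpu_m() {
  capacity_m="$1"
  kube_reserved_m="$2"
  system_reserved_m="$3"
  echo $((capacity_m - kube_reserved_m - system_reserved_m))
}

# m5.large (2 vCPUs) with hypothetical 100m kube-reserved and 100m system-reserved.
alloc="$(allocatable_cpu_m 2000 100 100)"
echo "allocatable: ${alloc}m"
```

These reservations cover node-level daemons such as the kubelet and the container runtime. DaemonSet pods like kube-proxy are still counted through their own requests. On the EKS-optimized AMI, kubelet flags like these are commonly passed through the `--kubelet-extra-args` option of `/etc/eks/bootstrap.sh`.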

I also found this command (from https://jaxenter.com/manage-container-resource-kubernetes-141977.html), which prints the CPU and memory requests and limits allocated on each node:

kubectl get nodes --no-headers | awk '{print $1}' | xargs -I {} sh -c 'echo {}; kubectl describe node {} | grep Allocated -A 5 | grep -ve Event -ve Allocated -ve percent -ve -- ; echo'

It produces output like this:

ip-172-16-122-95.eu-central-1.compute.internal
  Resource                    Requests      Limits
  cpu                         1920m (96%)   9300m (465%)
  memory                      3186Mi (40%)  5078Mi (65%)

ip-172-16-147-170.eu-central-1.compute.internal
  Resource                    Requests          Limits
  cpu                         1840m (92%)       15 (750%)
  memory                      1878706688 (22%)  9585354Ki (120%)

omerfsen on 12 Jan 2020

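That output is easy to post-process. As a hypothetical follow-up, the `high_cpu_nodes` filter below reads the same format (a node-name line followed by `Resource`, `cpu` and `memory` rows) and prints each node whose CPU requests are at or above a threshold percentage. The 90% default is an arbitrary assumption.

```shell
#!/bin/sh
# Hypothetical filter for the per-node output above: print nodes whose CPU
# *requests* percentage is >= the threshold (default 90).
high_cpu_nodes() {
  awk -v threshold="${1:-90}" '
    # A line with exactly one field is a node name.
    NF == 1 { node = $1; next }
    # On the cpu row, field 3 is the requests percentage, e.g. "(96%)".
    $1 == "cpu" {
      pct = $3
      gsub(/[()%]/, "", pct)
      if (pct + 0 >= threshold) print node, pct "%"
    }
  '
}
```

For example, piping the sample output above into `high_cpu_nodes 95` would print only `ip-172-16-122-95.eu-central-1.compute.internal 96%`, while the default threshold would list both nodes.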
