Autoscaler: [AWS EKS] Not Spinning New Node: "it wouldn't fit if a new node is added"

Created on 10 Sep 2019 · 10 Comments · Source: kubernetes/autoscaler

Apologies if I'm missing something posted in previous issues, but I have tried to go through them all and nothing seems to be working.

We are running into an issue where the autoscaler is refusing to spin up a new node on Amazon EKS because it thinks that the pod wouldn't fit on the new node.

  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   68s (x3 over 68s)  default-scheduler   0/2 nodes are available: 2 Insufficient cpu.
  Normal   NotTriggerScaleUp  9s (x6 over 60s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)

I have checked the daemonsets and the autoscaler logs, and everything seems to be fine. By default, two instances of the application already fit on the existing nodes; additional pods go into Pending as unschedulable.

The message I'm seeing must be misleading, because when I spun up a new worker node manually, the unschedulable pods were scheduled immediately and everything was fine.

Please let me know of any additional information I can provide to help debug the issue.

All 10 comments

Can you post the launch configuration/template of your auto scaling group and your pod/deployment manifest? The details you provided are not enough to dig into this.

/sig aws

@Jeffwan: The label(s) sig/aws cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other

In response to this:

/sig aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/platform aws

I've seen something similar to this - which version of the cluster autoscaler are you running, and which version of k8s is your cluster? I think when the versions are misaligned you can see something like this.

I managed to figure out the issue. @Denton24646 was correct that it was a version mismatch between the cluster autoscaler and the cluster. There was a breaking change in which auto-discovery tags the autoscaler looks for: k8s.io vs. kubernetes.io.
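For anyone checking the same thing, here is a minimal Python sketch of the comparison, not the autoscaler's own logic. It assumes the autoscaler is started with the key-only form of `--node-group-auto-discovery=asg:tag=<key>,<key>` and that you have already fetched the ASG's tags into a dict. The function names, the cluster name `my-cluster`, and the tag values are made up for illustration; the thread does not say which prefix belongs to which autoscaler version, so check that against your own deployment.

```python
def parse_discovery_tag_keys(flag_value):
    """Extract the tag keys from an 'asg:tag=key1,key2' flag value."""
    prefix = "asg:tag="
    if not flag_value.startswith(prefix):
        raise ValueError("expected a value starting with 'asg:tag='")
    keys = flag_value[len(prefix):].split(",")
    return {key.strip() for key in keys if key.strip()}


def missing_tag_keys(flag_value, asg_tags):
    """Return the tag keys the flag asks for that the ASG does not carry."""
    return parse_discovery_tag_keys(flag_value) - set(asg_tags)


# Illustrative values only: the flag uses one prefix, the ASG tags the other.
flag = "asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster"
asg_tags = {
    "kubernetes.io/cluster-autoscaler/enabled": "true",
    "kubernetes.io/cluster-autoscaler/my-cluster": "owned",
}

for key in sorted(missing_tag_keys(flag, asg_tags)):
    print("ASG is missing tag key:", key)
```

If any key is reported missing, the prefix mismatch described above is a likely suspect.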

@bspradling what version was that? I'm currently hitting this but not on EKS.

I ran into this last night - in case this helps anyone else, in my case it was because I was trying to launch a process with a 2 cpu request onto an m5.large (which has 2 cpus), but I'd forgotten to take into account that my kube-proxy daemonset has a 0.1 cpu request, and my aws-node daemonset has a 0.01 cpu request, and 2.11 > 2. :P
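The arithmetic in the comment above can be written out as a small Python sketch. It works in millicores to avoid floating-point rounding, and it uses the node's raw CPU count as the limit; a real node's allocatable CPU is usually somewhat lower than that, so treat this as a rough check. The function name is hypothetical and not part of any Kubernetes tool.

```python
def fits_on_new_node(pod_cpu_m, daemonset_cpu_m, node_cpu_m):
    """Check whether a pod fits on an otherwise empty node.

    All values are CPU requests in millicores. Daemonset pods land on
    every new node, so their requests count against it too.
    """
    return pod_cpu_m + sum(daemonset_cpu_m) <= node_cpu_m


# Figures from the comment above: a 2 CPU request on an m5.large (2 CPUs),
# with kube-proxy requesting 0.1 CPU and aws-node requesting 0.01 CPU.
pod_request = 2000
daemonsets = {"kube-proxy": 100, "aws-node": 10}
node_cpu = 2000

total = pod_request + sum(daemonsets.values())
print(f"requested {total}m of {node_cpu}m, fits:",
      fits_on_new_node(pod_request, daemonsets.values(), node_cpu))
```

With these numbers the total is 2110m against 2000m, which would explain why the autoscaler concludes the pod would not fit on a new node of that type.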

@jwalton nice finding! In this case, we would suggest reserving compute resources for system daemons, so that the node's allocatable resources are more accurate.

I also found

kubectl get nodes --no-headers | awk '{print $1}' | xargs -I {} sh -c 'echo {}; kubectl describe node {} | grep Allocated -A 5 | grep -ve Event -ve Allocated -ve percent -ve -- ; echo'

from https://jaxenter.com/manage-container-resource-kubernetes-141977.html which produces:

ip-172-16-122-95.eu-central-1.compute.internal
  Resource                    Requests      Limits
  cpu                         1920m (96%)   9300m (465%)
  memory                      3186Mi (40%)  5078Mi (65%)

ip-172-16-147-170.eu-central-1.compute.internal
  Resource                    Requests          Limits
  cpu                         1840m (92%)       15 (750%)
  memory                      1878706688 (22%)  9585354Ki (120%)
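
The output above mixes units (`1920m` vs. `15` for CPU; `Mi`, `Ki`, and plain bytes for memory), which makes nodes hard to compare at a glance. Below is a minimal Python sketch for normalizing such values. It only covers the suffixes listed in the two tables and is not a full Kubernetes quantity parser (no exponent notation, for instance); the function names are my own.

```python
# Binary and decimal suffixes handled by this sketch.
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '1920m' or '15' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as '3186Mi' or '1878706688' to bytes."""
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if quantity.endswith(suffix):
            return round(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)  # no suffix: already in bytes


# Values taken from the output above.
for cpu in ("1920m", "1840m", "15"):
    print(cpu, "->", cpu_millicores(cpu), "millicores")
for mem in ("3186Mi", "1878706688", "9585354Ki"):
    print(mem, "->", memory_bytes(mem), "bytes")
```

For example, the `15` CPU limit on the second node is 15000 millicores, which matches the 750% shown if that node has 2000m of CPU.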