Apologies if I'm missing something posted in previous issues, but I have tried to go through them all and nothing seems to be working.
We are running into an issue where the autoscaler is refusing to spin up a new node on Amazon EKS because it thinks that the pod wouldn't fit on the new node.
Type     Reason             Age                From                Message
----     ------             ----               ----                -------
Warning  FailedScheduling   68s (x3 over 68s)  default-scheduler   0/2 nodes are available: 2 Insufficient cpu.
Normal   NotTriggerScaleUp  9s (x6 over 60s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)
I have checked the daemonsets and the autoscaler logs, and everything seems to be fine. By default, we can already fit two instances of the application on the nodes before further pods go Pending as unschedulable.
The error message I'm seeing must be misleading, because when I spun up a new worker node manually, the unschedulable pods were scheduled immediately and everything was fine.
Please let me know of any additional information I can provide to help debug the issue.
Can you post the launch configuration/template of your auto scaling group and your pod/deployment manifest? The details you provided are not enough to dig into this.
/sig aws
@Jeffwan: The label(s) sig/aws cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other
/platform aws
I've seen something similar to this - which version of the cluster autoscaler are you running, and which version of k8s is your cluster? I think when the versions are misaligned you can see something like this.
I managed to figure out the issue. @Denton24646 was correct: it was a version mismatch between the cluster autoscaler and the cluster. There was a breaking change in which auto-discovery tags the autoscaler looks for (k8s.io vs kubernetes.io).
@bspradling what version was that? I'm currently hitting this but not on EKS.
I ran into this last night - in case this helps anyone else, in my case it was because I was trying to launch a process with a 2 cpu request onto an m5.large (which has 2 cpus), but I'd forgotten to take into account that my kube-proxy daemonset has a 0.1 cpu request, and my aws-node daemonset has a 0.01 cpu request, and 2.11 > 2. :P
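The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the autoscaler's actual code: the function name `fits` is hypothetical, values are in millicores to avoid floating-point rounding, and the node's full 2 CPUs are treated as allocatable, which is optimistic.

```python
def fits(node_allocatable_m: int, daemonset_requests_m: list[int], pod_request_m: int) -> bool:
    """Return True if the pod fits on a fresh node that already runs the daemonsets."""
    return sum(daemonset_requests_m) + pod_request_m <= node_allocatable_m


# Numbers from the comment above, converted to millicores.
node_cpu_m = 2000          # m5.large: 2 CPUs (assumed fully allocatable here)
daemonsets_m = [100, 10]   # kube-proxy 0.1 cpu, aws-node 0.01 cpu
pod_m = 2000               # the 2 cpu request

total = sum(daemonsets_m) + pod_m
print(total, fits(node_cpu_m, daemonsets_m, pod_m))  # 2110 False
```

Lowering the pod's request to 1890m or less would make it fit under these assumptions.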
@jwalton nice finding! In this case, we would suggest reserving compute resources for system daemons, so that the node's allocatable resources are reported more accurately.
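As a rough sketch of what that reservation does: the kubelet reports allocatable CPU as capacity minus whatever is set aside through its `--kube-reserved` and `--system-reserved` settings. The function name and the reservation values below are hypothetical and chosen only for illustration; check `kubectl describe node` for the real figures on your nodes.

```python
def allocatable_cpu_m(capacity_m: int, kube_reserved_m: int = 0, system_reserved_m: int = 0) -> int:
    """Allocatable CPU in millicores: capacity minus the kubelet reservations."""
    return capacity_m - kube_reserved_m - system_reserved_m


# Hypothetical reservations on a 2-CPU node.
capacity_m = 2000
allocatable_m = allocatable_cpu_m(capacity_m, kube_reserved_m=70, system_reserved_m=0)
print(allocatable_m)  # 1930
```

Pod requests plus daemonset requests have to fit within this allocatable figure, not within the raw instance size.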
I also found
kubectl get nodes --no-headers | awk '{print $1}' | xargs -I {} sh -c 'echo {}; kubectl describe node {} | grep Allocated -A 5 | grep -ve Event -ve Allocated -ve percent -ve -- ; echo'
from https://jaxenter.com/manage-container-resource-kubernetes-141977.html which produces:
ip-172-16-122-95.eu-central-1.compute.internal
  Resource  Requests          Limits
  cpu       1920m (96%)       9300m (465%)
  memory    3186Mi (40%)      5078Mi (65%)

ip-172-16-147-170.eu-central-1.compute.internal
  Resource  Requests          Limits
  cpu       1840m (92%)       15 (750%)
  memory    1878706688 (22%)  9585354Ki (120%)
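The output mixes several quantity notations (`1920m`, `15`, `3186Mi`, `9585354Ki`, plain bytes), which makes the rows hard to compare by eye. Below is a minimal sketch of normalizing them, assuming only the common suffixes; the helper names are hypothetical and this is not a full implementation of the Kubernetes quantity format (for example, it ignores exponent notation).

```python
# Binary (power-of-two) and decimal suffixes used for memory quantities.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '1920m', '15' or '0.1' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '3186Mi' or '1878706688' to bytes."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    for suffix, factor in _DECIMAL.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


print(cpu_to_millicores("15"), cpu_to_millicores("1840m"))  # 15000 1840
```

With everything in millicores and bytes, the requests on each node can be compared directly against a pending pod's request.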