We are using the cluster autoscaler on AWS. It worked once, but right now, it doesn't seem to recognize our nodes as part of any node group and is skipping them. Logs look like this:
1 static_autoscaler.go:366] Calculating unneeded nodes
1 utils.go:543] Skipping ip-10-0-1-104.eu-central-1.compute.internal - no node group config
And the same for all other nodes as well.
As I said, it worked with the same configuration for us, but for context. We have cluster autoscaler deployed via helm, currently in chart version 6.2.0, which installs app-version 1.14.6. This should be fairly current. Our AWS Nodegroups are setup using eksctl and running Kubernetes 1.15, they are tagged with k8s.io/cluster-autoscaler/_name_: owned (and k8s.io/cluster-autoscaler/enabled: "true"). An eksctl get nodegroups does still succeed
As values we have setup autoDiscovery.enabled true and .clusterName to the same name as our EKS is named, alongside cloudProvider aws. The logs do not otherwise look problematic or different from what we're used to.
Just verified, we have the same issue in one of our clusters still on Kubernetes 1.14, so its not tied to EKS just adding support for Kubernetes 1.15
What may be relevant, is we have a nodegroup for each AZ, but as I said, it did use to work, and with the same settings as well.
Tried updating to newest Helm Chart 7.1.0 as well as balance-similar-node-groups: true, but problem persists.
Same here with latest chart: EKS 1.15, chart 7.2.0, CA 1.17.1 and with EKS 1.15, chart 7.0.0, CA 1.14.6. It appears there isn't a chart covering CA 1.15.x
it was my bad, I mistakenly add k8s labels instead of EC2 tags to the nodes.
Same here with latest chart: EKS 1.15, chart 7.2.0, CA 1.17.1 and with EKS 1.15, chart 7.0.0, CA 1.14.6. It appears there isn't a chart covering CA 1.15.x
When you combine CA 1.17.1 with EKS 1.15, I don't even get until that issue because the CSINode API group changed: Failed to list *v1.CSINode: the server could not find the requested resource
The documentation states that you need to use the same autoscaler minor version as your Kubernetes, means you need to run cluster autoscaler v1.15.6. However, I still get this issue using the latest chart (7.2.2) with the latest 1.15 autoscaler (1.15.6):
$ helm get values cluster-autoscaler --namespace kube-system
USER-SUPPLIED VALUES:
autoDiscovery:
clusterName: xxx
cloudProvider: aws
image:
repository: eu.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler
tag: v1.15.6
rbac:
create: true
serviceAccountAnnotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ClusterAutoscaler-xxx
I'm also running into this same issue on Amazon EKS (k8s 1.14.9 and auto scaler version 1.14.6):
I0421 16:47:19.432967 1 static_autoscaler.go:138] Starting main loop
I0421 16:47:19.433139 1 utils.go:595] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0421 16:47:19.433151 1 static_autoscaler.go:294] Filtering out schedulables
I0421 16:47:19.433294 1 static_autoscaler.go:311] No schedulable pods
I0421 16:47:19.433311 1 static_autoscaler.go:319] No unschedulable pods
I0421 16:47:19.433322 1 static_autoscaler.go:366] Calculating unneeded nodes
I0421 16:47:19.433335 1 utils.go:543] Skipping ip-172-31-12-66.ec2.internal - no node group config
I0421 16:47:19.433344 1 utils.go:543] Skipping ip-172-31-11-148.ec2.internal - no node group config
I0421 16:47:19.433352 1 utils.go:543] Skipping ip-172-31-1-176.ec2.internal - no node group config
I0421 16:47:19.433360 1 utils.go:543] Skipping ip-172-31-2-26.ec2.internal - no node group config
I0421 16:47:19.433367 1 utils.go:543] Skipping ip-172-31-0-28.ec2.internal - no node group config
I0421 16:47:19.433374 1 utils.go:543] Skipping ip-172-31-12-231.ec2.internal - no node group config
I0421 16:47:19.433547 1 static_autoscaler.go:393] Scale down status: unneededOnly=false lastScaleUpTime=2020-04-21 15:46:09.809181357 +0000 UTC m=+17.819654332 lastScaleDownDeleteTime=2020-04-21 15:46:09.809181439 +0000 UTC m=+17.819654415 lastScaleDownFailTime=2020-04-21 15:46:09.809181535 +0000 UTC m=+17.819654509 scaleDownForbidden=false isDeleteInProgress=false
I0421 16:47:19.433604 1 static_autoscaler.go:403] Starting scale down
I0421 16:47:19.433651 1 scale_down.go:706] No candidates for scale down
I0421 16:47:20.523438 1 reflector.go:370] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:190: Watch close - *v1.Pod total 20 items received
Confirmed that the ASG and nodes themselves have the correct EC2 tags:
k8s.io/cluster-autoscaler/enabled: true
k8s.io/cluster/dev-eks-cluster
From the deployment spec:
spec:
containers:
- command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=kube-system
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/dev-eks-cluster
- --logtostderr=true
- --stderrthreshold=info
- --v=4
- --skip-nodes-with-system-pods=false
- --balance-similar-node-groups
- --skip-nodes-with-local-storage=false
env:
- name: AWS_REGION
value: us-east-1
image: k8s.gcr.io/cluster-autoscaler:v1.14.6
I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.
Looking into this a bit more, it appears that fetchAutoAsgNames in auto_scaling_groups.go may not be returning any values (due to getAutoscalingGroupNamesByTags in auto_scaling.go), which is strange. This would indicate to me that there's either an issue with the AWS API (e.g. IAM permissions), the autoscale tag is somehow misconfigured on the nodes themselves, or the AWS SDK is somehow returning empty results when filtering by tag key.
If this were an IAM issue I would expect to see an error in the logs, which I do not, so I'm a bit perplexed without any deeper way to debug this.
It seems to work fine when I specify the node group (ASG) explicitly, via the --nodes flag
OK I realize the problem... 馃うx100. The ASG tag k8s.io/cluster/dev-eks-cluster actually had a whitespace character that I missed. This seems to be working properly once that was removed.
A minor suggestion would have better error (or warning) logging when getAutoscalingGroupNamesByTags returns no ASGs. When this function returns nil, to me it seems to indicate that somebody provided tags they expect to be on ASGs, but the AWS API could not find them.
Seeing the rather cryptic no node group config message took a bit of digging into the code to realize exactly what was going on.
/assign @Jeffwan
I'm running into the same issue, My autoscaling group has those tags, should my nodes (ec2-instances) also should have the tag?
@skadem07 try assigning them to the nodes, but ensure that the config on the k8s side matches the tags EXACTLY. Ensure the tags on the AWS side don't have any whitespace, etc.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with/reopen.
Mark the issue as fresh with/remove-lifecycle rotten.Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@Jeffwan this is still an issue and I see it was assigned to you.
@PhilThurston what exactly is an issue?
For example this comment in this thread "solved" this for me:
I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.
This cluster autoscaler is just kinda hard to get running right with these kind of shitty error messages
@matti I'm unsure what exactly fixed it for us but I can give you a rundown.
We are using eksctl to deploy the cluster. After setting up a cluster we installed the autoscaler and it would only recognize a single node-group of our three total as something that could be scaled. We would get the exact same errors as the original post here. I couldn't resolve the issue so I took a shot in the dark and recreated each node-group with just different names all other settings/zones/tags were the same. From that point on auto-scaler started scaling all three node-groups instead of just one and the error disappeared.
Keep in mind that the original node-groups were made about 15m before the autoscaler was installed and were unmodified from how eksctl installed them. Somehow just adding fresh node-groups post install resolved the issue but that doesn't make much sense since nothing else besides the node group name changed.
In the case that you use EC2 Auto Scaling groups, you will need to add the following tags (replace example-cluster-name with the name of the cluster):
k8s.io/cluster-autoscaler/example-cluster-name owned Yes
k8s.io/cluster-autoscaler/enabled true Yes

If you use terraform add this to your aws_autoscaling_group config
resource "aws_autoscaling_group" "example-eks-nodes-" {
...
tag {
key = "k8s.io/cluster-autoscaler/example-cluster-name"
value = "owned"
propagate_at_launch = true
}
tag {
key = "k8s.io/cluster-autoscaler/enabled"
value = "true"
propagate_at_launch = true
}
I hope it helps. Regards.
I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.
Is that helm chart specific thing? I don't see that env is mentioned anywhere in docs.
Our ec2 instances have correct tags
k8s.io/cluster-autoscaler/my-cluster: owned
k8s.io/cluster-autoscaler/enabled: true
Still seeing exact same error.
I0316 13:47:31.499205 1 static_autoscaler.go:449] Calculating unneeded nodes
I0316 13:47:31.499215 1 pre_filtering_processor.go:57] Skipping ip-172-27-10-15.ec2.internal - no node group config
I0316 13:47:31.499223 1 pre_filtering_processor.go:57] Skipping ip-172-27-11-201.ec2.internal - no node group config
...
I'm seeing the exact same error as shinebayar-g, and the tags are set correctly, but can't figure out why this happens
Most helpful comment
I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.