Autoscaler: Autoscaler doesn't recognize instances as part of a node group

Created on 16 Mar 2020 · 22Comments · Source: kubernetes/autoscaler

We are using the cluster autoscaler on AWS. It worked once, but right now, it doesn't seem to recognize our nodes as part of any node group and is skipping them. Logs look like this:

1 static_autoscaler.go:366] Calculating unneeded nodes
1 utils.go:543] Skipping ip-10-0-1-104.eu-central-1.compute.internal - no node group config

And the same for all other nodes as well.
As I said, it worked with the same configuration for us, but for context. We have cluster autoscaler deployed via helm, currently in chart version 6.2.0, which installs app-version 1.14.6. This should be fairly current. Our AWS Nodegroups are setup using eksctl and running Kubernetes 1.15, they are tagged with k8s.io/cluster-autoscaler/_name_: owned (and k8s.io/cluster-autoscaler/enabled: "true"). An eksctl get nodegroups does still succeed
As values we have setup autoDiscovery.enabled true and .clusterName to the same name as our EKS is named, alongside cloudProvider aws. The logs do not otherwise look problematic or different from what we're used to.

lifecyclrotten

Source

autarchprinceps

👍6

Most helpful comment

I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.

hendrikhalkow on 21 Apr 2020

👍5

All 22 comments

Just verified, we have the same issue in one of our clusters still on Kubernetes 1.14, so its not tied to EKS just adding support for Kubernetes 1.15

autarchprinceps on 16 Mar 2020

What may be relevant, is we have a nodegroup for each AZ, but as I said, it did use to work, and with the same settings as well.

autarchprinceps on 16 Mar 2020

Tried updating to newest Helm Chart 7.1.0 as well as balance-similar-node-groups: true, but problem persists.

autarchprinceps on 16 Mar 2020

Same here with latest chart: EKS 1.15, chart 7.2.0, CA 1.17.1 and with EKS 1.15, chart 7.0.0, CA 1.14.6. It appears there isn't a chart covering CA 1.15.x

JoaquinScript on 2 Apr 2020

it was my bad, I mistakenly add k8s labels instead of EC2 tags to the nodes.

JoaquinScript on 3 Apr 2020

Same here with latest chart: EKS 1.15, chart 7.2.0, CA 1.17.1 and with EKS 1.15, chart 7.0.0, CA 1.14.6. It appears there isn't a chart covering CA 1.15.x

When you combine CA 1.17.1 with EKS 1.15, I don't even get until that issue because the CSINode API group changed: Failed to list *v1.CSINode: the server could not find the requested resource

The documentation states that you need to use the same autoscaler minor version as your Kubernetes, means you need to run cluster autoscaler v1.15.6. However, I still get this issue using the latest chart (7.2.2) with the latest 1.15 autoscaler (1.15.6):

$ helm get values cluster-autoscaler --namespace kube-system               
USER-SUPPLIED VALUES:
autoDiscovery:
  clusterName: xxx
cloudProvider: aws
image:
  repository: eu.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler
  tag: v1.15.6
rbac:
  create: true
  serviceAccountAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/ClusterAutoscaler-xxx

hendrikhalkow on 16 Apr 2020

👍1

I'm also running into this same issue on Amazon EKS (k8s 1.14.9 and auto scaler version 1.14.6):

I0421 16:47:19.432967       1 static_autoscaler.go:138] Starting main loop
I0421 16:47:19.433139       1 utils.go:595] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0421 16:47:19.433151       1 static_autoscaler.go:294] Filtering out schedulables
I0421 16:47:19.433294       1 static_autoscaler.go:311] No schedulable pods
I0421 16:47:19.433311       1 static_autoscaler.go:319] No unschedulable pods
I0421 16:47:19.433322       1 static_autoscaler.go:366] Calculating unneeded nodes
I0421 16:47:19.433335       1 utils.go:543] Skipping ip-172-31-12-66.ec2.internal - no node group config
I0421 16:47:19.433344       1 utils.go:543] Skipping ip-172-31-11-148.ec2.internal - no node group config
I0421 16:47:19.433352       1 utils.go:543] Skipping ip-172-31-1-176.ec2.internal - no node group config
I0421 16:47:19.433360       1 utils.go:543] Skipping ip-172-31-2-26.ec2.internal - no node group config
I0421 16:47:19.433367       1 utils.go:543] Skipping ip-172-31-0-28.ec2.internal - no node group config
I0421 16:47:19.433374       1 utils.go:543] Skipping ip-172-31-12-231.ec2.internal - no node group config
I0421 16:47:19.433547       1 static_autoscaler.go:393] Scale down status: unneededOnly=false lastScaleUpTime=2020-04-21 15:46:09.809181357 +0000 UTC m=+17.819654332 lastScaleDownDeleteTime=2020-04-21 15:46:09.809181439 +0000 UTC m=+17.819654415 lastScaleDownFailTime=2020-04-21 15:46:09.809181535 +0000 UTC m=+17.819654509 scaleDownForbidden=false isDeleteInProgress=false
I0421 16:47:19.433604       1 static_autoscaler.go:403] Starting scale down
I0421 16:47:19.433651       1 scale_down.go:706] No candidates for scale down
I0421 16:47:20.523438       1 reflector.go:370] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:190: Watch close - *v1.Pod total 20 items received

Confirmed that the ASG and nodes themselves have the correct EC2 tags:

k8s.io/cluster-autoscaler/enabled: true
k8s.io/cluster/dev-eks-cluster

From the deployment spec:

spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --cloud-provider=aws
    - --namespace=kube-system
    - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/dev-eks-cluster
    - --logtostderr=true
    - --stderrthreshold=info
    - --v=4
    - --skip-nodes-with-system-pods=false
    - --balance-similar-node-groups
    - --skip-nodes-with-local-storage=false
    env:
    - name: AWS_REGION
      value: us-east-1
    image: k8s.gcr.io/cluster-autoscaler:v1.14.6

ccampo133 on 21 Apr 2020

I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.

hendrikhalkow on 21 Apr 2020

👍5

Looking into this a bit more, it appears that fetchAutoAsgNames in auto_scaling_groups.go may not be returning any values (due to getAutoscalingGroupNamesByTags in auto_scaling.go), which is strange. This would indicate to me that there's either an issue with the AWS API (e.g. IAM permissions), the autoscale tag is somehow misconfigured on the nodes themselves, or the AWS SDK is somehow returning empty results when filtering by tag key.

~~If this were an IAM issue I would expect to see an error in the logs, which I do not, so I'm a bit perplexed without any deeper way to debug this.~~

~~It seems to work fine when I specify the node group (ASG) explicitly, via the --nodes flag~~

OK I realize the problem... 🤦x100. The ASG tag k8s.io/cluster/dev-eks-cluster actually had a whitespace character that I missed. This seems to be working properly once that was removed.

A minor suggestion would have better error (or warning) logging when getAutoscalingGroupNamesByTags returns no ASGs. When this function returns nil, to me it seems to indicate that somebody provided tags they expect to be on ASGs, but the AWS API could not find them.

Seeing the rather cryptic no node group config message took a bit of digging into the code to realize exactly what was going on.

ccampo133 on 21 Apr 2020

/assign @Jeffwan

Jeffwan on 30 Apr 2020

I'm running into the same issue, My autoscaling group has those tags, should my nodes (ec2-instances) also should have the tag?

skadem07 on 18 May 2020

@skadem07 try assigning them to the nodes, but ensure that the config on the k8s side matches the tags EXACTLY. Ensure the tags on the AWS side don't have any whitespace, etc.

ccampo133 on 18 May 2020

👍1

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 16 Aug 2020

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot on 15 Sep 2020

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

fejta-bot on 15 Oct 2020

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot on 15 Oct 2020

@Jeffwan this is still an issue and I see it was assigned to you.

PhilThurston on 12 Dec 2020

@PhilThurston what exactly is an issue?

For example this comment in this thread "solved" this for me:

I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.

This cluster autoscaler is just kinda hard to get running right with these kind of shitty error messages

matti on 4 Jan 2021

@matti I'm unsure what exactly fixed it for us but I can give you a rundown.

We are using eksctl to deploy the cluster. After setting up a cluster we installed the autoscaler and it would only recognize a single node-group of our three total as something that could be scaled. We would get the exact same errors as the original post here. I couldn't resolve the issue so I took a shot in the dark and recreated each node-group with just different names all other settings/zones/tags were the same. From that point on auto-scaler started scaling all three node-groups instead of just one and the error disappeared.

Keep in mind that the original node-groups were made about 15m before the autoscaler was installed and were unmodified from how eksctl installed them. Somehow just adding fresh node-groups post install resolved the issue but that doesn't make much sense since nothing else besides the node group name changed.

PhilThurston on 4 Jan 2021

In the case that you use EC2 Auto Scaling groups, you will need to add the following tags (replace example-cluster-name with the name of the cluster):

k8s.io/cluster-autoscaler/example-cluster-name owned Yes
k8s.io/cluster-autoscaler/enabled true Yes

If you use terraform add this to your aws_autoscaling_group config

resource "aws_autoscaling_group" "example-eks-nodes-" {
...
tag {
key = "k8s.io/cluster-autoscaler/example-cluster-name"
value = "owned"
propagate_at_launch = true
}

tag {
key = "k8s.io/cluster-autoscaler/enabled"
value = "true"
propagate_at_launch = true
}

I hope it helps. Regards.

franklinoa on 5 Jan 2021

I just solved it for me: I forgot to set the AWS region. With the awsRegion parameter in my Helm values the combination above works fine.

Is that helm chart specific thing? I don't see that env is mentioned anywhere in docs.

Our ec2 instances have correct tags

k8s.io/cluster-autoscaler/my-cluster: owned
k8s.io/cluster-autoscaler/enabled: true

Still seeing exact same error.

I0316 13:47:31.499205       1 static_autoscaler.go:449] Calculating unneeded nodes
I0316 13:47:31.499215       1 pre_filtering_processor.go:57] Skipping ip-172-27-10-15.ec2.internal - no node group config
I0316 13:47:31.499223       1 pre_filtering_processor.go:57] Skipping ip-172-27-11-201.ec2.internal - no node group config
...