Autoscaler: AWS Cluster Autoscaler Permissions

Created on 10 Jun 2017 · 17 Comments · Source: kubernetes/autoscaler

Using v0.5.4 of the aws-cluster-autoscaler, we're getting this error:

E0609 23:20:59.162974       1 static_autoscaler.go:108] Failed to update node registry: Unable to get first autoscaling.Group for node-us-west-2a.dev.clusters.mydomain.io

It sure looks like a permission problem... But per the instructions, I have the following policy on my instance role named nodes.dev.clusters.mydomain.io:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Resource": "*"
        }
    ]
}

Without this addition, I get a different error:

E0609 23:05:48.475214       1 static_autoscaler.go:108] Failed to update node registry: AccessDenied: User: arn:aws:sts::11111111111:assumed-role/nodes.dev.clusters.mydomain.io/i-0472257b3f8d4ec43 is not authorized to perform: autoscaling:DescribeAutoScalingGroups
    status code: 403, request id: 2cf17af0-4d68-11e7-825c-73c99354b20d

So we're thinking that we have the necessary permissions.
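One way to rule out a policy gap before digging further is to check the policy document mechanically. The following is a minimal sketch, not part of the autoscaler: `missing_actions` and `REQUIRED_ACTIONS` are hypothetical names, the required list is simply the four actions from the policy above, and the check deliberately ignores Deny statements, Resource, and Condition.

```python
import fnmatch

# The four actions listed in the policy above (assumed to be the full set needed).
REQUIRED_ACTIONS = [
    "autoscaling:DescribeAutoScalingGroups",
    "autoscaling:DescribeAutoScalingInstances",
    "autoscaling:SetDesiredCapacity",
    "autoscaling:TerminateInstanceInAutoScalingGroup",
]


def missing_actions(policy, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement in `policy` matches.

    Simplification: Deny statements, Resource and Condition are not evaluated.
    """
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    allowed_patterns = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lower case.
        allowed_patterns.extend(a.lower() for a in actions)

    return [
        action
        for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in allowed_patterns)
    ]
```

Loading the policy with `json.load` and passing it to `missing_actions` returns an empty list when all four actions are allowed. An empty result only says the policy text looks complete; it does not prove the role is attached to the instance the pod runs on.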

For reference here's our execution config:

./cluster-autoscaler \
  --cloud-provider=aws \
  --nodes=1:10:node-us-west-2a.dev.clusters.mydomain.io \
  --nodes=1:10:node-us-west-2b.dev.clusters.mydomain.io \
  --nodes=1:10:node-us-west-2c.dev.clusters.mydomain.io \
  --scale-down-delay=10m \
  --skip-nodes-with-local-storage=false \
  --skip-nodes-with-system-pods=true \
  --v=4
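Since the error message names the group taken from a `--nodes` value, it can help to see exactly which name each flag yields. The sketch below assumes the `min:max:name` layout visible in the flags above; `NodeGroupSpec` and `parse_nodes_flag` are hypothetical helpers, not the autoscaler's own parser.

```python
from typing import NamedTuple


class NodeGroupSpec(NamedTuple):
    min_size: int
    max_size: int
    name: str


def parse_nodes_flag(value):
    """Split a --nodes value of the assumed form min:max:name."""
    parts = value.split(":", 2)  # at most 3 parts; the name keeps any later colons
    if len(parts) != 3:
        raise ValueError(f"expected min:max:name, got {value!r}")
    min_size, max_size, name = int(parts[0]), int(parts[1]), parts[2]
    if min_size < 0 or max_size < min_size:
        raise ValueError(f"invalid size bounds in {value!r}")
    if not name:
        raise ValueError(f"empty group name in {value!r}")
    return NodeGroupSpec(min_size, max_size, name)
```

The `name` field is the string that has to match the Auto Scaling group name in AWS character for character, including any trailing whitespace picked up from a templated config.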

Any ideas on what to do?
Is there any strategy for debugging this?

area/provider/aws cluster-autoscaler

pluttrell

👍1

All 17 comments

Judging by the code from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L114 it looks like you've passed an incorrect group name.

zaa on 12 Jun 2017

👍2

@pluttrell Was it a problem with the group name?

mwielgus on 14 Jun 2017

Nope, the group names were identical to what was in AWS.

We do, however, have the aws-cluster-autoscaler working perfectly when we use the Kubernetes resource files directly, without Helm, so we've gone with that option for now.

pluttrell on 16 Jun 2017

Great :). Closing the bug.

mwielgus on 16 Jun 2017

Getting a similar error, with kops 1.7.0, kubernetes 1.7.5, cluster-autoscaler 0.6.1, but only when trying to scale from 0 nodes. According to this, as of CA 0.6.1 I should be able to scale to/from 0. I'm getting errors like this:

E0908 03:18:13.511590       1 static_autoscaler.go:118] Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-west-2.amazonaws.com/: dial tcp: i/o timeout

I'm using a deployment similar to this one, and it works as long as there is at least 1 node up:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - image: gcr.io/google_containers/cluster-autoscaler:v0.6.1
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=0:10:nodes.uswest2.metamoto.net
          env:
            - name: AWS_REGION
              value: us-west-2
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule

7chenko on 8 Sep 2017

Figured this out: the problem was that no kube-dns pod was running on the master node. To run one there, I had to add the master toleration to the kube-dns deployment (same as in the cluster-autoscaler deployment above). Once kube-dns was running on the master, the autoscaler was able to use it to get ASG info from AWS and scale up from 0 nodes.

7chenko on 8 Sep 2017

👍1

Curious: does cluster-autoscaler depend on the in-cluster DNS service? Probably not?

Instead of putting kube-dns on master, what about setting dnsPolicy: Default for cluster-autoscaler so that the name resolution does not go through kube-dns?

Using dnsPolicy: ClusterFirst on pods that run on master node might not work unless kube-proxy pod also runs on master (for Service VIP -> backend Pods routing), which isn't always true (e.g. in GCE kube-up it doesn't).

MrHohn on 1 Nov 2017

@MrHohn @7chenko @StevenACoffman I have tried both running cluster-autoscaler and kube-dns on the master, and using dnsPolicy: Default for cluster-autoscaler.

I'm still getting this error:

Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp .*..:443: i/o timeout

Please suggest

shiv9012 on 14 Mar 2018

Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp ...*:443: i/o timeout

This looks like a routing or firewall issue rather than a DNS problem.

MrHohn on 9 Apr 2018

I'm getting the error from the original post: Failed to update node registry: Unable to get first autoscaling.Group nodes.public-prod.k8s.local

What steps can I take to debug and fix this?

srossross-tableau on 9 May 2018

I think that I have the correct AWS permissions to describe the autoscaling groups.

If I exec into the cluster-autoscaler pod and install the AWS CLI, I can run:

aws --region us-west-2 autoscaling describe-auto-scaling-groups | grep nodes
            "AutoScalingGroupARN": "arn:aws:autoscaling:us-west-2:***:autoScalingGroup:****:autoScalingGroupName/nodes.public-prod.k8s.local",

srossross-tableau on 9 May 2018
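The `grep` above can be made stricter by comparing exact names against the JSON that the CLI prints. This is a minimal sketch: `unmatched_groups` is a hypothetical helper, and it relies only on the `AutoScalingGroups` list and `AutoScalingGroupName` field of the `describe-auto-scaling-groups` output.

```python
def unmatched_groups(describe_output, configured_names):
    """Return the configured names absent from a describe-auto-scaling-groups result.

    `describe_output` is the CLI's JSON output parsed into a dict;
    `configured_names` are the group names passed via --nodes.
    """
    found = {
        group["AutoScalingGroupName"]
        for group in describe_output.get("AutoScalingGroups", [])
    }
    # Exact string comparison: a substring match (as with grep) is not enough.
    return [name for name in configured_names if name not in found]
```

For example, save the output of `aws --region us-west-2 autoscaling describe-auto-scaling-groups` to a file, load it with `json.load`, and pass it in along with the names from the `--nodes` flags. The result depends on the region the CLI call used, so a group that is found with an explicit `--region` may still be invisible to a client that is querying a different region.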

From a brief look at the code, it seems that AWS returns no groups with this name. Based on the error message, the method is called with the correct group name.

I'm unable to replicate or debug it, but if you get different results for requests made by the Go library and by the command line tool, the maintainers of those tools may be better able to help.

aleksandra-malinowska on 10 May 2018

@srossross-tableau can you confirm that the original request includes the region, as your aws call from inside the container does?

You might need to make sure your env is set correctly.

env:
- name: AWS_REGION
  value: us-west-2
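A quick way to catch this in a manifest before deploying is to scan the pod spec for the variable. The sketch below is illustrative only: `containers_missing_region` is a hypothetical helper, and `pod_spec` is assumed to be the Deployment's `spec.template.spec` mapping, already loaded into a dict by whatever YAML parser is at hand.

```python
def containers_missing_region(pod_spec):
    """Return the names of containers that do not set a non-empty AWS_REGION."""
    missing = []
    for container in pod_spec.get("containers", []):
        has_region = any(
            entry.get("name") == "AWS_REGION" and entry.get("value")
            for entry in container.get("env") or []  # "env" may be absent or null
        )
        if not has_region:
            missing.append(container.get("name", "<unnamed>"))
    return missing
```

This only looks at literal `value` entries; a region supplied through `valueFrom` or baked into the image would be reported as missing.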

christopherhein on 16 May 2018

👍13 🎉5 🚀2

Thanks @christopherhein that was the issue.

srossross-tableau on 16 May 2018

👍1

Re @MrHohn's suggestion above to set dnsPolicy: Default for cluster-autoscaler instead of putting kube-dns on the master:

I tested this and feel this is the best approach. It keeps you from having to modify the kube-dns deployment while keeping your masters clean. Thanks!!

dthomason on 31 Dec 2018

👍2

Setting dnsPolicy: Default worked for me too on EKS 1.13

gazal-k on 21 Oct 2019

👍1

I met the same error on EKS 1.13, you helped me a lot, Thank you very much @gazal-k