Using v0.5.4 of the aws-cluster-autoscaler, we're getting this error:
E0609 23:20:59.162974 1 static_autoscaler.go:108] Failed to update node registry: Unable to get first autoscaling.Group for node-us-west-2a.dev.clusters.mydomain.io
It sure looks like a permission problem... But per the instructions, I have the following policy on my instance role named nodes.dev.clusters.mydomain.io:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*"
    }
  ]
}
Without this addition, I get a different error:
E0609 23:05:48.475214 1 static_autoscaler.go:108] Failed to update node registry: AccessDenied: User: arn:aws:sts::11111111111:assumed-role/nodes.dev.clusters.mydomain.io/i-0472257b3f8d4ec43 is not authorized to perform: autoscaling:DescribeAutoScalingGroups
status code: 403, request id: 2cf17af0-4d68-11e7-825c-73c99354b20d
So we're thinking that we have the necessary permissions.
For reference here's our execution config:
./cluster-autoscaler \
  --cloud-provider=aws \
  --nodes=1:10:node-us-west-2a.dev.clusters.mydomain.io \
  --nodes=1:10:node-us-west-2b.dev.clusters.mydomain.io \
  --nodes=1:10:node-us-west-2c.dev.clusters.mydomain.io \
  --scale-down-delay=10m \
  --skip-nodes-with-local-storage=false \
  --skip-nodes-with-system-pods=true \
  --v=4
Any ideas on what to do?
Is there any strategy for debugging this?
Judging by the code from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L114 it looks like you've passed an incorrect group name.
@pluttrell Was it a problem with the group name?
Nope, the group names were identical to what was in AWS.
We do, however, have the aws-cluster-autoscaler working perfectly when we use the Kubernetes resource files directly, without helm, so we've gone with that option for now.
Great :). Closing the bug.
Getting a similar error, with kops 1.7.0, kubernetes 1.7.5, cluster-autoscaler 0.6.1, but only when trying to scale from 0 nodes. According to this, as of CA 0.6.1 I should be able to scale to/from 0. I'm getting errors like this:
E0908 03:18:13.511590 1 static_autoscaler.go:118] Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-west-2.amazonaws.com/: dial tcp: i/o timeout
I'm using a deployment similar to the one below, and it works as long as at least 1 node is up:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - image: gcr.io/google_containers/cluster-autoscaler:v0.6.1
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=0:10:nodes.uswest2.metamoto.net
          env:
            - name: AWS_REGION
              value: us-west-2
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule
Figured this out: the problem was that no kube-dns pod was running on the master node. To get one running there, I had to add the master toleration to the kube-dns deployment (the same toleration as in the cluster-autoscaler deployment above). Once kube-dns was running on the master, the autoscaler was able to use it to get ASG info from AWS and scale up from 0 nodes.
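A minimal sketch of that kube-dns change, for anyone following along. It assumes the kube-dns Deployment is named kube-dns in kube-system and that the master carries the same node-role.kubernetes.io/master taint tolerated by the cluster-autoscaler deployment above; only the added fields are shown, and the rest of the kube-dns spec stays as deployed.

```yaml
# Fragment of the kube-dns Deployment (assumed name/namespace: kube-dns in kube-system).
# Only the toleration is new; it is copied from the cluster-autoscaler deployment above.
spec:
  template:
    spec:
      tolerations:
        - key: "node-role.kubernetes.io/master"
          effect: NoSchedule
```

Note that a toleration only permits scheduling on the master, it does not force it. With zero worker nodes, the master is simply the only node left for the pod to land on.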
Curious: does cluster-autoscaler depend on the in-cluster DNS service? Probably not?
Instead of putting kube-dns on master, what about setting dnsPolicy: Default for cluster-autoscaler so that the name resolution does not go through kube-dns?
Using dnsPolicy: ClusterFirst on pods that run on the master node might not work unless a kube-proxy pod also runs on master (for Service VIP -> backend Pods routing), which isn't always true (e.g. in GCE kube-up it doesn't).
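A minimal sketch of that suggestion, applied to the cluster-autoscaler deployment posted earlier in this thread. The only new line is dnsPolicy; the container name and image are repeated from that deployment just to show where the field sits, and everything else is assumed unchanged.

```yaml
# Fragment of the cluster-autoscaler Deployment; only dnsPolicy is added.
spec:
  template:
    spec:
      # "Default" makes the pod inherit the name resolution configuration of
      # the node it runs on, so lookups of AWS endpoints do not go through kube-dns.
      dnsPolicy: Default
      containers:
        - name: cluster-autoscaler
          image: gcr.io/google_containers/cluster-autoscaler:v0.6.1
```

When dnsPolicy is not set at all, Kubernetes uses ClusterFirst, which is why the deployment above depends on kube-dns unless this field is added.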
@MrHohn @7chenko @StevenACoffman I have tried dnsPolicy: Default for cluster-autoscaler, but I'm still getting this error:
Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp .*..:443: i/o timeout
Please advise.
Failed to update node registry: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp ...*:443: i/o timeout
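This looks like a routing or firewall issue instead. The error now includes a resolved address and port (:443), so the name lookup seems to have succeeded and it is the TCP connection itself that times out.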
I'm getting the original error posted above: Failed to update node registry: Unable to get first autoscaling.Group nodes.public-prod.k8s.local
What steps can I take to debug and fix this?
I think I have the correct AWS permissions to describe the autoscaling groups.
If I exec into the cluster-autoscaler pod and install the AWS CLI, I can run:
aws --region us-west-2 autoscaling describe-auto-scaling-groups | grep nodes
"AutoScalingGroupARN": "arn:aws:autoscaling:us-west-2:***:autoScalingGroup:****:autoScalingGroupName/nodes.public-prod.k8s.local",
Briefly looking at the code, it seems that AWS returns no groups with this name. Based on the error message, the method is called with the correct group name.
I'm unable to replicate or debug it, but if you get different results for requests made by the Go library and by the command-line tool, the maintainers of those tools may be better able to help.
@srossross-tableau can you confirm that the original request includes the region, as in the aws call you ran from inside the container?
You might need to make sure your env is set correctly.
env:
  - name: AWS_REGION
    value: us-west-2
Thanks @christopherhein that was the issue.
Curious: does cluster-autoscaler depend on the in-cluster DNS service? Probably not?
Instead of putting kube-dns on master, what about setting dnsPolicy: Default for cluster-autoscaler so that the name resolution does not go through kube-dns?
Using dnsPolicy: ClusterFirst on pods that run on the master node might not work unless a kube-proxy pod also runs on master (for Service VIP -> backend Pods routing), which isn't always true (e.g. in GCE kube-up it doesn't).
I tested this and feel this is the best approach. It keeps you from having to modify the kube-dns deployment while keeping your masters clean. Thanks!!
Setting dnsPolicy: Default worked for me too on EKS 1.13
I hit the same error on EKS 1.13. You helped me a lot, thank you very much @gazal-k