We have a fairly locked-down network in AWS (no internet connectivity at all). AWS services are reachable only via VPC endpoints, and on-premises systems only via Direct Connect. At the same time we would like to use IAM roles.
Kubernetes: 1.14
Cluster-Autoscaler: 1.14.7
In this setup cluster-autoscaler cannot fetch credentials via STS using the IAM role. To my understanding, the issue is caused by cluster-autoscaler not using the regional STS endpoint (https://sts.eu-central-1.amazonaws.com) but the global one (https://sts.amazonaws.com) instead. With VPC endpoints it is not possible to replace the global endpoint.
github.com/aws/aws-sdk-go v1.25.18
(see https://github.com/aws/aws-sdk-go/blob/master/CHANGELOG.md) introduced configuration of regional STS endpoints via the env AWS_STS_REGIONAL_ENDPOINTS=regional
I tried setting both AWS_REGION and AWS_STS_REGIONAL_ENDPOINTS as env, but the global endpoint is still used.
After looking at the 1.14 branch for cluster-autoscaler, it looks like v1.23.22 is used (see
https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-release-1.14/cluster-autoscaler/vendor/github.com/aws/aws-sdk-go/CHANGELOG.md)
I also checked the other cluster-autoscaler branches:
So I would assume that supporting such a use case would be possible by upgrading the aws-sdk-go version to >= v1.25.18 - let me know if I can be of help.
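To make the version reasoning above concrete, here is a minimal sketch in Go that checks whether a given aws-sdk-go tag is at least v1.25.18, the release the changelog names for the regional STS setting. The helpers `parseVersion` and `supportsRegionalSTS` are hypothetical names made up for this illustration; they are not part of aws-sdk-go or cluster-autoscaler.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a tag such as "v1.25.18" into its three numeric parts.
// Hypothetical helper for illustration only.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version format: %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in %q: %v", p, v, err)
		}
		out[i] = n
	}
	return out, nil
}

// supportsRegionalSTS reports whether the given aws-sdk-go version is
// >= v1.25.18, the release that added AWS_STS_REGIONAL_ENDPOINTS.
func supportsRegionalSTS(version string) (bool, error) {
	minimum := [3]int{1, 25, 18}
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	// Compare major, minor, patch in order; the first difference decides.
	for i := range got {
		if got[i] != minimum[i] {
			return got[i] > minimum[i], nil
		}
	}
	return true, nil
}

func main() {
	// v1.23.22 is the version found in the 1.14 branch's vendored changelog.
	for _, v := range []string{"v1.23.22", "v1.25.18"} {
		ok, err := supportsRegionalSTS(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s supports the regional STS setting: %t\n", v, ok)
	}
}
```

With the version vendored in the 1.14 branch the check comes out false, which would match the env variable being ignored there.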
logs:
E0213 16:05:54.390164 1 aws_manager.go:259] Failed to regenerate ASG cache: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post https://sts.amazonaws.com/: dial tcp 54.239.29.25:443: i/o timeout
F0213 16:05:54.390200 1 aws_cloud_provider.go:330] Failed to create AWS Manager: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: RequestError: send request failed
caused by: Post https://sts.amazonaws.com/: dial tcp 54.239.29.25:443: i/o timeout
Attached you can find the kubernetes deployment yaml.
deployment.txt
This was updated in master 3 months ago, but there has been no release since. Any idea when a new release will be cut?
https://github.com/kubernetes/autoscaler/commit/af6f3258d6a1ebf7bf939ad2fc65de4cf8e2a9cb
/assign
Thanks. I can help resolve this issue and request a newer release. I think what we can do is bump the SDK version; users can then set the env AWS_STS_REGIONAL_ENDPOINTS=regional, and the SDK client will pick up the env and resolve the right endpoint. Is that correct?
@Jeffwan That is correct. It's resolved in the version that is already in master.
- github.com/aws/aws-sdk-go v1.23.18
+ github.com/aws/aws-sdk-go v1.28.2
See go.mod in master here.
https://github.com/kubernetes/autoscaler/commit/af6f3258d6a1ebf7bf939ad2fc65de4cf8e2a9cb#diff-5b1211f36242f6afe85bdb0062369dc3R16
The upstream fix was here https://github.com/aws/aws-sdk-go/pull/2779/files
Resolved in aws-sdk-go "Release v1.25.18 (2019-10-23)"
https://github.com/aws/aws-sdk-go/blob/master/CHANGELOG.md#sdk-enhancements-13
Fixes #2532
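For readers following along, here is a simplified sketch in Go of the behaviour discussed above: with AWS_STS_REGIONAL_ENDPOINTS=regional and a region set, requests go to the regional STS host, otherwise to the global one. This is only a model of the outcome, not the SDK's actual resolver code. The function `stsEndpoint` is a hypothetical name, and the sketch assumes the standard `amazonaws.com` host pattern from the URLs quoted in this issue (the real SDK also handles other partitions and regions that always use regional hosts).

```go
package main

import (
	"fmt"
	"os"
)

// globalSTS is the global endpoint seen in the timeout logs of this issue.
const globalSTS = "https://sts.amazonaws.com"

// stsEndpoint is a hypothetical, simplified model of the endpoint choice:
// "regional" mode plus a non-empty region yields the regional host,
// anything else falls back to the global host.
func stsEndpoint(mode, region string) string {
	if mode == "regional" && region != "" {
		return "https://sts." + region + ".amazonaws.com"
	}
	return globalSTS
}

func main() {
	// These are the two env variables mentioned in the issue.
	mode := os.Getenv("AWS_STS_REGIONAL_ENDPOINTS")
	region := os.Getenv("AWS_REGION")
	fmt.Println("STS endpoint that would be used:", stsEndpoint(mode, region))
}
```

Running this inside the pod, with the same env as the cluster-autoscaler container, is a quick way to sanity-check that both variables are actually set. Whether the SDK honours them still depends on the vendored aws-sdk-go version.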
@ajohnstone Thanks. I plan to do a few cherry-picks soon; I will make the change and include it in the new release.
I just made the change on 1.15. I will make the changes for the rest of the versions.
Hmm, sorry, so far only the following versions support this case:
https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.15.6
https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.18.1
The 1.14, 1.16 and 1.17 changes are not included in this release. The changes will be merged, and in the short term you can build an image yourself. If you need any help, let me know.
We are just migrating to 1.15, thanks a lot. If I can find time I will extend the AWS documentation.
Successfully tested with 1.15.6, thanks again.
I opened https://github.com/kubernetes/autoscaler/pull/3052
From my side this could be closed, not sure if you want to keep it open until the other versions support it.
I will leave it open to track changes in other branches. @maust Thanks for the contribution. I will review the doc change
Hi @Jeffwan, any plan to fix this in the other versions (1.16.x -> 1.17.x)?
thanks.
The 1.17.x change has been merged. I got some feedback on 1.16.x and will address it before the next release. @haofeif
I also updated the PR to address the feedback on the 1.16 change. Once it gets merged, the next release will pick it up. https://github.com/kubernetes/autoscaler/pull/3003
Hello, is there any info on when this will be released in a 1.16.x or 1.17.x release? Thanks
1.16.6 and 1.17.3 have been released. Please download the latest version. I will close the issue. Thanks everyone for all your feedback.
/close
@Jeffwan: Closing this issue.