I use TF on AWS with tokens for MFA. This usually works well. Recently, I have run into many token forbidden/403 errors. After vetting all of my MFA bits and pieces, I eventually figured out I could work around the issue with -target. Fast forward a little more, and I think I have an idea of what's going on.
If I use -target, I limit the number of resources Terraform works with. If I do not use -target, TF works with all of them. This environment had been relatively reasonable in size, but I then imported DNS and IAM into it, and I think that is when the issues started. I am not sure how many resources are in this environment, but if I run terraform plan and let TF check in with AWS for each resource, TF will eventually error out with:
```
Error refreshing state: 1 error(s) occurred:

* 1 error(s) occurred:

* InvalidClientTokenId: The security token included in the request is invalid.
	status code: 403, request id: ...
```
In other words, the token is valid and works to refresh many, many resources, but at some point it is too much and the token stops being accepted.. or is it just that the request is denied (403)? I think this is an issue with rate limiting, but I am not sure how to confirm.
If I use `tfp | grep "Refreshing state..." | wc -l`, I can estimate the number of resources that go through before the requests are denied, and I consistently get 167. I got 168 once, but 167 all other times (in 6 tests).
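The same counting trick can be run against a saved log instead of a live pipe; a minimal sketch, where `plan.log` and its contents are illustrative stand-ins for real `terraform plan` output:

```shell
# Count "Refreshing state" lines in a saved plan log to estimate how
# many resources refreshed before the error. The log below is a
# fabricated stand-in for real `terraform plan` output.
cat > plan.log <<'EOF'
aws_instance.web: Refreshing state... (ID: i-abc123)
aws_security_group.main: Refreshing state... (ID: sg-def456)
Error refreshing state: 1 error(s) occurred:
EOF
grep -c "Refreshing state" plan.log   # → 2
```

Saving the log also keeps the output around for later inspection, instead of losing it to the pipe.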
I have seen this issue with 0.6.11 and 0.6.12, and I just now confirmed 0.6.13 as well.
Please let me know if there is any other info I might be able to provide that would be helpful in resolving this issue.
Hi @ketzacoatl - in .13 and later the raw AWS API request/response details are dumped into the debug log, so if you reproduce your issue with a debug log you should be able to search this log for the exact response from the AWS API which might help shed some light on this. Once you have the raw API behavior that might give you enough information to go to AWS Support with questions about what to do.
On the Terraform side, we do expose -parallelism as a flag that you can reduce, which will make Terraform hit the API with less concurrency.
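As a rough analogy (this is not Terraform internals, just an illustration of capping concurrency), `xargs -P` does for a shell pipeline what `-parallelism` does for Terraform's worker pool:

```shell
# Illustrative analogy only: -P2 caps concurrent jobs at two, the way
# -parallelism caps Terraform's concurrent resource operations.
# `sort` makes the interleaved output deterministic for display.
printf 'a\nb\nc\nd\n' | xargs -P2 -I{} echo 'Refreshing {}' | sort
```

Fewer concurrent operations means fewer simultaneous AWS API requests in flight at any moment.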
OK, I am running 0.6.13 now, so I ought to be able to retrieve details from that debug log as you describe.
Your note about -parallelism is a little short, but I see `-parallelism=n  Limit the number of concurrent operations. Defaults to 10.` in the docs, so that makes sense. Thanks for the reply.
Using -parallelism=5, I was able to get further (through 186 and 188 resources, in two tests), but that still errors out with the 403 forbidden/bad token.
I can see the debug log (I enabled trace with `export TF_LOG=TRACE`), but it does not tell us any more about the error (at least not that I can see). I can see a whole bunch of HTTP 200 responses with full XML for the resources, then AWS returns a 403 for the response, and TF errors out with `..... [DEBUG] vertex root* InvalidClientTokenId: The security token included in the request is invalid.`
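For anyone digging through a similar capture: one way to pull out the lines just before the 403 so the offending request can be identified. A sketch, where the log content is a fabricated stand-in for a real `TF_LOG=TRACE` capture:

```shell
# Search a saved trace log for the 403 and show the line before it,
# to see which request/response preceded the failure. The log content
# here is fabricated for illustration.
cat > tf-trace.log <<'EOF'
[DEBUG] aws-sdk: HTTP 200 response with resource XML
[DEBUG] vertex root* InvalidClientTokenId: The security token included in the request is invalid.
	status code: 403, request id: example
EOF
grep -n -B1 'status code: 403' tf-trace.log
```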
@ketzacoatl okay cool - sounds like that's the data to take to AWS support to figure out why they 403 after a certain number of requests. AWS has plenty of request limit response messages that are handled gracefully in the SDK with backoff/retry logic - the fact that they 403 here feels like an API bug.
OK, I will hit up AWS support for an explanation.
I've gotten zero help from AWS, and I'm completely stuck. I am also wondering: if there is backoff/retry logic in the SDK, couldn't this also be a bug in that logic? I don't see why we wouldn't want to confirm the SDK is not part of this. Should I file a bug with the Go SDK?
At least for Terraform, I believe my only workable option with this env is to edit the TF state file and manually shuffle the env to refactor it into multiple. Maybe you have a better option to recommend?
> I am also wondering.. if there is backoff/retry logic in the SDK, couldn't this also be a bug in that logic?
Yep, there is; based on my review of the code in that area, it doesn't look like it will retry 403s. It does feel like an API bug that you're getting back 403s and not any of the throttle/rate-limit specific errors they clearly support.
> At least for Terraform, I believe my only workable option with this env is to edit the TF state file and manually shuffle the env to refactor it into multiple. Maybe you have a better option to recommend?
Splitting up your TF envs so fewer API calls are made would definitely help. Have you tried switching authentication mechanisms temporarily to see if you get different behavior? Perhaps this is strictly an MFA token related issue and you can work around it by avoiding that auth type?
OK, that is a good point, I will disable MFA and see if that sheds any light on the situation.
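When testing without MFA, it may also be worth making sure no stale session token is still exported in the shell; a minimal sketch using the standard AWS SDK environment variable names:

```shell
# Clear any lingering session-token env vars so only the static
# access/secret key pair can be picked up by the provider.
# AWS_SESSION_TOKEN and AWS_SECURITY_TOKEN are the standard names.
unset AWS_SESSION_TOKEN AWS_SECURITY_TOKEN
# List which AWS_* variables remain in the environment, if any.
env | grep '^AWS_' | cut -d= -f1
```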
Nope, that yielded the same error at roughly the same place as before (after 173 resources):

```
$ tfp | grep "Refreshing state" | wc -l
Error refreshing state: 1 error(s) occurred:

* 1 error(s) occurred:

* InvalidClientTokenId: The security token included in the request is invalid.
	status code: 403, request id: 6e724da7-f533-11e5-9fe0-e36f5e3d3ee0
173
```
Splitting the env up would be cumbersome, but it seems like the best workaround ATM.
Hm that's very odd! And this is using an AWS Access Key ID / Secret Access Key? I've never seen those yield 403 after a certain number of uses before.
I have spent some time splitting DNS, IAM, and S3 off from the core of the environment, bringing the core down to a size on par with other environments (WRT number of resources), and I still see the error. So this is not likely a rate-limiting issue, but something else. I have been looking at Terraform's trace log in an effort to determine which resource is failing; I suspect it is the same resource each time, but I'm not sure how to confirm. This issue seems related more to this environment than to some general bug or issue with AWS' APIs. AWS support has not been helpful, the TF logs do not provide enough info, and I am otherwise stuck here, unsure how to resolve this blockage and go back to happily using TF with this environment. Please help :(
@phinze, I re-visited my test disabling MFA - I created a new user and was able to confirm that TF with an IAM user, no MFA, gets through all resources and succeeds with no error. It is when I have MFA enabled that I see the 403.
To review: plans error out with a 403 when I use MFA, but succeed with the IAM user and no MFA; they also succeed when I use -target to limit the operation, but fail when I do not.

Using the IAM user without MFA, I attempted to revert back to the single environment with all resources together in one state file. This seemed easy enough, but somewhere along the way an even stranger issue has come up. When I go to run tf plan, TF asks me for the region variable, which I do not expect; everything is configured and in place, so TF should not be asking me for the region. When I provide the region, TF does not ask for any other variables, but goes on to refresh the resources. The refresh eventually completes, but errors out before exiting with:
```
Error refreshing state: 1 error(s) occurred:

* 1 error(s) occurred:

* No valid credential sources found for AWS Provider.
  Please see https://terraform.io/docs/providers/aws/index.html for more information on
  providing credentials for the AWS Provider
```
I auth with AWS with access_key and secret_key, no token. When I look in IAM, I see "last access" details that confirm the creds work, and that the refresh is working against AWS.
When I run tf plan with debug logging enabled, I get:

```
2016/04/10 23:43:55 [DEBUG] vertex provider.module.cleaders.module.cleader-sg.aws (close), waiting for: module.cleaders.module.cleader-sg.provider.aws
2016/04/10 23:43:55 [DEBUG] vertex module.cleaders.module.cleader-sg.aws_security_group.main (orphan), waiting for: module.cleaders.module.cleader-sg.provider.aws
2016/04/10 23:43:55 [DEBUG] vertex root, waiting for: provider.module.cleaders.module.cleader-sg.aws (close)
2016/04/10 23:44:00 [DEBUG] vertex provider.module.cleaders.module.cleader-sg.aws (close), waiting for: module.cleaders.module.cleader-sg.provider.aws
2016/04/10 23:44:00 [DEBUG] vertex module.cleaders.module.cleader-sg.aws_security_group.main (orphan), waiting for: module.cleaders.module.cleader-sg.provider.aws
2016/04/10 23:44:00 [DEBUG] vertex root, waiting for: provider.module.cleaders.module.cleader-sg.aws (close)
```
That loops endlessly. Scrolling up to find the beginning of the loop, I see:
```
2016/04/10 23:36:05 [DEBUG] vertex provider.aws (close), got dep: aws_iam_group_membership.admin
2016/04/10 23:36:05 [DEBUG] vertex root.provider.aws (close): walking
2016/04/10 23:36:05 [DEBUG] vertex root.provider.aws (close): evaluating
2016/04/10 23:36:05 [TRACE] [walkInput] Entering eval tree: provider.aws (close)
2016/04/10 23:36:05 [DEBUG] root: eval: *terraform.EvalCloseProvider
2016/04/10 23:36:05 [TRACE] [walkInput] Exiting eval tree: provider.aws (close)
```
Poking at that some more with the TF log level set to INFO, I see the following:
```
2016/04/11 00:02:14 [ERROR] root.cleaders.cleader-sg: eval: *terraform.EvalConfigProvider, err: 1 error(s) occurred:

* No valid credential sources found for AWS Provider.
  Please see https://terraform.io/docs/providers/aws/index.html for more information on
  providing credentials for the AWS Provider
2016/04/11 00:02:14 [ERROR] root.cleaders.cleader-sg: eval: *terraform.EvalSequence, err: 1 error(s) occurred:

* No valid credential sources found for AWS Provider.
  Please see https://terraform.io/docs/providers/aws/index.html for more information on
  providing credentials for the AWS Provider
2016/04/11 00:02:14 [ERROR] root.cleaders.cleader-sg: eval: *terraform.EvalOpFilter, err: 1 error(s) occurred:

* No valid credential sources found for AWS Provider.
  Please see https://terraform.io/docs/providers/aws/index.html for more information on
  providing credentials for the AWS Provider
2016/04/11 00:02:14 [ERROR] root.cleaders.cleader-sg: eval: *terraform.EvalSequence, err: 1 error(s) occurred:

* No valid credential sources found for AWS Provider.
  Please see https://terraform.io/docs/providers/aws/index.html for more information on
  providing credentials for the AWS Provider
```
It seems this is the errant module, but I find this very perplexing, as I don't see how this module could have no creds for the AWS provider.
I'm coming in hot here insofar as I haven't actually dug through the code, but... is there a reason why TF makes a separate API call for each instance? For those of us with thousands of instances, "refreshing state" on each individual instance takes MUCH longer than doing a single describe-instances call would take.
So... why not collect the instance IDs and make a single API call for all of them? (And same for security groups etc).
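A sketch of that batching idea, with `echo` standing in for the real `aws ec2 describe-instances` call (whose `--instance-ids` parameter does accept multiple IDs); the chunk size and instance IDs here are illustrative:

```shell
# Group instance IDs into chunks of three and print the batched call
# that would be made for each chunk. `echo` is a stand-in; nothing is
# actually sent to AWS.
printf 'i-aaa\ni-bbb\ni-ccc\ni-ddd\ni-eee\n' \
  | xargs -n3 echo aws ec2 describe-instances --instance-ids
```

Five IDs at a chunk size of three yields two calls instead of five, and the savings grow with the fleet size.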
The errors I received here have gone away after upgrading the environment to use Terraform 0.7.1
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.