Terraform-provider-aws: Expired STS token results in terraform to hang

Created on 5 Aug 2017 · 18 Comments · Source: hashicorp/terraform-provider-aws

If my STS token in ~/.aws/credentials is expired, when I invoke terraform apply, it will seemingly hang and become unresponsive, requiring two SIGINTs to quit. Trace logs show that it's repeatedly calling sts:GetCallerIdentity, which results in a 403 Forbidden with an ExpiredToken code.

...
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Action=GetCallerIdentity&Version=2011-06-15
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: -----------------------------------------------------
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 2017/08/04 21:14:40 [DEBUG] [aws-sdk-go] DEBUG: Response sts/GetCallerIdentity Details:
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: ---[ RESPONSE ]--------------------------------------
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: HTTP/1.1 403 Forbidden
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Connection: close
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Content-Length: 297
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Content-Type: text/xml
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: Date: Sat, 05 Aug 2017 04:14:39 GMT
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: X-Amzn-Requestid: 99b535db-7994-11e7-8d9e-e17db6dd7b22
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: -----------------------------------------------------
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: 2017/08/04 21:14:40 [DEBUG] [aws-sdk-go] <ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:   <Error>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:     <Type>Sender</Type>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:     <Code>ExpiredToken</Code>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:     <Message>The security token included in the request is expired</Message>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:   </Error>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4:   <RequestId>99b535db-7994-11e7-8d9e-e17db6dd7b22</RequestId>
2017/08/04 21:14:40 [DEBUG] plugin: terraform-provider-aws_v0.1.3_x4: </ErrorResponse>
...
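The error body in the trace above is plain XML, so the ExpiredToken code can be picked out of it with very little machinery. Here is a minimal sketch using only Go's standard library. The type stsErrorResponse and the helper parseSTSErrorCode are hypothetical names made up for illustration; they are not taken from the provider or from aws-sdk-go, which has its own error handling.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// stsErrorResponse mirrors the <ErrorResponse> document shown in the trace log.
// Hypothetical type for illustration; not part of aws-sdk-go.
type stsErrorResponse struct {
	XMLName xml.Name `xml:"ErrorResponse"`
	Error   struct {
		Type    string `xml:"Type"`
		Code    string `xml:"Code"`
		Message string `xml:"Message"`
	} `xml:"Error"`
	RequestID string `xml:"RequestId"`
}

// parseSTSErrorCode (hypothetical helper) returns the <Code> value
// from an STS error body, e.g. "ExpiredToken".
func parseSTSErrorCode(body []byte) (string, error) {
	var resp stsErrorResponse
	if err := xml.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decoding STS error body: %w", err)
	}
	return resp.Error.Code, nil
}

func main() {
	// Body copied from the trace log in this issue.
	body := []byte(`<ErrorResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
  <Error>
    <Type>Sender</Type>
    <Code>ExpiredToken</Code>
    <Message>The security token included in the request is expired</Message>
  </Error>
  <RequestId>99b535db-7994-11e7-8d9e-e17db6dd7b22</RequestId>
</ErrorResponse>`)

	code, err := parseSTSErrorCode(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("STS error code:", code)
}
```

A caller could compare the returned code against "ExpiredToken" to decide whether retrying makes sense at all.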

Terraform Version

Terraform v0.10.0

Affected Resource(s)

N/A

Terraform Configuration Files

provider "aws" {
  region = "us-east-1"
}

resource "aws_security_group" "default" {
  # doesn't matter which resource(s) are used
  name = "foo"
}

Debug Output

See above. I can generate a full trace log if necessary.

Panic Output

N/A

Expected Behavior

What should have happened?

The authentication process should check for an ExpiredToken response code and either return an error or emit some message. I found that if I swap in a fresh unexpired token while terraform apply is in this (seemingly unresponsive) loop, it'll work. If the token expires in the middle of a command, it would be nice to allow the user to replace the token (i.e., by polling like it does now), but if the very first auth results in an ExpiredToken, then perhaps it would be appropriate to abort the command?

Actual Behavior

What actually happened?

Terraform seemed to hang, requiring two Ctrl-Cs to abort.

Steps to Reproduce

terraform apply

Important Factoids

References

enhancement provider

Source

ericdahl

👍34

Most helpful comment

I too see this issue very often. I think terraform should just exit when it receives a token expiry error. When we send SIGINT, the state files are left incomplete, because some resources were never recorded in the state file: terraform had not yet received the creation-complete response for them.

roshpr on 31 Jan 2018

👍2

All 18 comments

I think this behaviour is kind of intended, because Terraform does not have a way to lease a new (valid) token; it expects the user to do that and carries on once they have.

With that said we could (and probably should) give the user some feedback in the UI when this occurs instead of blindly retrying behind the scenes.

radeksimko on 7 Aug 2017

This would be a nice enhancement. I've been working around this by prefixing my terraform commands with an aws cli command to verify my credentials have not expired yet because aws cli will throw a 255 exit code when credentials expire:

aws s3 ls > /dev/null ; echo $?

An error occurred (ExpiredToken) when calling the ListBuckets operation: The provided token has expired.
255

Preston4tw on 7 Aug 2017

aws sts get-caller-identity may be more lightweight in that context and for that purpose.

radeksimko on 7 Aug 2017

👍2

The expiry of the token causes a bigger problem if terraform hangs and cannot progress, and cannot save its state to the S3 bucket that you're using for backend state storage because you don't have permission... because your token has expired. The only workaround that I've found is to get a new token before any terraform operation, and hope that the operation won't take more than an hour.

charles-at-geospock on 9 Aug 2017

@charles-at-geospock Thanks for sharing feedback from that angle. Do you have any suggestions for solutions in mind?

From my own experience, a token is quite an abstract thing in AWS, as it may come from different sources (sts/GetSessionToken, plain sts/AssumeRole, sts/AssumeRoleWithSAML, sts/AssumeRoleWithWebIdentity or sts/GetFederationToken), and therefore the process of refreshing the token may differ significantly depending on your environment.

In some cases we can't really refresh the token at all without prompting the user for some extra input (e.g. login details for SAML/OAuth), and that would require some significant changes in both the core schema helper and the aws provider. Also, Terraform was always designed mainly as a CLI tool which can do things without user interaction where possible, so I'd be hesitant about adding such complex functionality into the UI, as it can make scripting more difficult.

Ain't saying it's impossible or that we won't do it - I'm merely sharing my viewpoint.

radeksimko on 9 Aug 2017

I'm honestly not sure - the go SDK appears to support the AWS_PROFILE variable that the command line tools use, but I couldn't see how to make it work with Terraform to use that, or whether that would be able to be handled by the SDK itself for renewal.

I've only used the AssumeRole method, so I'm not sure of the others. Looking at the ARNs returned, there may be some way to handle this if there is a consistent form that could be interpreted, e.g. *:sts:<account>:<mechanism>/<parameters>
where <mechanism> controls the parameters:

* 'assume-role' => `<role>/<label>` (and the original user is given by the tail of the user-id)

though how you might recover the original credentials, I'm not sure. For me, at the command line, I revert to the default profile and then use the extracted parameters to re-request a token, but that won't work if the token came from elsewhere.
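To make the ARN idea above concrete, here is a rough sketch of splitting an STS ARN into account, mechanism and parameters. It assumes the general layout arn:<partition>:sts::<account>:<mechanism>/<parameters>. As far as I know the mechanism for an assumed role appears as assumed-role in real ARNs, rather than 'assume-role' as written above. The names stsIdentity and parseSTSArn, the account number and the role and session names are all invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// stsIdentity holds the pieces of an STS ARN (hypothetical type).
type stsIdentity struct {
	Account   string
	Mechanism string   // e.g. "assumed-role"
	Params    []string // e.g. role name and session label
}

// parseSTSArn (hypothetical helper) splits an ARN of the assumed form
// arn:<partition>:sts:<region>:<account>:<mechanism>/<parameters>.
// The region field is normally empty for STS ARNs.
func parseSTSArn(arn string) (stsIdentity, error) {
	parts := strings.SplitN(arn, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" || parts[2] != "sts" {
		return stsIdentity{}, fmt.Errorf("not an STS ARN: %q", arn)
	}
	mechanism, params, found := strings.Cut(parts[5], "/")
	if !found || params == "" {
		return stsIdentity{}, fmt.Errorf("no parameters in STS ARN: %q", arn)
	}
	return stsIdentity{
		Account:   parts[4],
		Mechanism: mechanism,
		Params:    strings.Split(params, "/"),
	}, nil
}

func main() {
	// Example values are made up; only the overall shape matters.
	id, err := parseSTSArn("arn:aws:sts::123456789012:assumed-role/deploy/alice-session")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("account=%s mechanism=%s params=%v\n", id.Account, id.Mechanism, id.Params)
}
```

This only recovers which role and session label were used. It says nothing about the original credentials, which is the hard part described above.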

The more I look at this (and say to myself 'huh, I've had a quite naive approach'), the more I think that the only way to usefully deal with renewing the token is if TF itself gained it (using the necessary parameters) such that it can do so again when an expiry error happens. Maybe trying to get some support off Amazon to find out what they might expect for this sort of thing - presumably they have similar constraints when it comes to CloudFormation. I haven't spoken to Amazon about such things, though.

I wholeheartedly agree that requiring the user to interact in the middle of the session probably isn't a workable solution. One place where TF earns its place is the CI/CD workflow, where the same mechanism is used for testing as for deployment. In such cases, just as you hope for production use, you hope to leave it alone and let it get on with the job :-(

One other 'interesting' real world case of problems that I had was that the token expired after requesting the building of a number of expensive EC2 systems... and then because it couldn't store the results in the S3 bucket (and I didn't think to look at the backup files), it never recorded the instance ids. There were just a few extra EC2 systems running when I came to review the state of an account a few days later. Because the expiry happened before the tags were asserted (as the tag assignment is done after the creation for some operations), finding the reason those systems had been created was slightly more difficult - they didn't have names, or associated 'purpose' or other tags that we use for tracking.

I hope it's not felt that I've hijacked a thread with a variation on the issue, but seeing that someone else suffered from the same things I have spurred me to give my own experience in the hope that someone knows a way to go.

charles-at-geospock on 9 Aug 2017

I see the same issue except that Ctrl-C isn't sufficient - I seem to have to kill -9 the process 😕

If I lean on Ctrl-C, here's what I see in the debug log:

^C2017-10-26T13:13:53.406+0100 [DEBUG] plugin.terraform-provider-template_v1.0.0_x4: 2017/10/26 13:13:53 [DEBUG] plugin: received interrupt signal (count: 33). Ignoring.
2017-10-26T13:13:53.406+0100 [DEBUG] plugin.terraform-provider-aws_v1.1.0_x4: 2017/10/26 13:13:53 [DEBUG] plugin: received interrupt signal (count: 33). Ignoring.
^C2017-10-26T13:13:53.426+0100 [DEBUG] plugin.terraform-provider-aws_v1.1.0_x4: 2017/10/26 13:13:53 [DEBUG] plugin: received interrupt signal (count: 34). Ignoring.
2017-10-26T13:13:53.426+0100 [DEBUG] plugin.terraform-provider-template_v1.0.0_x4: 2017/10/26 13:13:53 [DEBUG] plugin: received interrupt signal (count: 34). Ignoring.

This is on OS X 10.12.6 with terraform 0.10.8. Any suggestions why ctrl-c doesn't work here?

jdelStrother on 26 Oct 2017

I also experience this issue, and as a result use the following workaround in my ~/.bash_profile:

terraform () {
    aws sts get-caller-identity > /dev/null && /usr/local/bin/terraform "$@"
}

rifelpet on 31 Jan 2018

In my opinion this is the expected behaviour. Any workaround would open a security risk.
Also, even if TF fails, the state is kept in RAM until you rerun apply, and then it's pushed to S3.

Another way is to use aws-vault with --server mode, which will renew the credentials for you.

FernandoMiguel on 3 Feb 2018

👎1

@FernandoMiguel How is it a security risk?

lvh on 19 Mar 2018

@lvh long running credentials

FernandoMiguel on 10 Apr 2018

Gotcha. I was confused because the main workaround/bug fix I see being discussed right now is having terraform error out when the credentials become invalid instead of just hanging. That doesn't increase credential lifetime; it just improves error messages. How would you otherwise recover from a hanging TF process? (I agree the state is kept in RAM there, but that doesn't seem very useful as it continues to bash its head against the AWS SDK with an expired credential :))

lvh on 10 Apr 2018

👍1

@lvh I will agree with you there.
I sometimes get hit by this issue too, and seeing tf keep retrying is just silly.

FernandoMiguel on 12 Apr 2018

I know I said this was by design...
But since I keep getting hit by this when running from my laptop (not CI, obviously),
is there a way for the timeout of the credentials to be lower?

FernandoMiguel on 1 Jun 2019

No reason to opaquely retry. Fail per principle of least astonishment. I can re-auth, push the erred tfstate file and get on with my life.

robottaway on 24 Jul 2019

lamazoidius

woodcockjosh on 20 Aug 2019

It would be really nice if this was handled simply by exiting and displaying the expired token message. Currently I have to kill the process and forcibly unlock the Terraform cloud workspace which is a lot of manual work for an error which could trivially be detected with no side-effects.

Reported as https://support.hashicorp.com/hc/en-us/requests/24845