terragrunt --terragrunt-iam-role fails to assume role in Govcloud

Created on 25 Jul 2018  路  17Comments  路  Source: gruntwork-io/terragrunt

When using terragrunt --terragrunt-iam-role arn:aws-us-gov:iam:: when operating on an account in Govcloud (region us-gov-west-1), expected behavior is for terragrunt to assume the given role, and perform its operations. Instead, it ends up failing with InvalidClientTokenId errors.

github-terragrunt-issue.txt
The attached file shows the output from a run.

More information:

  1. Running terragrunt --terragrunt-iam-role arn:aws-us-gov:iam:: in Commercial works as expected and assumes the role correctly.
  2. Running aws sts assume-role --role-arn arn:aws:iam::111111111111:role/allow-full-access-from-other-accounts --role-session-name testing and settings env vars works correctly.
  3. Outside of terragrunt/terraform, using awscli with role_arn = arn:aws:iam::111111111111:role/allow-full-access-from-other-accounts and source_profile = mysecurityaccountprofile works as expected, assuming the role and executing the desired awscli command
  4. Using terragrunt with an AWS_PROFILE env var set to a profile that uses role_arn/source_profile does NOT work for me in Govcloud or Commercial (likely separate issue, but wanted to make note)

Let me know what other information I can provide, and if there are any debug steps I could use to get more information. I didn't find TERRAGRUNT_DEBUG=true to provide me with any more information on what endpoints the Go AWS SDK was providing.

All 17 comments

@dg-nthompson Can you verify your setup again? My understanding is that GovCloud does not implement the STS service, so assume role should not be possible at all, by design (currently, anyway).

https://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-services.html

I would love to be wrong though! This is one of my most major complaints about GovCloud!

Hi @dg-nthompson! The "The security token included in the request is invalid" error, unfortunately, can mean many different things; it's a classic case of AWS not providing helpful error messages. Some thoughts/ideas:

  1. One potential cause of this error is if Terragrunt is setting the wrong region when trying to assume the role. Are you using AWS credential profiles for auth (i.e., via aws configure)? Is the region set there? Are you setting region via env vars by any chance (e.g., try AWS_DEFAULT_REGION)? Relevant code in Terragrunt is here and here, and as neither part sets a region, this could well be the issue when not using commercial regions.

  2. Is it possible you have any other stray env vars set? Esp an out-of-date AWS_SESSION_TOKEN?

  3. Are you storing state in S3? Is the S3 bucket also in gov cloud? Do you need to set a profile param in the backend config in terraform.tfvars?

Hi @brikis98! I went through another round of testing just to make sure I hadn't overlooked anything that you mentioned above. What I have found is the following (as they relate to your questions above):

  1. No matter how I go about trying to set region, terragrunt --terragrunt-iam-role will always attempt to use sts.amazonaws.com as the STS endpoint instead of sts.us-gov-west-1.amazonaws.com
[pid   658] <... read resumed> "\255\22\201\200\0\1\0\1\0\0\0\0\3sts\tamazonaws\3com\0\0"..., 512) = 51
[pid   652] <... select resumed> )      = 0 (Timeout)
[pid   658] epoll_ctl(6, EPOLL_CTL_DEL, 5, 0xc420333b1c <unfinished ...>
[pid   652] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=20} <unfinished ...>
[pid   658] <... epoll_ctl resumed> )   = 0
[pid   658] close(5)                    = 0
[pid   652] <... select resumed> )      = 0 (Timeout)
[pid   658] futex(0xc4201c4510, FUTEX_WAKE, 1 <unfinished ...>
[pid   652] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=20} <unfinished ...>
[pid   658] <... futex resumed> )       = 1
[pid   657] <... futex resumed> )       = 0
[pid   658] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP <unfinished ...>
[pid   657] epoll_wait(6,  <unfinished ...>
[pid   652] <... select resumed> )      = 0 (Timeout)
[pid   657] <... epoll_wait resumed> [], 128, 0) = 0
[pid   658] <... socket resumed> )      = 3
[pid   657] futex(0xc42000cd10, FUTEX_WAKE, 1 <unfinished ...>
[pid   658] setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4 <unfinished ...>
[pid   657] <... futex resumed> )       = 1
[pid   658] <... setsockopt resumed> )  = 0
[pid   657] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP <unfinished ...>
[pid   658] connect(3, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("54.239.29.25")}, 16 <unfinished ...>

NOTE: Also confirmed via packet capture.

I do have region set in ~/.aws/config, and I have attempted by setting export AWS_DEFAULT_REGION=us-gov-west-1, or passing -var region=us-gov-west-1.

I am right there with you that it feels like the go SDK isn't detecting the region so it isn't looking up the endpoint correctly as described in https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/

  1. I have validated that I have no stray env vars each time I run for anything AWS_. No AWS_SESSION_TOKEN, AWS_SECRET, AWS_ACCESS, AWS_DEFAULT_REGION, etc are set with the exception of AWS_PROFILE with the profile that has sts assume-role permissions available to it. This is a terraform set of keys in the security account.

  2. We do store our state in s3 and the bucket is also in govcloud. I can add a profile in there for this, but wouldn't terragrunt use the IAM role it is switching to by default?

Let me know what output/configs you would prefer to see. I can run any tests you would like. As mentioned above, I have no issues with assuming a role on the commandline, so I know the roles themselves work as expected in govcloud.

@dg-nthompson This might be a side-effect of what I found in the AWS Docs on GovCloud...

You can't create a role to delegate access between an AWS GovCloud (US) account and an AWS account.

https://docs.aws.amazon.com/govcloud-us/latest/UserGuide/govcloud-iam.html

What I'm suspecting is when it sees the original profile, it knows it's an AWS commercial account, so it won't let the assume request to be passed to the GovCloud endpoints.

@dg-nthompson OK, so browsing the sts docs, it looks like https://sts.amazonaws.com (us-east-1) is used as the default for all requests unless you override it. This works for normal commercial regions, but for Gov Cloud, the docs say you have to override it to sts.us-gov-west-1.amazonaws.com. So I'm guessing this is the issue you're hitting.

Perhaps the solution is to allow users to specify a custom endpoint for sts in the Terragrunt config. We already do that in one place for S3, so I think it would be reasonably straightforward to add this (note, the other AssumeRole function would need it too).

@dg-nthompson, would you be up for a PR? That would let you test the changes directly with your Gov Cloud setup.

+1
Verified, i am running into this terragrunt incompatibility issue with STS and govcloud both in us-gov-east as well as us-gov-west.
Same exact code works perfectly in normal accounts, ala us-east-1, but not in gov cloud.

To reproduce, run the following against us-gov-east or west:
terragrunt validate-all --terragrunt-ignore-dependency-errors

The error is:
Underlying error: InvalidClientTokenId: The security token included in the request is invalid status code: 403, request id: axxxxx

The problem is, somewhere the terragrunt request is hardcoded to go against:
sts.amazonaws.com

This can be proved by fooling STS and forcing the request to go to sts inn govcloud by modifying /etc/hosts for example as follows:

#--# terragrunt hack to Force STS to govcloud to sts.amazonaws.com -> sts.us-gov-east-1.amazonaws.com # (ping sts.us-gov-east-1.amazonaws.com to get current IP)
52.46.104.8     sts.amazonaws.com

And then the error changes and the problem becomes certs:
Underlying error: RequestError: send request failed caused by: Post https://sts.amazonaws.com/: x509: certificate is valid for sts.us-gov-east-1.amazonaws.com, not sts.amazonaws.com

...next step would be to grep the terragrunt src code and find where sts.amazonaws.com is hardcoded and change it to respect the target region or make it available for override.

...sucks, had high hopes for this tool, as it is a dependency for cloudcraft.co to function...

could be related to:
stsClient := sts.New(sess)

in:
https://github.com/gruntwork-io/terragrunt/blob/4405a6448d1cc54fa9a69d8799aac42b8059dd19/aws_helper/config.go#L91

it looks like it is using AWS Go SDK and based on documentation it defaults to us-east-1:
https://docs.aws.amazon.com/sdk-for-go/api/service/sts/

Endpoints
The AWS Security Token Service (STS) has a default endpoint of https://sts.amazonaws.com that maps to the US East (N. Virginia) region.

trying to figure out how to override this in the Go SDK...

based on this documentation here:
https://docs.aws.amazon.com/sdk-for-go/api/aws/endpoints/

i think it would be adding the region in this function:
https://github.com/gruntwork-io/terragrunt/blob/4405a6448d1cc54fa9a69d8799aac42b8059dd19/aws_helper/config.go#L80

with something like:

// Make API calls to AWS to assume the IAM role specified and return the temporary AWS credentials to use that role
func AssumeIamRole(iamRoleArn string) (*sts.Credentials, error) {
        sess, err := session.NewSession()

       // govcloud region hack:
       sess := session.Must(session.NewSession(&aws.Config{
           Region:           aws.String("us-gov-east-1"),
           EndpointResolver: endpoints.ResolverFunc(myCustomResolver),
       }
      // end govcloud hack

        if err != nil {
                return nil, errors.WithStackTrace(err)
        }

        _, err = sess.Config.Credentials.Get()
        if err != nil {
                return nil, errors.WithStackTraceAndPrefix(err, "Error finding AWS credentials (did you set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables?)")
        }

        stsClient := sts.New(sess)

        input := sts.AssumeRoleInput{
                RoleArn:         aws.String(iamRoleArn),
                RoleSessionName: aws.String(fmt.Sprintf("terragrunt-%d", time.Now().UTC().UnixNano())),
        }

        output, err := stsClient.AssumeRole(&input)
        if err != nil {
                return nil, errors.WithStackTrace(err)
        }

        return output.Credentials, nil
}

@brikis98 do you think you can fix this quickly with the info above ^ or should i plan to ditch my cloudcraft / terragrunt solution for something else compatible with gov cloud?

@jacov, Would you be up for putting together a quick PR?

Key thing to figure out is how to know which region to use for a given Terragrunt user (i.e., us-east-1 vs us-gov-east-1)...

@brikis98 thanks for the offer man, but i don't think i will have time...

If i where to implement this, i would put some logic in there to respect the actual region already defined in either a env.tfvar or to respect AWS_DEFAULT_REGION in environment variables.

it seems that it does not respect the region that you are pointing your terraform / terragrunt solution to, and always going against the default in the go sdk which is us-east for STS requests.

@brikis98 Is there any way this could get prioritized and addressed soon?

@yorinasub17 What's the status of the work we were going to do with Govcloud?

I'm still waiting for access to our GovCloud account.

Please note that we can successfully use terragrunt while with GovCloud if we set AWS_REGION when running the command and assuming the role. We have had this successfully running in our pipeline for the last 6 months or so.

@dg-nthompson Where/how are you setting AWS_REGION?

I've found that setting AWS_REGION or AWS_SDK_LOAD_CONFIG seems to work.

$ AWS_SDK_LOAD_CONFIG=1 AWS_PROFILE=govcloud terragrunt plan

Or

$ AWS_REGION=us-gov-west-1 AWS_PROFILE=govcloud terragrunt plan

I have set AWS_STS_REGIONAL_ENDPOINTS=regional to use the regional endpoint. more info here

Was this page helpful?
0 / 5 - 0 ratings