Environment:
Vault Config File:
listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = "true"
}

storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

telemetry {
  statsd_address = "127.0.0.1:8125"
}
Expected Behavior:
I am using Vault to generate temporary AWS IAM keys for Terraform, which uses these keys to connect to AWS and manage the infrastructure. TF config:
provider "vault" {
  address         = "http://127.0.0.1:8200"
  skip_tls_verify = "true"
}

data "vault_generic_secret" "aws_iam_keys" {
  path = "aws/creds/admin"
}

provider "aws" {
  region     = "${var.region}"
  access_key = "${data.vault_generic_secret.aws_iam_keys.data["access_key"]}"
  secret_key = "${data.vault_generic_secret.aws_iam_keys.data["secret_key"]}"
}
So when I run terraform plan or any other command that connects to AWS, I would expect TF to connect to Vault, get a key pair, and do its job.
Actual Behavior:
$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
data.external.region: Refreshing state...
data.vault_generic_secret.aws_iam_keys: Refreshing state...
Error refreshing state: 1 error(s) occurred:
* provider.aws: InvalidClientTokenId: The security token included in the request is invalid.
status code: 403, request id: 798bd539-796e-11e7-ae15-79e40f1d9b32
Steps to Reproduce:
Happens every time TF requests a key pair from Vault, no matter what.
Important Factoids:
I've found a workaround: a simple 10-second delay between the time Vault generates the keys and the time Terraform uses them. That is enough to make Terraform work reliably with the Vault-generated keys. This suggests that the IAM keys are "eventually consistent", but Vault neglects to check that the keys are actually usable before it hands them off to the requester, or at least to wait a while before handing them off.
This is the workaround:
provider "vault" {
  address         = "http://127.0.0.1:8200"
  skip_tls_verify = "true"
}

data "vault_generic_secret" "aws_iam_keys" {
  path = "aws/creds/admin"
}

data "external" "region" {
  # workaround for Vault bug
  # https://github.com/terraform-providers/terraform-provider-aws/issues/1086
  program = ["./delay-vault-aws"]
}

provider "aws" {
  region     = "${data.external.region.result["region"]}"
  access_key = "${data.vault_generic_secret.aws_iam_keys.data["access_key"]}"
  secret_key = "${data.vault_generic_secret.aws_iam_keys.data["secret_key"]}"
}
And the related script:
#!/usr/bin/env bash
sleep 10
echo '{ "region": "us-west-2" }'
As you can see, this is pretty horrible stuff in Terraform terms. Creating a delay in TF is a pain in the neck because TF has no native capability for it. The script returns a critical piece of configuration (the AWS region string) to ensure execution actually waits for the sleep 10 and doesn't proceed further (which would fail).
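For anyone stuck with the fixed sleep, it can be swapped for an active probe. This is only a sketch under the caveats raised in this thread: it assumes the AWS CLI is installed and that the generated creds are allowed to call sts:GetCallerIdentity, and a single probe success still doesn't prove full replication (hence the extra settle delay). The wait_for_creds name and the retry counts are my own invention, not anything Vault or Terraform provides:

```shell
#!/usr/bin/env bash
# Sketch: poll until a probe call succeeds instead of sleeping blindly.
# Assumptions (not guaranteed, per the discussion in this thread): the AWS
# CLI is on PATH, and the creds' policy permits sts:GetCallerIdentity.
wait_for_creds() {
  local probe="${1:-aws sts get-caller-identity}"
  local tries="${2:-30}"
  local i
  for ((i = 0; i < tries; i++)); do
    if $probe >/dev/null 2>&1; then
      # One probe success does not prove the keys are replicated to every
      # IAM endpoint, so keep a short settle delay even after it works.
      sleep 2
      return 0
    fi
    sleep 1
  done
  return 1
}

# Usage, keeping the same output contract as the delay-vault-aws script:
#   wait_for_creds && echo '{ "region": "us-west-2" }'
```

The final echo of the region JSON would then run only after the probe succeeds, preserving the trick that forces Terraform to wait for the result before configuring the AWS provider.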
Vault should not relay the keys downstream before they are actually usable - that's like setting a trap for whoever consumes those keys.
Please do the right thing, and either check actual key availability before handing it over to the consumer, or at least introduce some delay.
References:
https://github.com/hashicorp/terraform/issues/2972
https://github.com/terraform-providers/terraform-provider-aws/issues/1086
This has come up before, but I think the result of the discussion was to keep the current behavior. The problem is that there are no guarantees about how long it will take for IAM in each region to get the creds replicated. So even if we try polling, the fact that the creds work once doesn't mean they'll work for the requester afterward. So the only thing we can do is delay and hope for the best, and then we're in the business of having to decide what that delay should be... something better left for the client to decide.
I understand where you're coming from. But look at it this way: if I'm using only HashiCorp tools, I expect them to work well together. Terraform could be used with Vault as an IAM key pair provider - in fact it's a very tempting use case; but due to this bug, Terraform fails every time. And there's no easy way in Terraform to introduce a delay, just an ugly hack.
There's also the enormous time I've wasted trying to troubleshoot this, to the point where I had started questioning the wisdom of using Vault at all. I'm fairly certain you don't want your users to go through this.
We have this issue with our automation using Terraform and Vault too. It's really frustrating that tools made by HashiCorp don't work together.
@FlorinAndrei The main issue is that we literally don't know how to solve this problem. Any delay we can inject may be too much or too little, and simply attempting to use the IAM credentials (e.g. via the newer GetCallerIdentity call) may work depending on which IAM you're hitting or may not if that particular endpoint hasn't gotten the eventually consistent creds yet, assuming your creds policy even allows such a call.
If there was a way to find out if a set of creds was fully replicated throughout IAM it'd be easy. If there is such a way, I don't know about it. What you're describing as a solution to an ugly hack in Terraform sounds to me like simply shifting the burden to an ugly hack in Vault.
This isn't an issue of "tools made by HashiCorp not working together". Your problem is with the behavior of IAM; other dynamic secrets from Vault work with no issue when fetched with Terraform via the same methods. I'm happy to fix this if a solution can be found that isn't simply shifting an ugly hack from point A to point B.
I basically agree with everything @jefferai said, and I want to add something else. In an eventually consistent system such as AWS, the onus really has to be on the client to retry in the event of eventual consistency failures. So it seems to me that the right solution is to have Terraform retry if the credentials aren't valid. Have you opened a feature request with the Terraform AWS provider to retry in the event of credential failures?
It seems AWS wrote their own hacks in their SDK to try to work around this:
https://github.com/aws/aws-sdk-go/blob/master/service/iam/waiters.go
We could use those but I'm not experienced enough with IAM to know if either of those calls (or GetCallerIdentity) will actually be available for any given set of IAM credentials with any given policy.
BTW, the Terraform people said the following: we poll, and sometimes even when polling says yes, a follow-up request to another service that references the IAM entity will then fail because it's not replicated yet. We've occasionally seen issues with the API not even being self-consistent when Terraform itself is polling and then using the credentials, but this seems to be exacerbated when the thing doing the polling is running somewhere different from the thing using the result.
So it's bad news everywhere.
Again, if there is some call that has a reasonable chance of a) being allowed for any given set of IAM creds and b) giving us somewhat of an idea of whether the creds are active, I'm happy to add it. I just don't know what those are.
It really seems like the "eventual consistency" of IAM keys within AWS is the problem here. We take AWS for granted, but it's software/hardware just like any other, and it has issues and bugs as usual.
I wonder if these scenarios would be different - whether they would have different latencies before the keys can actually be used:
Vault does 1. I wonder if strategy 2 would yield faster consistency across the AWS board. But OTOH I'm sure 2 would hit some resource limits with AWS pretty quickly.
There's also the possibility of getting STS tokens instead of generating IAM users, which Vault does support (but may not work for all use-cases).
- use a non-temporary user, make a new key pair for it, etc
That's difficult because users are limited to two access keys. That limit is mostly just there to make rotation easier -- generate a new key, replace the old one, verify nothing is still using the old one, delete the old one.
There's also the possibility of getting STS tokens instead of generating IAM users, which Vault does support (but may not work for all use-cases).
Yeah -- main thing would be software that doesn't know how to take in STS tokens (the AWS secret backend in Vault itself is an example of this!), or things that need a lease longer than 1hour.
And, STS tokens could very well suffer from the same issues.
The waiters definitely aren't the way to go here, because the waiters might see the credentials as being valid and then clients might not.
For those experiencing some sort of pain from this, I'd strongly suggest you reach out to your AWS support/account team asking for the thing Jeff mentioned, i.e., "some call that has a reasonable chance of a) being allowed for any given set of IAM creds and b) giving us somewhat of an idea of whether the creds are active." sts:GetCallerIdentity meets the first requirement, but it probably fails the second.
I really like the idea of Grant Tokens in KMS, they're designed for precisely this problem. I was hoping AWS would adopt that pattern more broadly, but I'm not aware of anything else they have with the same concept :/
The workaround from @FlorinAndrei worked for me, although since I was using the hashistack Terraform module from https://github.com/hashicorp/nomad/tree/master/terraform/aws/modules/hashistack, I had to modify my Vault AWS policy to include both "ec2:*" and "iam:*" actions. Originally, I only had the "ec2:*" action. Here is my complete policy file:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1426528957000",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "iam:*"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
Yeah -- main thing would be software that doesn't know how to take in STS tokens (the AWS secret backend in Vault itself is an example of this!), or things that need a lease longer than 1hour.
@joelthompson Could you elaborate more on Vault AWS secret backend not knowing how to take STS tokens?
As to lease duration, AFAIK STS tokens can be generated for more than 1h (15m-36h) when the backend is configured not with the root AWS account but with an IAM account that has sts:AssumeRole attached. Actually that's the recommended configuration, because it's rather risky to give permissions to the root account. The AWS docs describe it in detail.
http://docs.aws.amazon.com/STS/latest/APIReference/API_GetFederationToken.html
So for me the temporary solution would be to switch to STS Federation Tokens, at least with Terraform, until a real fix for this issue comes along.
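For reference, requesting a federation token from Vault's AWS secret backend looks roughly like this. This is a sketch: the aws/ mount point and the deploy role name are placeholders for your own setup, and as noted above, STS tokens may still be subject to the same consistency window:

```shell
# Ask the AWS backend for an STS federation token instead of a fresh IAM
# user. Mount point ("aws/") and role name ("deploy") are placeholders;
# the requested ttl is capped by the backend/role configuration.
vault read aws/sts/deploy ttl=2h

# The response contains access_key, secret_key, and security_token; a
# consumer such as the Terraform AWS provider needs all three (the
# security_token goes into the provider's "token" argument).
```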
For what it's worth, I just came by this issue. I was involved in the earlier issue for this when the AWS backend first came out.
This is by no means a problem that's specific to Vault. It exists in everything that uses the IAM API to generate keys, including AWS' own CloudFormation and other tools. IAM is eventually consistent, and that's just how it works. IMO Vault is doing the right thing by returning the creds once they're generated and leaving it up to the client to figure out what the right thing, for them, is to do - whether it's return the creds immediately to a user or out-of-band process, or poll until they're live in the required region.
The gist of the above - that IAM creds are eventually consistent - is well-known and well-documented.
We've worked around this in 2 ways:
In short, I've been dealing with this issue both with Vault and with other tooling for a while... and I really believe that it's up to the _consumer_ of the API credentials to handle them in an anti-fragile way, and do the right thing if given some creds that aren't working yet.
Hi there!
I'm closing this issue now since it appears to me that it cannot be solved by either Terraform or Vault, and is more related to the AWS API and how it behaves. Additionally, this topic hasn't been touched for the last two years.
If someone here thinks differently / has new information regarding this topic I'm happy to reopen it to continue with the discussion.
Cheers,
Michel