Terraform: Inconsistent results when creating AWS VPC infrastructure using Terraform

Created on 21 May 2016  ·  10 Comments  ·  Source: hashicorp/terraform

We are using Terraform to lay down AWS infrastructure resources. Our VPC consists of about 10 subnets, a VPN server instance, 3 Route 53 zones, several security groups, etc.

When we deploy the infrastructure it usually works without error. But occasionally we get errors that we can't explain. Here's an example:

I, [2016-05-20T00:29:50.251046 #3328] INFO -- : Error applying plan:
I, [2016-05-20T00:29:50.251046 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : 7 error(s) occurred:
I, [2016-05-20T00:29:50.266650 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: bebb274d-0a37-422d-b3ab-7270f8b49519
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.outbound_ssh_master: Error authorizing security group rule type egress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 51d73183-12ad-4652-bcc7-5ca23da229ae
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.mesos_slave_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: f2ad1c25-d49b-4cf4-a179-46653d89e442
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.all_ports_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: cdfc616c-133e-4c87-a1ce-676d9a0e76fe
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 48babf0e-b9d1-47fe-aa19-283328cfcb15

We've opened a case with AWS and they came back with the following response:


Hello,

Thank you for contacting AWS Premium Support.

I understand that you are seeing a few errors while using the Terraform tool. Let me address the errors one by one:

  • aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
  • aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist
  • From the request IDs for these errors, I saw that the API calls were to create a tag on subnet-632d513b and rtb-f1296696. These failed because the subnet/route table did not exist. This may be for one of two reasons: either the resource creation failed before this, or the resource was not found because of the eventual consistency model of the API calls [1]
  • Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
  • Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'
  • This is again because the subnet and security group were not created, were still in the process of being created, or were affected by the eventual consistency model of the API calls.
  • aws_security_group_rule.outbound_ssh_master: Error authorizing security group rule type egress: InvalidGroupId.Malformed: Invalid id: ""
  • This error could be a consequence of the previous errors. The security group ID could not be found, so the variable var.master_security_group_id was not set, and this caused the API call to fail.

The way around the eventual consistency model is to implement retries and exponential backoffs in the application. I am not sure if Terraform has implemented it.

Also, when you get these errors, you can check in the AWS console whether the resources actually exist. This way you can narrow down whether the issue was a resource creation error or not.

Hope this information is helpful to you. Please let me know if you have further questions and I will be happy to help you.

Links:
[1] http://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency

Best regards,

Truptesh
Amazon Web Services


We read this as AWS saying that the caller of the AWS SDK (in our case, Terraform) needs to handle this, either through retries or by checking that AWS resources are fully deployed before using their IDs. Surely this is not a problem unique to us? I assume AWS is one of the more popular providers among Terraform users. Has this eventual consistency issue come up before? Are there any plans or ways to address it in Terraform?
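For readers unfamiliar with the "retries and exponential backoffs" that AWS support suggests, here is a minimal sketch of the idea in Python. It is not how Terraform or the AWS SDK implements retries. The names `NotFoundYet`, `retry_with_backoff`, and `make_flaky_tagger` are hypothetical and exist only for this illustration, and the simulated tagging call stands in for a real API call that fails until a newly created resource becomes visible.

```python
import random
import time


class NotFoundYet(Exception):
    """Hypothetical stand-in for an error such as InvalidSubnetID.NotFound."""


def retry_with_backoff(call, retryable=(NotFoundYet,), max_attempts=5,
                       base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run call() until it succeeds, backing off exponentially between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == max_attempts:
                # Out of attempts: the resource may never have been created.
                raise
            # The upper bound doubles on each attempt and is capped at
            # max_delay; the actual sleep is a random value below that bound.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))


def make_flaky_tagger(failures_before_visible):
    """Simulate tagging a subnet that only becomes visible after a few calls."""
    state = {"calls": 0}

    def tag_subnet():
        state["calls"] += 1
        if state["calls"] <= failures_before_visible:
            raise NotFoundYet("The subnet ID 'subnet-xxxxxxx' does not exist")
        return "tagged after %d calls" % state["calls"]

    return tag_subnet


if __name__ == "__main__":
    print(retry_with_backoff(make_flaky_tagger(2)))
```

Only errors listed in `retryable` are retried, so a genuine creation failure still surfaces once the attempts are exhausted. That matches the distinction AWS support draws between a resource that was never created and one that is simply not visible yet.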

Terraform Version

0.6.16

Affected Resource(s)

  • aws_eip
  • aws_subnet
  • others...

The problem is intermittent and doesn't always fail in the same way.

Terraform Configuration Files

It's a fairly large source base with some proprietary logic in it. It may be difficult to share all of the TF scripts involved.

Debug Output

Please provide a link to a GitHub Gist containing the complete debug output: https://www.terraform.io/docs/internals/debugging.html. Please do NOT paste the debug output in the issue; just paste a link to the Gist.

Expected Behavior

The VPC and supporting resources should deploy successfully EVERY time.

Actual Behavior

terraform apply attempts fail perhaps as often as 30% of the time.

Steps to Reproduce

  1. terraform apply

Do you have any bug fixes or ideas for making this run more reliably?

bug provider/aws waiting-response

All 10 comments

Hey @achalupa74 – Terraform certainly does do retries, and we're very aware of the eventual consistency gotchas that come with using a platform as large as AWS!

That said we're of course not perfect and there are still scenarios we haven't covered 100%, but that's why I'm here 😄

As you can guess, setups of sufficient size have enough moving parts that it is difficult for me to diagnose the root cause without more information. Do you possibly have an example configuration that reproduces these issues with some regularity (even if only ~30% of the time)? I understand if your infrastructure is sophisticated enough that trimming it down to a reproduction case is not feasible.

* aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
* aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist

These errors are typically handled gracefully. I would be interested to see how you are referencing subnet IDs, if at all. If you can share a snippet of the config, that may help, but please be sure to omit any secrets!

Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'

These kinds of errors are unusual but have been reported before. I believe they are being tracked in another GitHub issue and being worked on. They are rare, but perhaps you're hitting something that makes them more common.

In short, we take these kinds of stability issues very seriously, and we're always working to make Terraform more stable and resilient. Unfortunately, I can only provide limited help without further configuration that lets me reproduce the problem, if there is a systemic one.

Thanks for the reply!

We have a fairly complicated infrastructure being deployed by Terraform. The initial deployment is probably between 200 and 300 AWS resources. Short of trying to come up with a simpler example that can somewhat consistently exhibit the problem, I can show you the parts of the code that the reported errors are likely related to.

First off the Security Group configuration is very isolated. We have a module for each security group and the 'manhattan_master' module that creates the security group referenced in the error is attached.
manhattan_master_security_group.zip

This is very simple and isolated. The attached module creates a security group and adds a series of SG rules to this security group. I can't see how to make this code any better, other than maybe adding a "depends_on" clause to every single rule?

The use of subnet IDs is a bit more complicated, but I can give you a code snippet that will hopefully lead you in the right direction.

module "dmz_subnet_1" {
    source = "../subnet"

    stack_name = "${var.stack_name}"
    subnet_name = "dmz1"
    subnet_cidr = "${var.dmz_subnet_cidr_1}"
    availability_zone = "${var.availability_zone_1}"
    vpc_id = "${module.vpc.vpc_id}"
    route_table_id = "${aws_route_table.public_route_table.id}"
    region = "${var.region}"
    profile = "${var.default_profile}"
}

resource "aws_eip" "elastic_ip_dmz_subnet1" {
    provider = "aws.base"
    vpc = true
}

resource "aws_nat_gateway" "nat_gateway_dmz_subnet_1" {
    provider = "aws.base"
    allocation_id = "${aws_eip.elastic_ip_dmz_subnet1.id}"
    subnet_id = "${module.dmz_subnet_1.subnet_id}"

    depends_on = [ "aws_eip.elastic_ip_dmz_subnet1", "aws_internet_gateway.internet-gateway" ]
}

This sequence basically:
- Creates a subnet (by calling a module)
- Allocates an Elastic IP
- Creates a NAT gateway that ties the subnet and the Elastic IP together

Could our problems somehow be related to the fact that the subnet is being created in a module?

Any help is much appreciated!

+1

Hi @catsby - You said: "I believe they are being tracked in another GitHub issue and being worked on."
If you have that link, I would like to see that issue. Thanks

@brendonmartino – I misspoke I suppose, the issue I was referring to was a PR meant to fix this kind of issue:

and follow up PR:

Unfortunately, I do not believe either of those is in a released version of Terraform, but they can be found in v0.7.0-rc1.

Just hit a similar error:

aws_subnet.private-persistence.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-xxxxxxx' does not exist

Snippet of relevant code:

resource "aws_subnet" "private-persistence" {
    count = "${length(split(",", var.aws_availability_zones))}"
    vpc_id = "${aws_vpc.main.id}"
    availability_zone = "${element(split(",", var.aws_availability_zones), count.index)}"
    cidr_block = "${cidrsubnet(var.cidr_block, 5, count.index + 10)}"
}

resource "aws_route_table_association" "private-persistence" {
    count = "${length(split(",", var.aws_availability_zones))}"
    subnet_id = "${element(aws_subnet.private-persistence.*.id, count.index)}"
    route_table_id = "${element(aws_route_table.private-persistence.*.id, count.index)}"
}

I'm using Terraform v0.6.16.

Update: still intermittently seeing the same aws_subnet.private-persistence.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-xxxxxxx' does not exist error on Terraform 0.7.2.

Hello – I'm following up on this issue as some time has passed and we've since released several new versions of Terraform.

Unfortunately our findings here were inconclusive; we were never able to reproduce this issue. Can anyone comment further, or supply a reproduction case? I would like to know if this is still an issue you're encountering, otherwise I'd like to close the issue.

I'm going to close this for now. Please let us know if anyone has more information or a reproduction case. Thanks!

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
