Terraform-provider-aws: aws_security_group: timeout while waiting for state to become 'success'. Subsequent terraform runs fails on that resource

Created on 25 Jan 2018 · 20Comments · Source: hashicorp/terraform-provider-aws

Short story: we know that AWS is throttling our API requests. Sometimes we timeout on creating a security group. The problem is however that subsequent terraform runs are failing because the security group was created but is not completely present in tfstate. Security group rules are not recorded in tfstate.

Terraform Version

terraform 0.11.2
aws provider ersion 1.7.1

Affected Resource(s)

aws_security_group

There might be a problem on how terraform handles resources that fails. perhaps on failure this resource should be tainted so subsequent runs succeeds.

Terraform Configuration Files

resource "aws_security_group" "base_sg" {
  name        = "base_project_sg_${var.sqsc_project_name}_${var.environment}"
  description = "Basic Security Group for ${var.sqsc_project_name} ${var.environment}"
  vpc_id      = "${data.aws_vpc.main.id}"

  tags {
    Name        = "base_project_sg_${var.sqsc_project_name}_${var.environment}"
    Environment = "${var.environment}"
    Project     = "${var.sqsc_project_name}"
    ProjectUuid = "${var.sqsc_project_uuid}"
  }
}

resource "aws_security_group_rule" "base_sg_ingress_ssh" {
  security_group_id = "${aws_security_group.base_sg.id}"
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
}

resource "aws_security_group_rule" "base_sg_ingress_http" {
  security_group_id = "${aws_security_group.base_sg.id}"
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
}

// ...

resource "aws_security_group_rule" "base_sg_egress" {
  security_group_id = "${aws_security_group.base_sg.id}"
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
}

Debug Output

This is a transident error with terraform running in an automated environment. We do not have debug output for this run at the moment.

However, we run terraform multiples times, and the first time we run it, we have the following error

1 error(s) occurred:

* aws_security_group.base_sg: 1 error(s) occurred:

* aws_security_group.base_sg: timeout while waiting for state to become 'success' (timeout: 5m0s)

Then all subsequent terraform apply executions fails with:

1 error(s) occurred:

* aws_security_group_rule.base_sg_egress: 1 error(s) occurred:

* aws_security_group_rule.base_sg_egress: [WARN] A duplicate Security Group rule was found on (sg-048b7c7e). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, ALL, ALLOW" already exists

Full logs here: https://gist.github.com/mildred/9245356ec1ef599f91eb15f2bd9a6666

Expected Behavior

Terraform should taint the security group if it fails on it due to a timeout so next run will create it anew. Or perhaps just taint the security_group_rules within it. Or it should register the security group rules properly in the tfstate.

Actual Behavior

Terraform timeouts then fails to create the resource because a rule it thought was not present is created.

Steps to Reproduce

Run terraform enough to be throttled by AWS

Important Factoids

Our API requests are being throttled by AWS (we asked the support for that and they confirmed it)
We increased max_retried setting for the aws provider to 40

bug servicec2

Source

mildred

👍10

Most helpful comment

Perhaps an easy way to mitigate this problem could be to allow configurable timeouts on aws_security_group resources. As it is now, it is not configurable:

[ERR] Error decoding timeout: Timeout Key (create) is not supported

mildred on 29 Jan 2018

👍8

All 20 comments

Perhaps an easy way to mitigate this problem could be to allow configurable timeouts on aws_security_group resources. As it is now, it is not configurable:

[ERR] Error decoding timeout: Timeout Key (create) is not supported

mildred on 29 Jan 2018

👍8

I am having a similar issue. Our TF deployments fail intermittently on aws_security_group timeouts. However, it is always a different security group so there is no clear pattern. IMHO just adding a timeout is only half the story if rate throttling indeed is the root cause. Then, TF should gradually decrease the rate at which it fires API calls while waiting to the timeout at the same time

domdom82 on 20 Feb 2018

Just found there is already a max_retries option on the provider here.

However, it still keeps running into the 5min timeout even-though I bumped max_retries. I think the global timeout does not respect the max_retries setting.

domdom82 on 22 Feb 2018

Did some more investigation: I am getting these error messages from time to time:

* aws_security_group.slave: timeout while waiting for state to become 'success' (timeout: 5m0s)

However, when I look at the code of aws_security_group.resourceAwsSecurityGroupCreate in https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_security_group.go#L277 you can see there is a hard-coded time out of 10 minutes for SG creation:

    // Wait for the security group to truly exist
    log.Printf(
        "[DEBUG] Waiting for Security Group (%s) to exist",
        d.Id())
    stateConf := &resource.StateChangeConf{
        Pending: []string{""},
        Target:  []string{"exists"},
        Refresh: SGStateRefreshFunc(conn, d.Id()),
        Timeout: 10 * time.Minute,
    }

    resp, err := stateConf.WaitForState()
    if err != nil {
        return fmt.Errorf(
            "Error waiting for Security Group (%s) to become available: %s",
            d.Id(), err)
    }

So I am wondering: Could this longer timeout be overshadowed by a generic 5 minute timeout so that it does not even apply?

domdom82 on 27 Feb 2018

I dug even deeper and found out that the real timeout occurs not in the security_group but in the tag on the security group:

    resp, err := stateConf.WaitForState()  // this works just fine and adheres the 10 minute timeout
    if err != nil {
        return fmt.Errorf(
            "Error waiting for Security Group (%s) to become available: %s",
            d.Id(), err)
    }

    if err := setTags(conn, d); err != nil {      // this fails
        return err
    }

setTags in turn calls tags.go:

    // Set tags
    if len(remove) > 0 {
        err := resource.Retry(5*time.Minute, func() *resource.RetryError {
            log.Printf("[DEBUG] Removing tags: %#v from %s", remove, d.Id())
(...)
    if len(create) > 0 {
            err := resource.Retry(5*time.Minute, func() *resource.RetryError {
                log.Printf("[DEBUG] Creating tags: %s for %s", create, d.Id())

which has a hard-coded timeout of 5 minutes. Since tags can be on any resource, this timeout acts in fact as a hidden timeout on all AWS resources that have tags. Not cool.

domdom82 on 1 Mar 2018

@mildred this becomes even worse as timeouts can currently only be defined on resources but tags are not modelled as resources. So a potential fix would have to "inherit" the timeout from the resource that wants to be tagged or otherwise use a default value. hmm.

domdom82 on 1 Mar 2018

Opened two PRs that should fix both problems mentioned in this issue.

domdom82 on 2 Mar 2018

We actually see that this happens with creating EIP tags as well. I think #3586 will help facilitate fixing that but we'll need something like #3599 but for EIPs to see it through. Should I make a new issue for that?

2rs2ts on 13 Mar 2018

Does anyone have a hack/workaround to handle this problem (and destroy the resource somehow) while a fix is being developed?

fullybonkers on 14 Mar 2018

I didn't follow up on this issue, sorry. Our solution was to automatically retry failed terraform execution once. It worked for most of the occurrences but we still have a small number of cases where terraform is failing and the second run it continues to fail because the first failure made a half-resource.

This specific case can be identified by the first terraform rule showing an error like:

aws_security_group_rule.base_sg_ingress_ssh: Error finding matching ingress Security Group Rule (sgrule-1225671002) for Group sg-1d6d1567

And the next terraform is failing with:

aws_security_group_rule.base_sg_ingress_ssh: [WARN] A duplicate Security Group rule was found on (sg-1d6d1567). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, TCP, from port: 22, to port: 22, ALLOW" already exists

We could dig this up to a hardcoded 5m timeout too on the DescribeSecurityGroups API call. The code here is obvious:

https://github.com/terraform-providers/terraform-provider-aws/blob/faeda15f5259d4925824774f1419f69b0f000d5c/aws/resource_aws_security_group.go#L324-L350

The solution is probably not to have configurable timeouts, but for terraform to honor throttling algorithms for AWS requests. In probably multiple places too...

We contacted AWS support and they told us:

Please be aware as I was indicated recently you have plenty of throttled calls during the time frame you mentioned, not only in relation to pointed API call "DescribeSecurityGroups". To avoid similar issues in the future, please follow this workaround related to situations where high amount of API calls affects efficiency. To deal with it is crucial to implement exponential back off and API retry logic, as well as jitter.

This is a method to stagger your API calls by having each server have a random sleep interval (this is called jitter), and make sure that if it gets throttled or an error occurs, it will wait for a small random interval before trying again (this is implemented as exponential back off). A more detailed description of how to implement this can be found in the documentation found at [1]. You can find more details on implementing these under those links [2] and [3].

[1] https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
[2] http://docs.aws.amazon.com/general/latest/gr/api-retries.html
[3] https://www.awsarchitectureblog.com/2015/03/backoff.html

mildred on 20 Mar 2018

We now have implemented saving debug output on our failed terraform calls. I'll soon attach full debug output here.

mildred on 20 Mar 2018

Did not attach full debug output because it contained credentials. However, there is some debug output I posted in https://github.com/terraform-providers/terraform-provider-aws/pull/3586#issuecomment-375384063 that supports the need of the pull request #3911:

I have the following terraform error:

* aws_security_group_rule.base_sg_ingress_services: Error finding matching ingress Security Group Rule (sgrule-4087802998) for Group sg-8eb99cf4

The debug logs tell me terraform performs the following request:

2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [INFO] Security Group ID: sg-f284a188
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for Security Group (sg-f284a188) to exist
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for state to become: [exists]
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ REQUEST POST-SIGN ]-----------------------------
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: POST / HTTP/1.1
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Host: ec2.eu-west-1.amazonaws.com
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: User-Agent: aws-sdk-go/1.12.62 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.2
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Length: 70
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Type: application/x-www-form-urlencoded; charset=utf-8
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: X-Amz-Date: 20180322T050116Z
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Accept-Encoding: gzip
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Action=DescribeSecurityGroups&GroupId.1=sg-8eb99cf4&Version=2016-11-15
2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------

And the next response in the logs for ec2/DescribeSecurityGroups I get is:

2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeSecurityGroups Details:
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ RESPONSE ]--------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: HTTP/1.1 503 Service Unavailable
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Connection: close
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Transfer-Encoding: chunked
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Date: Thu, 22 Mar 2018 05:01:16 GMT
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Server: AmazonEC2
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>936c5bbe-1957-4911-a928-8d08d3b7d1d3</RequestID></Response>

I would say this confirms the rate limiting is causing the error

mildred on 26 Mar 2018

@mildred terraform does implement an exponential backoff algorithm, for example in waitForState here: https://github.com/hashicorp/terraform/blob/master/helper/resource/state.go#L51

However, I think the algorithm is not used in every API call throughout.

domdom82 on 4 Apr 2018

👍1

@terraform-providers/terraform-provider-aws
Any update on this issue?
Can someone please confirm that the exponential backoff algorithm is used on every call?

heldersepu on 2 Jul 2018

@domdom82 Here the problem is not that the exponential backoff algorithm is not in use, it's that the terraform resource doesn't have a configurable timeout, and that in my case the default timeout is too low.

mildred on 3 Jul 2018

@mildred in your case have you seen Throttling: Rate exceeded ?
I do get that every once in a while with different resources

heldersepu on 3 Jul 2018

No, I'm getting those AWS errors (in the debug output):

<Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>b699395c-cdd1-40ce-a17c-3045da933370</RequestID></Response>

mildred on 13 Aug 2018

Any update on the issue, on how to tune the aws resource delay factor. Pls advise for any steps to be followed to compile the aws-sdk-go pkg to use it locally..