Short story: we know that AWS is throttling our API requests. Sometimes we timeout on creating a security group. The problem is however that subsequent terraform runs are failing because the security group was created but is not completely present in tfstate. Security group rules are not recorded in tfstate.
terraform 0.11.2
aws provider ersion 1.7.1
aws_security_group
There might be a problem on how terraform handles resources that fails. perhaps on failure this resource should be tainted so subsequent runs succeeds.
resource "aws_security_group" "base_sg" {
name = "base_project_sg_${var.sqsc_project_name}_${var.environment}"
description = "Basic Security Group for ${var.sqsc_project_name} ${var.environment}"
vpc_id = "${data.aws_vpc.main.id}"
tags {
Name = "base_project_sg_${var.sqsc_project_name}_${var.environment}"
Environment = "${var.environment}"
Project = "${var.sqsc_project_name}"
ProjectUuid = "${var.sqsc_project_uuid}"
}
}
resource "aws_security_group_rule" "base_sg_ingress_ssh" {
security_group_id = "${aws_security_group.base_sg.id}"
type = "ingress"
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
resource "aws_security_group_rule" "base_sg_ingress_http" {
security_group_id = "${aws_security_group.base_sg.id}"
type = "ingress"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
// ...
resource "aws_security_group_rule" "base_sg_egress" {
security_group_id = "${aws_security_group.base_sg.id}"
type = "egress"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
This is a transident error with terraform running in an automated environment. We do not have debug output for this run at the moment.
However, we run terraform multiples times, and the first time we run it, we have the following error
1 error(s) occurred:
* aws_security_group.base_sg: 1 error(s) occurred:
* aws_security_group.base_sg: timeout while waiting for state to become 'success' (timeout: 5m0s)
Then all subsequent terraform apply
executions fails with:
1 error(s) occurred:
* aws_security_group_rule.base_sg_egress: 1 error(s) occurred:
* aws_security_group_rule.base_sg_egress: [WARN] A duplicate Security Group rule was found on (sg-048b7c7e). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, ALL, ALLOW" already exists
Full logs here: https://gist.github.com/mildred/9245356ec1ef599f91eb15f2bd9a6666
Terraform should taint the security group if it fails on it due to a timeout so next run will create it anew. Or perhaps just taint the security_group_rules within it. Or it should register the security group rules properly in the tfstate.
Terraform timeouts then fails to create the resource because a rule it thought was not present is created.
Run terraform enough to be throttled by AWS
max_retried
setting for the aws provider to 40Perhaps an easy way to mitigate this problem could be to allow configurable timeouts on aws_security_group resources. As it is now, it is not configurable:
[ERR] Error decoding timeout: Timeout Key (create) is not supported
I am having a similar issue. Our TF deployments fail intermittently on aws_security_group
timeouts. However, it is always a different security group so there is no clear pattern. IMHO just adding a timeout is only half the story if rate throttling indeed is the root cause. Then, TF should gradually decrease the rate at which it fires API calls while waiting to the timeout at the same time
Just found there is already a max_retries
option on the provider here.
However, it still keeps running into the 5min timeout even-though I bumped max_retries
. I think the global timeout does not respect the max_retries
setting.
Did some more investigation: I am getting these error messages from time to time:
* aws_security_group.slave: timeout while waiting for state to become 'success' (timeout: 5m0s)
However, when I look at the code of aws_security_group.resourceAwsSecurityGroupCreate
in https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_security_group.go#L277 you can see there is a hard-coded time out of 10 minutes for SG creation:
// Wait for the security group to truly exist
log.Printf(
"[DEBUG] Waiting for Security Group (%s) to exist",
d.Id())
stateConf := &resource.StateChangeConf{
Pending: []string{""},
Target: []string{"exists"},
Refresh: SGStateRefreshFunc(conn, d.Id()),
Timeout: 10 * time.Minute,
}
resp, err := stateConf.WaitForState()
if err != nil {
return fmt.Errorf(
"Error waiting for Security Group (%s) to become available: %s",
d.Id(), err)
}
So I am wondering: Could this longer timeout be overshadowed by a generic 5 minute timeout so that it does not even apply?
I dug even deeper and found out that the real timeout occurs not in the security_group but in the tag on the security group:
resp, err := stateConf.WaitForState() // this works just fine and adheres the 10 minute timeout
if err != nil {
return fmt.Errorf(
"Error waiting for Security Group (%s) to become available: %s",
d.Id(), err)
}
if err := setTags(conn, d); err != nil { // this fails
return err
}
setTags
in turn calls tags.go:
// Set tags
if len(remove) > 0 {
err := resource.Retry(5*time.Minute, func() *resource.RetryError {
log.Printf("[DEBUG] Removing tags: %#v from %s", remove, d.Id())
(...)
if len(create) > 0 {
err := resource.Retry(5*time.Minute, func() *resource.RetryError {
log.Printf("[DEBUG] Creating tags: %s for %s", create, d.Id())
which has a hard-coded timeout of 5 minutes. Since tags can be on any resource, this timeout acts in fact as a hidden timeout on all AWS resources that have tags. Not cool.
@mildred this becomes even worse as timeouts can currently only be defined on resources but tags are not modelled as resources. So a potential fix would have to "inherit" the timeout from the resource that wants to be tagged or otherwise use a default value. hmm.
Opened two PRs that should fix both problems mentioned in this issue.
We actually see that this happens with creating EIP tags as well. I think #3586 will help facilitate fixing that but we'll need something like #3599 but for EIPs to see it through. Should I make a new issue for that?
Does anyone have a hack/workaround to handle this problem (and destroy the resource somehow) while a fix is being developed?
I didn't follow up on this issue, sorry. Our solution was to automatically retry failed terraform execution once. It worked for most of the occurrences but we still have a small number of cases where terraform is failing and the second run it continues to fail because the first failure made a half-resource.
This specific case can be identified by the first terraform rule showing an error like:
- aws_security_group_rule.base_sg_ingress_ssh: Error finding matching ingress Security Group Rule (sgrule-1225671002) for Group sg-1d6d1567
And the next terraform is failing with:
- aws_security_group_rule.base_sg_ingress_ssh: [WARN] A duplicate Security Group rule was found on (sg-1d6d1567). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, TCP, from port: 22, to port: 22, ALLOW" already exists
We could dig this up to a hardcoded 5m timeout too on the DescribeSecurityGroups API call. The code here is obvious:
The solution is probably not to have configurable timeouts, but for terraform to honor throttling algorithms for AWS requests. In probably multiple places too...
We contacted AWS support and they told us:
Please be aware as I was indicated recently you have plenty of throttled calls during the time frame you mentioned, not only in relation to pointed API call "DescribeSecurityGroups". To avoid similar issues in the future, please follow this workaround related to situations where high amount of API calls affects efficiency. To deal with it is crucial to implement exponential back off and API retry logic, as well as jitter.
This is a method to stagger your API calls by having each server have a random sleep interval (this is called jitter), and make sure that if it gets throttled or an error occurs, it will wait for a small random interval before trying again (this is implemented as exponential back off). A more detailed description of how to implement this can be found in the documentation found at [1]. You can find more details on implementing these under those links [2] and [3].
[1] https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
[2] http://docs.aws.amazon.com/general/latest/gr/api-retries.html
[3] https://www.awsarchitectureblog.com/2015/03/backoff.html
We now have implemented saving debug output on our failed terraform calls. I'll soon attach full debug output here.
Did not attach full debug output because it contained credentials. However, there is some debug output I posted in https://github.com/terraform-providers/terraform-provider-aws/pull/3586#issuecomment-375384063 that supports the need of the pull request #3911:
I have the following terraform error:
* aws_security_group_rule.base_sg_ingress_services: Error finding matching ingress Security Group Rule (sgrule-4087802998) for Group sg-8eb99cf4
The debug logs tell me terraform performs the following request:
2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [INFO] Security Group ID: sg-f284a188 2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for Security Group (sg-f284a188) to exist 2018-03-22T05:01:16.781Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] Waiting for state to become: [exists] 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Request ec2/DescribeSecurityGroups Details: 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ REQUEST POST-SIGN ]----------------------------- 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: POST / HTTP/1.1 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Host: ec2.eu-west-1.amazonaws.com 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: User-Agent: aws-sdk-go/1.12.62 (go1.9.2; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.2 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Length: 70 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Content-Type: application/x-www-form-urlencoded; charset=utf-8 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: X-Amz-Date: 20180322T050116Z 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Accept-Encoding: gzip 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Action=DescribeSecurityGroups&GroupId.1=sg-8eb99cf4&Version=2016-11-15 2018-03-22T05:01:16.782Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: -----------------------------------------------------
And the next response in the logs for ec2/DescribeSecurityGroups I get is:
2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] DEBUG: Response ec2/DescribeSecurityGroups Details: 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ---[ RESPONSE ]-------------------------------------- 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: HTTP/1.1 503 Service Unavailable 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Connection: close 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Transfer-Encoding: chunked 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Date: Thu, 22 Mar 2018 05:01:16 GMT 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: Server: AmazonEC2 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: ----------------------------------------------------- 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: 2018/03/22 05:01:16 [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?> 2018-03-22T05:01:16.793Z [DEBUG] plugin.terraform-provider-aws_v1.7.1_x4: <Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>936c5bbe-1957-4911-a928-8d08d3b7d1d3</RequestID></Response>
I would say this confirms the rate limiting is causing the error
@mildred terraform does implement an exponential backoff algorithm, for example in waitForState
here: https://github.com/hashicorp/terraform/blob/master/helper/resource/state.go#L51
However, I think the algorithm is not used in every API call throughout.
@terraform-providers/terraform-provider-aws
Any update on this issue?
Can someone please confirm that the exponential backoff algorithm is used on every call?
@domdom82 Here the problem is not that the exponential backoff algorithm is not in use, it's that the terraform resource doesn't have a configurable timeout, and that in my case the default timeout is too low.
@mildred in your case have you seen Throttling: Rate exceeded
?
I do get that every once in a while with different resources
No, I'm getting those AWS errors (in the debug output):
<Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>b699395c-cdd1-40ce-a17c-3045da933370</RequestID></Response>
Any update on the issue, on how to tune the aws resource delay factor. Pls advise for any steps to be followed to compile the aws-sdk-go pkg to use it locally..
@aeschright @bflad @ryndaniels any plans to make progress on this one soon please ?
Is the fix for this issue being merged for an upcoming release? It would be great since it leaves quite a mess when this occurs.
Most helpful comment
Perhaps an easy way to mitigate this problem could be to allow configurable timeouts on aws_security_group resources. As it is now, it is not configurable: