Terraform: Unable to find matching ingress Security Group Rule

Created on 20 Apr 2016 · 12 Comments · Source: hashicorp/terraform

Terraform version: 0.6.14 (latest as of now)
Affected resource: aws_security_group_rule

AWS security group rules are created correctly but are not added to the tfstate file, because they do not appear immediately in the follow-up security group check (the DescribeSecurityGroups request in https://github.com/hashicorp/terraform/blob/master/builtin/providers/aws/resource_aws_security_group_rule.go#L160).

Without looking at the debug log level, this is what the user sees (which is rather misleading):

* aws_security_group_rule.company_executor_yourkit_in: [WARN] A duplicate Security Group rule was found. This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
2016/04/20 01:19:26 [DEBUG] /opt/terraform-0.6.14/terraform-provisioner-file: plugin process exited
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, TCP, from port: 10001, to port: 10002, ALLOW" already exists
* aws_security_group_rule.company_general_icmp_in: [WARN] A duplicate Security Group rule was found. This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: 0.0.0.0/0, ICMP, type: ALL, code: ALL, ALLOW" already exists

The work-around of including all rules in the security group declaration is not workable for us because there doesn't seem to be a smooth migration path - on apply, the security group loses all old independent rules, while the new ones are not added.

Unfortunately our setup is extremely complex and I was unable to reproduce this issue with a simpler setup. My guess is that AWS needs more time or has a race condition so the request should be retried.

Attaching:

Initial apply log (debug level) where 2 security group rules are omitted (sgrule-751804134 and sgrule-2971013440): security.group.first.apply.log.txt
The initial tfstate file, which does not contain the 2 security group rules: security.group.tfstate.txt
Second apply log (debug level) where it bombs with the error above: security.group.second.apply.log.txt

bug provider/aws

cmlad

👍3

Most helpful comment

Just a comment from our side - we run a lot of Terraform stuff in CI, where it matters much more to us that things succeed than how long they take. A failure is a dead end, as the jobs clean up and we don't have the option to manually intervene (deleting rules, etc). We just restart the job and hope for the best.

cmlad on 21 Apr 2016

👍3

All 12 comments

I was able to work around this by modifying the Terraform source to sleep for 5 seconds before the resourceAwsSecurityGroupRuleCreate method calls resourceAwsSecurityGroupRuleRead.

cmlad on 20 Apr 2016

We just ran into exactly the same issue, thanks for diagnosing it, I can stop going over my syntax for the 15th time to validate it...

jrnt30 on 20 Apr 2016

@jrnt30 this is the really dumb way I use to work around it: https://github.com/cmlad/terraform/commit/8f6bf99e4661077e544d2c37eec108e6a4a58006 (also works around https://github.com/hashicorp/terraform/issues/1815, which has the same problem)

cmlad on 20 Apr 2016

Appreciate it, will dig in and see what we can learn. There are a few other resources that are a bit "smarter" about the retry/population of the state file, which we may be able to steal from (or generalize and reuse).

jrnt30 on 20 Apr 2016

Hey all – thanks for the information. Any additional information or reproducible configs would help. We have pretty good testing around here so I'd like to know where the holes are.

We do wrap our create/read operations in a mutex _per security group_, so concurrency is not _likely_ at fault here, hopefully :)
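For readers unfamiliar with the idea, a per-security-group mutex can be sketched roughly as follows. This is a minimal illustration in Go, not Terraform's actual locking code: `KeyedMutex`, `applyRules`, and the group ID `sg-123` are hypothetical names, and the counter increment merely stands in for the create-then-read sequence.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyedMutex hands out one lock per key (here, a security group ID).
// Hypothetical type for illustration only.
type KeyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func NewKeyedMutex() *KeyedMutex {
	return &KeyedMutex{locks: make(map[string]*sync.Mutex)}
}

// get returns the mutex for key, creating it on first use.
func (k *KeyedMutex) get(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	return m
}

// Lock blocks until the caller holds the lock for key.
func (k *KeyedMutex) Lock(key string) { k.get(key).Lock() }

// Unlock releases the lock for key; it must follow a Lock on the same key.
func (k *KeyedMutex) Unlock(key string) { k.get(key).Unlock() }

// applyRules runs n concurrent "rule creations" against one group and
// returns how many completed. The increment stands in for create + read.
func applyRules(km *KeyedMutex, sgID string, n int) int {
	rules := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			km.Lock(sgID)
			defer km.Unlock(sgID)
			rules++
		}()
	}
	wg.Wait()
	return rules
}

func main() {
	km := NewKeyedMutex()
	fmt.Println(applyRules(km, "sg-123", 50))
}
```

A lock like this only serializes operations inside one Terraform process. It cannot make a read that AWS answers from stale data return the freshly created rule.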

Is this reproducible in your large setup (without causing much harm)? If so, and you're comfortable building your own version of Terraform, a debugging line here to output _what_ permissions received this ID may help in tracking down specifically which ones are missing.

From searching your logs I _thought_ it was an icmp protocol one?

catsby on 20 Apr 2016

@catsby, it happened quite often and I can definitely build a custom Terraform.

I'm not sure however what change to make to print the permissions received with the ID. Could you maybe point me?

I do not think this is a race condition. I think it is an AWS issue: AWS says "yes, I created that rule for you and associated it with the group", but the following call does not return the rule as part of the group, and Terraform therefore throws the "Unable to find matching ingress Security Group Rule" error.

cmlad on 20 Apr 2016

I too am able to run a custom build. It is certainly reproducible in our stack, and we have essentially the same situation as cmlad. The output shows aws_security_group_rule being created and the AWS SG is properly updated, but there is no output written.

One of the more complex scenarios we have (though it is not the only one that fails) is that we do not always create the SG in the stack where we create the rule. I.e.:

These are two distinct runs with completely different TF configs.
The first creates the network and base security groups.
The second is an application-specific deployment which uses terraform_remote_state to consume the outputs of the security group that network created and augments that SG's rule set with its

I'll take a stab at looking at the mutex piece but make no promises, other than that I am willing and able to build a new version with some guidance as to what you'd like to see.

jrnt30 on 20 Apr 2016

@catsby I was working through a retry policy (inspired by the AMI resource) and @cmlad's investigation for this, but am still seeing somewhat inconsistent results even with 5 retries @ 4 sec. I did, however, find a very easy way to reproduce some of this.

If you run the TestAccAWSSecurityGroupRule_Race with TF_LOG=debug a few times you should see some of the "Unable to find matching ingress Security Group Rule" output.

jrnt30 on 21 Apr 2016

I have not been able to get it to fail every time, but the TestAccAWSSecurityGroupRule_Race does have at least one SG 1 failure in need of a retry ~ 50% of the time.

My initial thought was that retrying for up to 2 minutes should certainly be enough; however, using https://github.com/jrnt30/terraform/commit/c1bb8b46b0932c6522340adf636e3291ae764ffe I have seen a few runs take over 20 attempts (the max thus far has been 37).

Personally, although it's annoying, I would prefer my run to be significantly slower but consistent.

The other thing I have been thinking about: should resourceAwsSecurityGroupRuleCreate check for the existence of a matching rule before attempting the create (or do something similar), so that the state file can be updated and the warnings removed on subsequent runs?
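As a rough illustration of that check-before-create idea, here is a small Go sketch. It assumes a simplified, hypothetical `Rule` type and in-memory rule list rather than the provider's real data structures. `ensureRule` either adopts a rule that already exists or creates it.

```go
package main

import "fmt"

// Rule is a simplified, hypothetical stand-in for an ingress permission.
type Rule struct {
	Protocol string
	FromPort int
	ToPort   int
	CIDR     string
}

// findRule reports whether want is already present among existing.
func findRule(existing []Rule, want Rule) bool {
	for _, r := range existing {
		if r == want {
			return true
		}
	}
	return false
}

// ensureRule returns the resulting rule list and whether a new rule was
// created. If a matching rule already exists it is adopted as-is, which
// is the point where a provider could record it in state instead of
// failing with a duplicate-rule error.
func ensureRule(existing []Rule, want Rule) ([]Rule, bool) {
	if findRule(existing, want) {
		return existing, false
	}
	return append(existing, want), true
}

func main() {
	want := Rule{"tcp", 10001, 10002, "0.0.0.0/0"}

	group, created := ensureRule(nil, want)
	fmt.Println(len(group), created)

	// A second apply finds the rule and adopts it rather than re-creating it.
	group, created = ensureRule(group, want)
	fmt.Println(len(group), created)
}
```

One caveat, offered as an assumption rather than a tested result: if the pre-check read is subject to the same lag as the post-create read, it could also miss a rule that was created moments earlier, so it would probably complement a retry rather than replace it.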

jrnt30 on 21 Apr 2016

cmlad on 21 Apr 2016

👍3

Hi friends – I merged #6325 which should help here, and will go out in the next release (sometime this week, I believe).

Let me know if this helps, or if there is anything else! Thanks!

catsby on 4 May 2016

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.