Terraform: Bug: Terraform attempts to delete security groups before dependent EC2 instances

Created on 2 Sep 2016 · 34 Comments · Source: hashicorp/terraform

I have a configuration of EC2 instances belonging to security groups in AWS:

resource "aws_instance" "foo" {
    # some config
    vpc_security_group_ids      = ["${aws_security_group.foo.id}", "${var.vpc-sg}"]
}

resource "aws_security_group" "foo" {
    # some config
    vpc_id = "${var.vpc-id}"
}

When running terraform destroy, Terraform attempts to destroy the security groups until timeout (5 minutes), at which point it prints the following error:

* aws_security_group.foo: DependencyViolation: resource sg-4f189d35 has a dependent object
        status code: 400, request id: 0a06f6a1-0792-42c5-9180-beac33fb9037

Indeed, the instances (which Terraform has not yet attempted to destroy) are dependent on the security group. Given that the configuration for these resources is maintained completely in Terraform, it seems to be some bug with the dependency resolution. I wonder if this may have anything to do with the VPC.

bug, core, provider/aws, v0.11

All 34 comments

@hydroxide I don't experience this problem with 0.7.0; do you have any other useful information such as a debug log (gist link) or other information? I use this dependency quite extensively in a lot of plans and haven't run into this unless something is modifying the SGs outside of management.

I am facing a similar issue with terraform 0.7.3, when conditionally adding an extra security group.
When the extra security group exists, attempting to remove it fails due to a dependency violation.

The terraform plan shows that the extra security group will be both deleted and disassociated from the EC2 instances:

- aws_security_group.cassandra_overrides

~ module.aws_instance_zone_a.aws_instance.cassandra_node.0
    vpc_security_group_ids.#:          "2" => "1"
    vpc_security_group_ids.2978748095: "sg-4ccbcf2b" => "sg-4ccbcf2b"
    vpc_security_group_ids.3169002004: "sg-b6e3ebd1" => ""

~ module.aws_instance_zone_a.aws_instance.cassandra_node.1
    vpc_security_group_ids.#:          "2" => "1"
    vpc_security_group_ids.2978748095: "sg-4ccbcf2b" => "sg-4ccbcf2b"
    vpc_security_group_ids.3169002004: "sg-b6e3ebd1" => ""

~ module.aws_instance_zone_b.aws_instance.cassandra_node.0
    vpc_security_group_ids.#:          "2" => "1"
    vpc_security_group_ids.2978748095: "sg-4ccbcf2b" => "sg-4ccbcf2b"
    vpc_security_group_ids.3169002004: "sg-b6e3ebd1" => ""

~ module.aws_instance_zone_b.aws_instance.cassandra_node.1
    vpc_security_group_ids.#:          "2" => "1"
    vpc_security_group_ids.2978748095: "sg-4ccbcf2b" => "sg-4ccbcf2b"
    vpc_security_group_ids.3169002004: "sg-b6e3ebd1" => ""

Yet applying the changes attempts to remove the extra security group before it is removed from the dependent EC2 instances.

aws_security_group.cassandra_overrides: Still destroying... (5m0s elapsed)
Error applying plan:

1 error(s) occurred:

* aws_security_group.cassandra_overrides: DependencyViolation: resource sg-b6e3ebd1 has a dependent object
    status code: 400, request id: b7a55106-963a-4404-aa0b-2653bc70ccf1

Here is a simplified version of my terraform config.
The idea here is to use the allow_all_internal_ips variable to define whether an extra security group should be added.

_variables.tf_

variable security_rule_overrides {
  type = "map"

  default = {
    allow_all_internal_ips = 1
  }
}

_security_group.tf_

resource "aws_security_group" "cassandra" {
  description = "Cassandra node security group"
  vpc_id      = "${data.terraform_remote_state.vpc.vpc_id}"

  ingress {
    protocol  = "tcp"
    from_port = 9042
    to_port   = 9042

    cidr_blocks = [
      "${data.terraform_remote_state.vpc.vpc_cidr_block}",
    ]
  }
}

resource "aws_security_group" "cassandra_overrides" {
  description = "Cassandra node security group overrides"
  vpc_id      = "${data.terraform_remote_state.vpc.vpc_id}"
  count       = "${lookup(var.security_rule_overrides, "allow_all_internal_ips", 0)}"

  ingress {
    protocol  = "tcp"
    from_port = 9042
    to_port   = 9042

    cidr_blocks = [
      "10.0.0.0/8",
    ]
  }
}

_nodes.tf_

module "aws_instance_zone_a" {
  source                 = "./zone_instances"
  zone_name              = "a"
  environment            = "${var.environment}"
  vpc_security_group_ids = "${aws_security_group.cassandra.id}, ${join(", ", compact(aws_security_group.cassandra_overrides.*.id))}"
  subnet_id              = "${data.terraform_remote_state.vpc.subnet_id_private_a}"
  instance_type          = "${var.instance_types[var.environment]}"
  instance_count         = "${lookup(var.instances, "${var.environment}_zone_a", 0)}"
  # other config
}

_zone_instances/main.tf_

resource "aws_instance" "cassandra_node" {
  ami                    = "${var.ami}"
  instance_type          = "${var.instance_type}"
  subnet_id              = "${var.subnet_id}"
  vpc_security_group_ids = ["${compact(split(", ", var.vpc_security_group_ids))}"]
  key_name               = "${var.key_name}"
  count                  = "${var.instance_count}"
  # other config
}

This is happening to me, too. In my case, I'm trying to rename a security group, which requires that Terraform destroy the group and recreate it. However, all instances have to be removed from the group and all references to the group must be removed before it can be destroyed. Terraform doesn't remove instances from the group or remove references to the group before it tries to destroy it. Error message:

* aws_security_group.jump: DependencyViolation: resource sg-9691e2ee has a dependent object
    status code: 400, request id: 4de9318b-fee9-48a4-90dd-46fc6336298b

I believe the answer is in explicit dependencies:

Terraform ensures that dependencies are successfully created before a resource is created. _During a destroy operation, Terraform ensures that this resource is destroyed before its dependencies._

Emphasis mine. Sadly, this doesn't seem to work for me (though I'm mentioning a few resources and only needing to destroy a couple).
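
For reference, an explicit dependency of the kind that documentation passage describes might look like the sketch below. It uses the 0.11-style syntax seen elsewhere in this thread. The resource name example and the variables ami, vpc_id, and subnet_id are placeholders for illustration, not taken from anyone's configuration. Note that the interpolated reference to aws_security_group.example.id already creates an implicit dependency, so depends_on is redundant here, and as reported above it did not resolve the problem in practice.

```hcl
# Placeholder inputs (hypothetical names, supplied by the caller).
variable "ami" {}
variable "vpc_id" {}
variable "subnet_id" {}

resource "aws_security_group" "example" {
  name_prefix = "example-"
  vpc_id      = "${var.vpc_id}"
}

resource "aws_instance" "example" {
  ami           = "${var.ami}"
  instance_type = "t2.micro"
  subnet_id     = "${var.subnet_id}"

  # This reference alone makes the instance depend on the security group.
  vpc_security_group_ids = ["${aws_security_group.example.id}"]

  # Explicit dependency: the group should be created before the instance,
  # and the instance should be destroyed before the group.
  depends_on = ["aws_security_group.example"]
}
```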

Same issue here with latest version (v0.10.8).

Ran into this issue on 0.10.8 as well.

A combination of create before destroy and name_prefix (so the security groups don't have conflicting names) solves this for me:

resource "aws_security_group" "example" {
  name_prefix = "example-"
  // other stuff

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_instance" "example" {
  vpc_security_group_ids = ["${aws_security_group.example.id}"]
  // other stuff
}

Then the new SG gets created, swapped out on the ENI for the EC2 instance and then the old SG can be deleted.

Thanks @b-dean! This approach worked for me as well.

@hydroxide Any chance you can re-open this? It's happening for lots of us, apparently (me included). It's very simple to reproduce: remove a security group that's assigned to at least one server. Terraform hangs until its timeout. It's possible to work around this by manually removing the assignment (in the AWS UI) in parallel, but that wouldn't be fun for a significant number of instances.

I am currently experiencing this same behavior. I renamed a security group attached to an instance resource; Terraform detects that I'm no longer using said security group and hangs until it times out in an attempt to delete it. If I remove the security group in the AWS console or destroy the instance, Terraform returns successfully.

I am also experiencing a similar issue in Terraform v0.11.7. I am unable to delete existing security groups that no longer apply to an EC2 instance, where attachment is controlled dynamically by a boolean variable for an environment. "terraform plan" looks OK, but "terraform apply" produces an error after 10 minutes. I think Terraform is trying to delete the security groups first and then detach them from the EC2 instance. It should be the other way round. For now, I am manually detaching the security groups from the EC2 instance in the AWS Console and then running "terraform apply", which seems to work.

Exactly the same happens when the security group is in use by an ECS service.
Terraform doesn't detach it first, and is thus unable to destroy the security group.

I have just run into this issue as well while writing tests for a custom terraform provider. I wrote a test that, as part of the same step, removed a link from resource A->B and deleted resource B. However, terraform attempted to delete B prior to the update of A finishing.

This behaviour seems very counterintuitive to me. While it probably leads to faster-running applies in some (many?) cases, it seems that without a lot more metadata being provided by either the end-user or the providers, terraform would have no way to know when this is safe and thus should default to the safer behaviour of waiting for updates to all resources which depend on a resource scheduled for removal before actually proceeding with that removal.

After some searching around I've found the following issues that I think are either this same issue, or are closely related. In some cases, create_before_destroy successfully works around this issue, but I don't believe it would help the case I've described above.
Similar/identical: terraform-providers/terraform-provider-aws#4852
Related: #17614 #532

I'm facing this issue when destroying my cloud lab, which is completely deployed using Terraform (VPC, Route Tables, Subnets, EIP, EC2, EBS, Route53). It looks like Terraform is trying to remove the Route Table before removing the EC2 instances or Subnets.

I just hit this with * provider.aws: version = "~> 1.26" and terraform 0.11.8

Also experiencing this on:
Terraform v0.11.8
provider.aws v1.39.0

In my case there are left over network interfaces (not in-use), so terraform hangs trying to delete associated security groups.

+1

+1

I'm experiencing this issue with Terraform v0.11.8 and provider.AWS v1.41.0.

Also in

Terraform v0.11.10
+ provider.aws v1.41.0

Any chance this will be fixed in future releases? It's been an issue for a while now and even lifecycle hooks with 'create_before_destroy' won't work.

Hi all,

From reading over the comment thread here it seems like there are a few subtly different problems being described:

  • The original issue was that during terraform destroy (presumably with a plan to destroy both the instance _and_ the security group) Terraform attempted to destroy the security group first, even though the EC2 instance depends on it. This does seem like a Terraform core bug, since there's nothing the provider could do to help with this.
  • It seems like some of you, on the other hand, are talking about the situation where both an instance and a security group are already present in state and the security group has been removed from config while the instance remains. In this case, Terraform should be processing the update of the EC2 instance's security groups first, before attempting to destroy the security group. This could either be a Terraform Core problem (dependency edges inverted) or it could be an AWS provider bug: I know from previous experience that detaching objects from a security group does not have an immediate effect in EC2, and so there can be a delay before the group becomes deletable.

In order to understand better what's going on here, it would be helpful if at least one of the participants in this thread for each of those situations could capture a trace log during a failing operation. To do that:

  • Run terraform apply with the environment variable TF_LOG=trace set.
  • Capture all of the output (the log output and Terraform's usual CLI output) and paste it into a gist (since log output is too verbose for GitHub comments)
  • Share the link to the gist here along with an indication of which of the above two bugs you are seeing (or, if you're seeing some other situation with similar symptoms, a description of that situation.)

The log includes detailed information about how Terraform is constructing the graph, which will allow us to see whether there is indeed a bug in the graph construction (dependencies in the wrong order, or missing) or if something more subtle and resource-type-specific is going on here.

Thanks to everyone for sharing descriptions of the problem above, and sorry for the delay in responding here.

My use case (simplified):

File my_ec2.tf

  • resource ec2_instance from slightly adjusted terraform ec2_instance module, therefore calling this module locally from modules folder.

    • input vpc_security_group_ids = ["${aws_security_group.my_sg.id}"]

  • resource ec2_security_group "my_sg"

terraform plan: creates the resources correctly, in the proper order.
mv my_ec2.tf my_ec2.tf.removed
terraform plan -out remove.plan: selects all missing resources successfully and marks them for removal.
terraform apply remove.plan: Terraform tries to remove the security group first.

I've checked the terraform.tfstate file to see whether dependencies were correctly detected. However, the only dependencies identified for module ec2_cluster are internal module resources:

            "path": [
                "root",
                "ec2_cluster"
            ],
            "outputs": {},
            "resources": {
                "aws_instance.this_t2": {
                    "type": "aws_instance",
                    "depends_on": [
                        "local.is_t2_instance_type"
                     ],

For the security group, the only dependency detected is on another security group referenced in "my_sg".

My intention is to reuse my_ec2.tf as module later because it is a complete stack of ALB, EC2, EBS, RT53 for, let's say, standalone Apache httpd server.
If this simplified explanation doesn't help, I will see what I can do about capturing the traces requested above. Unfortunately, I have no time available for further debugging, so I cannot promise anything right now.

I hit the same problem with a security group list in an RDS instance's vpc_security_group_ids.
I wanted to add an extra group, based on a flag, to give developers access from the dev network.
When I set that flag to false, Terraform tries to destroy the security group before removing it from the RDS object, and it fails because of the dependency.

+1 -- running into this problem when I have an RDS instance with two security groups, and I would like to remove one. So, @apparentlymart this is in essence the same as the latter case you mentioned: terraform tries to delete the security group before removing the security group from the RDS instance.

I am a bit hesitant to add the log output from TF_LOG=trace -- I need to audit it first to make sure no undesired information is included (e.g., from connecting to the state S3 bucket, in my case). Ideally there would be some way to grep for exactly the lines that are relevant to seeing if there is a bug in building the dependency graph. I would feel more comfortable providing just that, if it is possible.

In the logs you should find some lines containing the string TransitiveReductionTransformer. This will appear once for each graph Terraform built during the operation. I'm particularly interested in the lines that state something like "Graph after step *TransitiveReductionTransformer" (I don't have the exact wording to hand, but grepping for that type name should reduce it down), which will be followed by a list of graph nodes followed by the names of nodes they are connected to (indented).

If you can share the entire list of nodes after that initial log line, that will at least allow me to see whether there is the expected dependency edges between the resources, though if it turns out that there is then we may need additional detail to fully explain it.

If the configuration has other objects in it aside from the RDS instance and security groups, it would help also to know the addresses of the resources in question (so we can easily identify them in the list) and, ideally, the full configuration sources for those resources so we can see how the dependencies between them are declared.

Thanks!

Terraform v0.11.11 + provider.aws v1.45.0

My use case was to modify my security group and then apply. I'm hitting the same DependencyViolation while trying to destroy the security group. Trace log extracts following each TransitiveReductionTransformer line are at https://gist.github.com/schley2103/5834f2f0b7c590352c2be4f7cb717594.js. It built 5 graphs.

Thanks!

Thanks @b-dean. This issue is the number one Google result for "terraform aws_security_group create before destroy", which I searched for to see whether that would resolve the problem where Terraform hangs indefinitely when you re-create a security group attached to running instances. Seems like a valid workaround in the interim.

I think @b-dean's solution also works with name, not just with name_prefix.
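
A minimal sketch of that variant, under the assumption that the replacement is triggered by changing name itself, so the old and new groups never share a name (AWS rejects duplicate group names within a VPC, which is why name_prefix is the safer choice when some other attribute forces the replacement). The resource name example, the value "example-v2", and the variable vpc_id are placeholders for illustration.

```hcl
# Placeholder input (hypothetical name, supplied by the caller).
variable "vpc_id" {}

resource "aws_security_group" "example" {
  # Changing name forces a new security group. Because the new value differs
  # from the old one, the replacement can be created while the old group
  # still exists.
  name   = "example-v2"
  vpc_id = "${var.vpc_id}"

  lifecycle {
    # Create the replacement first, so dependents can be switched over to it
    # before the old group is deleted.
    create_before_destroy = true
  }
}
```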

Also happens with aws_rds_cluster. Terraform 0.12

Happening to me too on 0.12. If an RDS is using a security group, that group cannot ever be destroyed. It's immortal.

Happening to me with this config:

terraform -v
Terraform v0.12.7
+ provider.aws v2.25.0

In my use case, I already have an EC2 instance and a security group attached to it; it's failing while trying to destroy the SG after I made changes to it, without detaching it from the instance first.

@jbardin can you reopen the issue? The bug is still present in:

Terraform v0.12.23
+ provider.aws v2.53.0

Example:

resource "aws_instance" "test" {
  ami                    = "ami-077a5b1762a2dde35" # Ubuntu 18.04 Bionic
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.test_sg.id]
}

resource "aws_eip" "ip" {
  vpc      = true
  instance = aws_instance.test.id
}

resource "aws_vpc" "main" {
  cidr_block = "172.31.0.0/16"

  tags = {
    Name = "main"
  }
}

resource "aws_security_group" "test_sg" {
  name        = "rules"
  description = "Traffic rules"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"

    cidr_blocks = [
      "0.0.0.0/0"
    ]
    ipv6_cidr_blocks = [
      "::/0",
    ]
  }
}

I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
