Terraform-aws-eks: Destroy never succeeds, DependencyViolation for Security Group

Created on 25 Feb 2019 · 29 Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

I'm submitting a...

  • [x] bug report
  • [ ] feature request
  • [x] support request
  • [x] kudos, thank you, warm fuzzy

What is the current behavior?

A cluster cannot be destroyed without manual intervention

If this is a bug, how to reproduce? Please include a code sample if relevant.

Given this (stripped down but working version) of the cluster:

data "aws_region" "current" {}

data "aws_availability_zones" "az" {}

locals {
  worker_groups_launch_template = [
    {
      instance_type        = "t2.small"
      subnets              = "${join(",", module.vpc.private_subnets)}"
      asg_desired_capacity = "2"
    },
    {
      instance_type                            = "t2.small"
      subnets                                  = "${join(",", module.vpc.private_subnets)}"
      override_instance_type                   = "t3.small"
      asg_desired_capacity                     = "2"
      spot_instance_pools                      = 10
      on_demand_percentage_above_base_capacity = "0"
    },
  ]
}

module "vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "1.57.0"
  name               = "test-eks-thingy"
  cidr               = "192.168.0.0/16"
  azs                = "${data.aws_availability_zones.az.names}"
  private_subnets    = ["192.168.1.0/24", "192.168.2.0/24", "192.168.3.0/24"]
  public_subnets     = ["192.168.4.0/24", "192.168.5.0/24", "192.168.6.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = true
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "2.2.1"

  cluster_version = "1.11"
  cluster_name    = "test-eks-cluster"
  subnets         = ["${module.vpc.private_subnets}"]
  vpc_id          = "${module.vpc.vpc_id}"

  worker_groups_launch_template        = "${local.worker_groups_launch_template}"
  worker_group_launch_template_count   = "1"

  map_roles          = []
  map_roles_count    = 0
  map_users          = []
  map_users_count    = 0
  map_accounts       = []
  map_accounts_count = 0
}

terraform apply completes fine; however, terraform destroy fails with:

module.eks.aws_security_group.workers: Still destroying... (ID: sg-0e5f0b620ea6e8bc0, 9m50s elapsed)
module.eks.aws_security_group.workers: Still destroying... (ID: sg-0e5f0b620ea6e8bc0, 10m0s elapsed)

Error: Error applying plan:

1 error(s) occurred:

* module.eks.aws_security_group.workers (destroy): 1 error(s) occurred:

* aws_security_group.workers: DependencyViolation: resource sg-0e5f0b620ea6e8bc0 has a dependent object
        status code: 400, request id: bc0096af-9940-42bf-a727-e1a51c9d21b3

The Network Interfaces page in the AWS console ends up looking like this:
[image: screenshot of the Network Interfaces list]

Manually detaching the green interfaces and deleting them all allows terraform to complete destruction.

What's the expected behavior?

Cluster can be destroyed entirely by terraform itself.

Are you able to fix this problem and submit a PR? Link here if you have already.

Environment details

  • Affected module version: 2.2.1
  • OS: macOS 10.14.3
  • Terraform version:
Terraform v0.11.11
+ provider.aws v1.60.0
+ provider.local v1.1.0
+ provider.null v2.0.0
+ provider.template v2.0.0

Any other relevant info

Resources remaining after attempted destroy:

Terraform will perform the following actions:

  - module.eks.aws_eks_cluster.this

  - module.eks.aws_iam_role.cluster

  - module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy

  - module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy

  - module.eks.aws_security_group.cluster

  - module.eks.aws_security_group.workers

  - module.vpc.aws_subnet.private[0]

  - module.vpc.aws_subnet.private[1]

  - module.vpc.aws_subnet.private[2]

  - module.vpc.aws_vpc.this

All 29 comments

I'm also facing this issue and can attest to the same behavior.

I'm not sure this is an issue with the module itself. I had the same problem, and to solve it I had to remove all the Kubernetes services before destroying the cluster.

@luisc09 manually?

Yes, manually. If you create a service of type LoadBalancer, k8s will create an ELB and a security group for it, and this will block the destroy process.

My latest run of EKS cluster creation followed by destroy was successful. Not sure what has changed, but this didn't work before.

Note: in my case, I hadn't deployed any apps after provisioning the cluster.

> Not sure what has changed, but this didn't work before.

OK no worries. Feel free to debug and add info here.

I have just started using this module, since I have moved from on-premises and am trying to create an EKS cluster with Terraform. I used a slightly modified version of the example fixture, ran apply and then destroy without any other interaction with the EKS cluster, and got two DependencyViolation errors for security groups still attached to network interfaces.

Hi,
I just tested merge request #311. It fixes my issue with the _DependencyViolation_, so I can destroy the cluster without any problem.

OK cool, then perhaps we merge that PR to solve this issue. It sounds like it would be a popular option.

Question: Don't you have leftover ENIs and security groups after the cluster is destroyed?

Hi, I don't think so. The setup is very simple since it was just for a PoC.
You can dig further into the code here.

Below are some fragments of the Terraform code:

locals {
  tags = {
    Environment = "${var.environment}"
    Owner       = "${var.owner}"
    Workspace   = "${var.cluster_name}"
  }
  worker_groups = [
    {
      instance_type        = "${var.instance_type}"
      key_name             = "${var.key_name}"
      subnets              = "${join(",", var.subnets)}"
      additional_userdata  = "${file("${path.module}/user_data.sh")}"
      asg_desired_capacity = "${var.asg_desired_capacity}"
    },
  ]
  worker_groups_launch_template = [
    {
      instance_type                            = "${var.instance_type}"
      key_name                                 = "${var.key_name}"
      subnets                                  = "${join(",", var.subnets)}"
      additional_userdata                      = "${file("${path.module}/user_data.sh")}"
      asg_desired_capacity                     = "${var.asg_spot_desired_capacity}"
      spot_instance_pools                      = "${var.spot_instance_pools}"
      on_demand_percentage_above_base_capacity = "0"
    },
  ]
}

module "eks" {
  source                               = "./terraform-aws-eks"
  #source                               = "terraform-aws-modules/eks/aws"
  #version                              = "2.3.1" 
  cluster_name                         = "${var.cluster_name}"
  subnets                              = ["${var.subnets}"]
  vpc_id                               = "${var.vpc_id}"
  worker_groups                        = "${local.worker_groups}"
  worker_groups_launch_template        = "${local.worker_groups_launch_template}"
  worker_group_count                   = 1
  worker_group_launch_template_count   = 1
  worker_additional_security_group_ids = ["${aws_security_group.eks_sec_group.id}"]

  tags                                 = "${local.tags}"
}

resource "aws_security_group" "eks_sec_group" {
  name_prefix             = "eks-sec-group"
  description             = "Security to be applied for eks nodes"
  vpc_id                  = "${var.vpc_id}"

  ingress {
    from_port             = 22
    to_port               = 22
    protocol              = "tcp"
    cidr_blocks           = [
      "10.0.0.0/8",
      "172.16.0.0/12",
      "192.168.0.0/16",
    ]
  }
  tags                    = "${merge(local.tags, map("Name", "${var.cluster_name}-database_sec_group"))}"      
}

data "aws_availability_zones" "available" {}

locals {
  network_count = "${length(data.aws_availability_zones.available.names)}"

  tags = {
    Environment = "${var.environment}"
    Owner       = "${var.owner}"
    Workspace   = "${var.cluster_name}"
  }
}

resource "aws_route53_zone" "hosted_zone" {
  name      = "eks-lab.com"
  comment   = "Private hosted zone for eks cluster"

  vpc {
    vpc_id  = "${module.vpc.vpc_id}"
  }

  tags      = "${local.tags}"
}

module "vpc" {
  source               = "terraform-aws-modules/vpc/aws"
  version              = "1.60.0"
  name                 = "${var.cluster_name}"
  cidr                 = "${var.cidr_block}"
  azs                  = ["${data.aws_availability_zones.available.names[0]}", "${data.aws_availability_zones.available.names[1]}", "${data.aws_availability_zones.available.names[2]}"]
  public_subnets      = [
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, 0)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, 1)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, 2)}"
  ]
  private_subnets       = [
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 1)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 2)}"
  ]
  database_subnets  = [
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 3)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 4)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 5)}"
  ]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
  tags                 = "${merge(local.tags, map("kubernetes.io/cluster/${var.cluster_name}", "shared"))}"
}

terraform {
  required_version = ">= 0.11.8"
}

provider "aws" {
  version    = ">= 1.47.0"
  region     = "${var.region}"
}

module "vpc" {
  source           = "./vpc"
  owner            = "${var.owner}"
  environment      = "${var.environment}"
  cluster_name     = "${var.cluster_name}"
  cidr_block       = "${var.cidr_block}"
  cidr_subnet_bits = "${var.cidr_subnet_bits}"
}

module "rds" {
...
}

module "eks" {
  source                    = "./eks"
  owner                     = "${var.owner}"
  environment               = "${var.environment}"
  cluster_name              = "${var.cluster_name}"
  vpc_id                    = "${module.vpc.vpc_id}"
  key_name                  = "${module.bastion.key_name}"
  subnets                   = "${module.vpc.private_subnets}"
  instance_type             = "${var.eks_instance_type}"
  asg_desired_capacity      = "${var.eks_asg_desired_capacity}"
  asg_spot_desired_capacity = "${var.eks_asg_spot_desired_capacity}"
}


module "bastion" {
 ...
}

Also experiencing this issue in 3.0.0:

Error: Error applying plan:

2 error(s) occurred:

* aws_security_group.all_worker_mgmt (destroy): 1 error(s) occurred:

* aws_security_group.all_worker_mgmt: DependencyViolation: resource sg-0caaa8517b45c88af has a dependent object
    status code: 400, request id: 075224e9-6732-40ac-a77d-e0935b7b1bed
* module.eks.aws_security_group.workers (destroy): 1 error(s) occurred:

* aws_security_group.workers: DependencyViolation: resource sg-0c333ddbea0342038 has a dependent object
    status code: 400, request id: a107a26c-bac7-498b-af73-8b76e4e52c58

Running it the second time was successful.

This is happening for me too, and I think I know why.

I set up my cluster to have private access only. The ENIs that hang around and prevent deletion of the SG are created by Amazon accounts. I suspect they're created in order to allow access from the workers to the endpoint via private IPs.

In any case, it seems to be an order-of-operations issue here, as if you first manually destroy the EKS cluster (via console or CLI), the ENIs disappear and destruction of all other resources proceeds without issue. Of course, that confuses things because destroying the cluster first and then the workers doesn't make much sense. Or maybe it doesn't make a difference? That could be a solution to this.

This issue occurs for me in 5.0.0 https://cloud.drone.io/astronomer/terraform-kubernetes-astronomer/8/1/4

I think it's because I am using the parameter worker_additional_security_group_ids

I'm getting the same with 5.1.0 as well, if I use 'worker_groups' to create the worker node pools. The ENIs don't get destroyed with the instances, which prevents the destruction of the worker node security group. But if I use 'worker_groups_launch_template' to create the worker node pools, then the ENIs get destroyed with the instances, and the SG destruction works as expected.

Is there a down side to using worker_groups_launch_template? Maybe it could be the default or recommended way of creating worker node pools?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity since being marked as stale.

/remove-lifecycle stale

@canhnt are you still experiencing this issue? Is your issue related to this PR https://github.com/terraform-aws-modules/terraform-aws-eks/pull/815?

I got the following error when destroying eks with the module terraform-aws-eks:v11.0.0:

Error: Error deleting security group: DependencyViolation: resource sg-017efb07a174d33dc has a dependent object

I don't think it relates to #815, because the cluster was created with a public access endpoint.

DependencyViolation is an error returned by the AWS API when a resource is still in use.

Can you find out what was still using the security group when terraform tried to delete it?
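
One way to answer this from within Terraform is to query the ENIs that still reference the group. This is a minimal sketch, assuming Terraform 0.12+ syntax and an AWS provider release that ships the `aws_network_interfaces` data source (the older provider versions used earlier in this thread may not have it). The variable and output names are hypothetical.

```hcl
# Hypothetical input: the ID of the security group that fails to delete.
variable "blocked_security_group_id" {
  type = string
}

# "group-id" is an EC2 DescribeNetworkInterfaces filter: it matches
# every ENI that has this security group attached.
data "aws_network_interfaces" "using_sg" {
  filter {
    name   = "group-id"
    values = [var.blocked_security_group_id]
  }
}

output "enis_using_security_group" {
  value = data.aws_network_interfaces.using_sg.ids
}
```

If no ENI turns up, the dependency may be something else, for example a rule in another security group that references this one.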

> DependencyViolation is an error returned by the AWS API when a resource is still in use.
>
> Can you find out what was still using the security group when terraform tried to delete it?

We created EKS with custom networking (pod IPs are in different subnets).

The security group that Terraform failed to delete is attached to the ENIs in the pod subnets. After the failure, I checked: the ENIs are in the "available" state and can be deleted manually, after which the security group can be deleted as well.

I suspect this bug is related to the ENI leak in which additional ENIs are not deleted when a worker node is decommissioned.

Update: I can reproduce the issue. When a worker node is deleted and in the terminated state, two ENIs with the tag node.k8s.amazonaws.com/instance_id=<id> are left in the available state and are not deleted. As a result, the SG for the worker nodes (with description "Security group for all nodes in the cluster. ") cannot be deleted.
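
The leaked ENIs described above can be listed with two filters, one on status and one on the tag key. This is only a sketch under the same assumptions (Terraform 0.12+ syntax, an AWS provider that includes the `aws_network_interfaces` data source); the resource and output names are hypothetical, and it only lists the interfaces, it does not delete them.

```hcl
data "aws_network_interfaces" "leaked_cni" {
  # Only ENIs that are not attached to any instance.
  filter {
    name   = "status"
    values = ["available"]
  }

  # Only ENIs carrying the tag key reported above, whatever its value.
  filter {
    name   = "tag-key"
    values = ["node.k8s.amazonaws.com/instance_id"]
  }
}

output "leaked_cni_eni_ids" {
  value = data.aws_network_interfaces.leaked_cni.ids
}
```

Deleting the listed interfaces, manually as described above, should then let the security group deletion go through.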

I think I ran into this today, but I'm not using this module (I use the AWS provider directly).

My destroy job was stuck destroying a public subnet and timed out. I found an EKS security group which was attached to an ENI, so I deleted both and then the subnet was destroyed normally on the second attempt. I guess the subnet was waiting on the security group, and the security group was waiting on the ENI like @canhnt mentioned?

For context, I had a LoadBalancer deployed via Kubernetes when I started the Terraform destroy, and I used aws_eks_node_group to provision the workers.

Hope this helps.

Same here. Still experiencing this during destroy, but I am using the private endpoint.

I ran into this as well: private ENIs lingering after a terraform destroy on a vanilla/fresh cluster.
It seems that this is set by:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/7de18cd9cd882f6ad105ca375b13729537df9e68/workers_launch_template.tf#L226

So, I added eni_delete to my worker_groups config. That is:

module "eks-cluster" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "foobar"
  cluster_version = "1.16"
  ...

  worker_groups = [
    {
      ...
      eni_delete = "true"
    }
  ]
}

This seems to have corrected the issue. I am not using templates in my code explicitly (only the ones in module, implicitly). What I am not understanding is if local.tf has eni_delete = "true", why did I have to do it explicitly?

Hi @sighupper .

The eni_delete setting only applies to launch templates. Setting the value in worker_groups will make no changes to how terraform runs.
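
For illustration, this is roughly where the setting would be read, assuming Terraform 0.12+ syntax and a module release that still accepts `worker_groups_launch_template`. The variables, instance type, and cluster name are placeholders, and the module version is deliberately left unpinned because the comment above does not state one.

```hcl
# Hypothetical inputs; substitute your own network values.
variable "vpc_id" {
  type = string
}

variable "subnets" {
  type = list(string)
}

module "eks-cluster" {
  source       = "terraform-aws-modules/eks/aws"
  cluster_name = "foobar"
  vpc_id       = var.vpc_id
  subnets      = var.subnets

  # eni_delete is read from launch-template worker groups,
  # not from plain worker_groups entries.
  worker_groups_launch_template = [
    {
      instance_type = "t3.small"
      eni_delete    = true
    },
  ]
}
```

Since the module default is reported above to be true already, setting it explicitly here mainly documents intent.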

Today I ran into this issue; I will troubleshoot and add more details. I suspect the cause is deploying Kubernetes resources into the cluster, which in turn creates AWS resources that make the destroy hang.

terraform-aws-modules/eks/aws =>  v12.2.0
cluster_version => 1.17


➜ terraform --version
Terraform v0.12.21
+ provider.aws v2.70.0
+ provider.external v1.2.0
+ provider.helm v1.2.4
+ provider.kubernetes v1.12.0
+ provider.local v1.4.0
+ provider.null v2.1.2
+ provider.random v2.3.0
+ provider.template v2.1.2
+ provider.tls v2.2.0

I found my issue: I had a null_resource creating an IngressRoute, which in turn created more resources. Although I was running terraform destroy in the directory that created these resources, the null_resource was only for creating, so it had no way to destroy what it created.
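
A destroy-time provisioner is one way to give such a null_resource a cleanup step. This is a sketch only, assuming Terraform 0.12+ (0.11 needs `when = "destroy"` quoted), kubectl on the PATH and pointed at the cluster, and a hypothetical manifest file `ingress-route.yaml`.

```hcl
resource "null_resource" "ingress_route" {
  # Create-time step (runs on apply).
  provisioner "local-exec" {
    command = "kubectl apply -f ingress-route.yaml"
  }

  # Destroy-time step: without this, terraform destroy removes the
  # null_resource from state but leaves the IngressRoute in the cluster.
  provisioner "local-exec" {
    when    = destroy
    command = "kubectl delete -f ingress-route.yaml --ignore-not-found"
  }
}
```

The cleanup only works while the cluster is still reachable, so the null_resource has to be destroyed before the cluster itself.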

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity since being marked as stale.
