Terraform-aws-eks: Destroy never succeeds, DependencyViolation for Security Group

Created on 25 Feb 2019 · 29 Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

I'm submitting a...

[x] bug report
[ ] feature request
[x] support request
[x] kudos, thank you, warm fuzzy

What is the current behavior?

A cluster cannot be destroyed without manual intervention.

If this is a bug, how to reproduce? Please include a code sample if relevant.

Given this (stripped down but working version) of the cluster:

data "aws_region" "current" {}

data "aws_availability_zones" "az" {}

locals {
  worker_groups_launch_template = [
    {
      instance_type        = "t2.small"
      subnets              = "${join(",", module.vpc.private_subnets)}"
      asg_desired_capacity = "2"
    },
    {
      instance_type                            = "t2.small"
      subnets                                  = "${join(",", module.vpc.private_subnets)}"
      override_instance_type                   = "t3.small"
      asg_desired_capacity                     = "2"
      spot_instance_pools                      = 10
      on_demand_percentage_above_base_capacity = "0"
    },
  ]
}

module "vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "1.57.0"
  name               = "test-eks-thingy"
  cidr               = "192.168.0.0/16"
  azs                = "${data.aws_availability_zones.az.names}"
  private_subnets    = ["192.168.1.0/24", "192.168.2.0/24", "192.168.3.0/24"]
  public_subnets     = ["192.168.4.0/24", "192.168.5.0/24", "192.168.6.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = true
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "2.2.1"

  cluster_version = "1.11"
  cluster_name    = "test-eks-cluster"
  subnets         = ["${module.vpc.private_subnets}"]
  vpc_id          = "${module.vpc.vpc_id}"

  worker_groups_launch_template        = "${local.worker_groups_launch_template}"
  worker_group_launch_template_count   = "1"

  map_roles          = []
  map_roles_count    = 0
  map_users          = []
  map_users_count    = 0
  map_accounts       = []
  map_accounts_count = 0
}

terraform apply completes fine; however, terraform destroy fails with:

module.eks.aws_security_group.workers: Still destroying... (ID: sg-0e5f0b620ea6e8bc0, 9m50s elapsed)
module.eks.aws_security_group.workers: Still destroying... (ID: sg-0e5f0b620ea6e8bc0, 10m0s elapsed)

Error: Error applying plan:

1 error(s) occurred:

* module.eks.aws_security_group.workers (destroy): 1 error(s) occurred:

* aws_security_group.workers: DependencyViolation: resource sg-0e5f0b620ea6e8bc0 has a dependent object
        status code: 400, request id: bc0096af-9940-42bf-a727-e1a51c9d21b3

Network Interfaces in the AWS console ends up looking like this:

Manually detaching the green interfaces and deleting them all allows terraform to complete destruction.

What's the expected behavior?

Cluster can be destroyed entirely by terraform itself.

Are you able to fix this problem and submit a PR? Link here if you have already.

Environment details

Affected module version: 2.2.1
OS: macOS 10.14.3
Terraform version:

Terraform v0.11.11
+ provider.aws v1.60.0
+ provider.local v1.1.0
+ provider.null v2.0.0
+ provider.template v2.0.0

Any other relevant info

Resources remaining after attempted destroy:

Terraform will perform the following actions:

  - module.eks.aws_eks_cluster.this

  - module.eks.aws_iam_role.cluster

  - module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSClusterPolicy

  - module.eks.aws_iam_role_policy_attachment.cluster_AmazonEKSServicePolicy

  - module.eks.aws_security_group.cluster

  - module.eks.aws_security_group.workers

  - module.vpc.aws_subnet.private[0]

  - module.vpc.aws_subnet.private[1]

  - module.vpc.aws_subnet.private[2]

  - module.vpc.aws_vpc.this

stale

diversario

👍22 👎1

Most helpful comment

This is happening for me too, and I think I know why.

I set up my cluster to have private access only. The ENIs that hang around and prevent deletion of the SG are created by Amazon accounts. I suspect they're created in order to allow access from the workers to the endpoint via private IPs.

In any case, it seems to be an order-of-operations issue here, as if you first manually destroy the EKS cluster (via console or CLI), the ENIs disappear and destruction of all other resources proceeds without issue. Of course, that confuses things because destroying the cluster first and then the workers doesn't make much sense. Or maybe it doesn't make a difference? That could be a solution to this.

AirbornePorcine on 16 May 2019

👍2

All 29 comments

I'm also facing this issue and can attest to the same behavior.

uday1bhanu on 26 Feb 2019

I'm not sure this is an issue with the module itself. I had the same problem, and to solve it I had to remove all the Kubernetes services before destroying the cluster.

luisc09 on 27 Feb 2019

@luisc09 manually?

diversario on 27 Feb 2019

Yes, manually. If you create a Service of type LoadBalancer, k8s will create an ELB and a security group for that ELB, and this will stop the destroy process.

max-rocket-internet on 5 Mar 2019

👍2
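
As an illustration of max-rocket-internet's point, one option is to declare the load-balanced Service in the same Terraform configuration, so that it is tracked in state and removed by terraform destroy instead of being left behind. This is only a sketch in Terraform 0.12 syntax: the resource name, labels, and ports are hypothetical, it assumes the kubernetes provider is already configured against the cluster, and Terraform may still need to destroy the Service before the worker nodes for the ELB cleanup to finish.

```hcl
# Hypothetical example: the name, labels, and ports are placeholders.
resource "kubernetes_service" "example" {
  metadata {
    name = "example"
  }

  spec {
    # Kubernetes provisions an ELB and a security group for it in the VPC.
    type = "LoadBalancer"

    selector = {
      app = "example"
    }

    port {
      port        = 80
      target_port = 8080
    }
  }
}
```

Services created outside Terraform (kubectl, Helm run by hand) are not covered by this and would still need to be deleted manually before the destroy.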

My latest run of EKS cluster creation followed by destroy was successful. Not sure what has changed, but this didn't work before.

Note: In my case, I haven't deployed any apps after provisioning the cluster.

uday1bhanu on 7 Mar 2019

> Not sure what has changed, but this didn't work before.

OK no worries. Feel free to debug and add info here.

max-rocket-internet on 7 Mar 2019

I have just started using this module, since I have moved from on-premises and am trying to create an EKS cluster with Terraform. In my case I used a slightly modified version of the example fixture, then ran apply followed by destroy without any other interaction with the EKS cluster. I got two DependencyViolation errors for security groups attached to network interfaces.

jsa4000 on 28 Mar 2019

Hi,
I just tested merge request #311, and it fixes my issue with the _DependencyViolation_, so I can destroy the cluster without any problem.

jsa4000 on 29 Mar 2019

👍2

OK cool, then perhaps we merge that PR to solve this issue. It sounds like it would be a popular option.

Question: Don't you have leftover ENIs and security groups after the cluster is destroyed?

max-rocket-internet on 3 Apr 2019

👍2

Hi, I don't think so. The setup is very simple, since it was just for a PoC.
You can look further into the code here

Below are some fragments of the Terraform code

locals {
  tags = {
    Environment = "${var.environment}"
    Owner       = "${var.owner}"
    Workspace   = "${var.cluster_name}"
  }
  worker_groups = [
    {
      instance_type        = "${var.instance_type}"
      key_name             = "${var.key_name}"
      subnets              = "${join(",", var.subnets)}"
      additional_userdata  = "${file("${path.module}/user_data.sh")}"
      asg_desired_capacity = "${var.asg_desired_capacity}"
    },
  ]
  worker_groups_launch_template = [
    {
      instance_type                            = "${var.instance_type}"
      key_name                                 = "${var.key_name}"
      subnets                                  = "${join(",", var.subnets)}"
      additional_userdata                      = "${file("${path.module}/user_data.sh")}"
      asg_desired_capacity                     = "${var.asg_spot_desired_capacity}"
      spot_instance_pools                      = "${var.spot_instance_pools}"
      on_demand_percentage_above_base_capacity = "0"
    },
  ]
}

module "eks" {
  source                               = "./terraform-aws-eks"
  #source                               = "terraform-aws-modules/eks/aws"
  #version                              = "2.3.1" 
  cluster_name                         = "${var.cluster_name}"
  subnets                              = ["${var.subnets}"]
  vpc_id                               = "${var.vpc_id}"
  worker_groups                        = "${local.worker_groups}"
  worker_groups_launch_template        = "${local.worker_groups_launch_template}"
  worker_group_count                   = 1
  worker_group_launch_template_count   = 1
  worker_additional_security_group_ids = ["${aws_security_group.eks_sec_group.id}"]

  tags                                 = "${local.tags}"
}

resource "aws_security_group" "eks_sec_group" {
  name_prefix             = "eks-sec-group"
  description             = "Security to be applied for eks nodes"
  vpc_id                  = "${var.vpc_id}"

  ingress {
    from_port             = 22
    to_port               = 22
    protocol              = "tcp"
    cidr_blocks           = [
      "10.0.0.0/8",
      "172.16.0.0/12",
      "192.168.0.0/16",
    ]
  }
  tags                    = "${merge(local.tags, map("Name", "${var.cluster_name}-database_sec_group"))}"      
}

data "aws_availability_zones" "available" {}

locals {
  network_count = "${length(data.aws_availability_zones.available.names)}"

  tags = {
    Environment = "${var.environment}"
    Owner       = "${var.owner}"
    Workspace   = "${var.cluster_name}"
  }
}

resource "aws_route53_zone" "hosted_zone" {
  name      = "eks-lab.com"
  comment   = "Private hosted zone for eks cluster"

  vpc {
    vpc_id  = "${module.vpc.vpc_id}"
  }

  tags      = "${local.tags}"
}

module "vpc" {
  source               = "terraform-aws-modules/vpc/aws"
  version              = "1.60.0"
  name                 = "${var.cluster_name}"
  cidr                 = "${var.cidr_block}"
  azs                  = ["${data.aws_availability_zones.available.names[0]}", "${data.aws_availability_zones.available.names[1]}", "${data.aws_availability_zones.available.names[2]}"]
  public_subnets      = [
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, 0)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, 1)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, 2)}"
  ]
  private_subnets       = [
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 1)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 2)}"
  ]
  database_subnets  = [
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 3)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 4)}", 
    "${cidrsubnet(var.cidr_block, var.cidr_subnet_bits, local.network_count + 5)}"
  ]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
  tags                 = "${merge(local.tags, map("kubernetes.io/cluster/${var.cluster_name}", "shared"))}"
}

terraform {
  required_version = ">= 0.11.8"
}

provider "aws" {
  version    = ">= 1.47.0"
  region     = "${var.region}"
}

module "vpc" {
  source           = "./vpc"
  owner            = "${var.owner}"
  environment      = "${var.environment}"
  cluster_name     = "${var.cluster_name}"
  cidr_block       = "${var.cidr_block}"
  cidr_subnet_bits = "${var.cidr_subnet_bits}"
}

module "rds" {
...
}

module "eks" {
  source                    = "./eks"
  owner                     = "${var.owner}"
  environment               = "${var.environment}"
  cluster_name              = "${var.cluster_name}"
  vpc_id                    = "${module.vpc.vpc_id}"
  key_name                  = "${module.bastion.key_name}"
  subnets                   = "${module.vpc.private_subnets}"
  instance_type             = "${var.eks_instance_type}"
  asg_desired_capacity      = "${var.eks_asg_desired_capacity}"
  asg_spot_desired_capacity = "${var.eks_asg_spot_desired_capacity}"
}


module "bastion" {
 ...
}

jsa4000 on 5 Apr 2019

Also experiencing this issue in 3.0.0:

Error: Error applying plan:

2 error(s) occurred:

* aws_security_group.all_worker_mgmt (destroy): 1 error(s) occurred:

* aws_security_group.all_worker_mgmt: DependencyViolation: resource sg-0caaa8517b45c88af has a dependent object
    status code: 400, request id: 075224e9-6732-40ac-a77d-e0935b7b1bed
* module.eks.aws_security_group.workers (destroy): 1 error(s) occurred:

* aws_security_group.workers: DependencyViolation: resource sg-0c333ddbea0342038 has a dependent object
    status code: 400, request id: a107a26c-bac7-498b-af73-8b76e4e52c58

Running it the second time was successful.

danielsiwiec on 7 May 2019

This issue occurs for me in 5.0.0 https://cloud.drone.io/astronomer/terraform-kubernetes-astronomer/8/1/4

I think it's because I am using the parameter worker_additional_security_group_ids

sjmiller609 on 11 Jul 2019

I'm getting the same with 5.1.0 as well, if I use 'worker_groups' to create the worker node pools. The ENIs don't get destroyed with the instances, which prevents the destruction of the worker node security group. But if I use 'worker_groups_launch_template' to create the worker node pools, then the ENIs get destroyed with the instances, and the SG destruction works as expected.

Is there a downside to using worker_groups_launch_template? Maybe it could be the default or recommended way of creating worker node pools?

petrikero on 4 Aug 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 3 Jan 2020

This issue has been automatically closed because it has not had recent activity since being marked as stale.

stale[bot] on 3 Feb 2020

/remove-lifecycle stale

canhnt on 13 Apr 2020

@canhnt are you still experiencing this issue? Is your issue related to this PR: https://github.com/terraform-aws-modules/terraform-aws-eks/pull/815 ?

barryib on 13 Apr 2020

> @canhnt are you still experiencing this issue? Is your issue related to this PR #815?

I got the following error when destroying EKS with the module terraform-aws-eks:v11.0.0:

Error: Error deleting security group: DependencyViolation: resource sg-017efb07a174d33dc has a dependent object

I don't think it is related to #815, because the cluster was created with a public access endpoint.

canhnt on 14 Apr 2020

DependencyViolation is an error returned by the AWS API when a resource is still in use.

Can you find out what was still using the security group when terraform tried to delete it?

dpiddockcmp on 14 Apr 2020

> DependencyViolation is an error returned by the AWS API when a resource is still in use.
>
> Can you find out what was still using the security group when terraform tried to delete it?

We created the EKS cluster with custom networking (pod IPs are in different subnets).

The security group that Terraform failed to delete is attached to the ENI in the pod subnets. After the failure, I checked and the ENI was in the "available" state and could be deleted manually, after which I could delete the security group as well.

I suspect this bug is related to the ENI leak issue, where additional ENIs are not deleted when a worker node is decommissioned.

Update: I can reproduce the issue. When a worker node is deleted and in the terminated state, two ENIs with the tag node.k8s.amazonaws.com/instance_id=<id> are left in the available state and are not deleted. As a result, the SG for the worker nodes (with the description "Security group for all nodes in the cluster. ") cannot be deleted.

canhnt on 14 Apr 2020
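
To answer dpiddockcmp's question about what is still using the security group, the leftover ENIs canhnt describes could be listed from Terraform itself. The following is a rough sketch, not something posted in the thread: it assumes an AWS provider version that ships the aws_network_interfaces data source, and the names "leaked" and "leaked_eni_ids" are made up for illustration.

```hcl
# Hypothetical lookup of ENIs left behind after worker nodes terminate.
data "aws_network_interfaces" "leaked" {
  # ENIs that carry the tag key canhnt observed, with any value.
  filter {
    name   = "tag-key"
    values = ["node.k8s.amazonaws.com/instance_id"]
  }

  # Only ENIs that are no longer attached to an instance.
  filter {
    name   = "status"
    values = ["available"]
  }
}

# IDs of the matching ENIs, to be reviewed and deleted by hand.
output "leaked_eni_ids" {
  value = data.aws_network_interfaces.leaked.ids
}
```

This only reports the interfaces; deleting them is still a manual step, as in canhnt's description.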

I think I ran into this today, but I'm not using this module (I use the AWS provider directly).

My destroy job was stuck destroying a public subnet and timed out. I found an EKS security group which was attached to an ENI, so I deleted both and then the subnet was destroyed normally on the second attempt. I guess the subnet was waiting on the security group, and the security group was waiting on the ENI like @canhnt mentioned?

For context, I had a LoadBalancer deployed via Kubernetes when I started the Terraform destroy, and I used aws_eks_node_group to provision the workers.

Hope this helps.

synek on 18 Apr 2020

Same here. Still experiencing this during destroy, but I am using the private endpoint.

haofeif on 26 May 2020

I ran into this as well: private ENIs lingering after a terraform destroy on a vanilla/fresh cluster.
It seems that this is set by:

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/7de18cd9cd882f6ad105ca375b13729537df9e68/workers_launch_template.tf#L226

So, I added eni_delete to my worker_groups config. That is:

module "eks-cluster" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "foobar"
  cluster_version = "1.16"
  ...

  worker_groups = [
    {
      ...
      eni_delete = "true"
    }
  ]
}

This seems to have corrected the issue. I am not using templates in my code explicitly (only the ones in the module, implicitly). What I don't understand is: if local.tf already has eni_delete = "true", why did I have to set it explicitly?

sighupper on 24 Jun 2020

Hi @sighupper .

The eni_delete setting only applies to launch templates. Setting the value in worker_groups will make no changes to how terraform runs.

dpiddockcmp on 28 Jun 2020
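
Following dpiddockcmp's explanation and petrikero's earlier observation, the setting would only be read if the worker group is declared under worker_groups_launch_template. A minimal sketch in Terraform 0.12 syntax, reusing sighupper's placeholder names; the instance type and the var.subnets and var.vpc_id variables are assumptions and not part of the original post:

```hcl
module "eks-cluster" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "foobar"
  cluster_version = "1.16"
  subnets         = var.subnets # hypothetical variable
  vpc_id          = var.vpc_id  # hypothetical variable

  # eni_delete is only read for launch-template worker groups.
  worker_groups_launch_template = [
    {
      instance_type = "t3.small" # placeholder
      eni_delete    = true
    }
  ]
}
```

Since sighupper reports that local.tf already defaults eni_delete to true, setting it explicitly here is probably redundant; the substantive change is the switch from worker_groups to worker_groups_launch_template.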

Today I ran into this issue; I will troubleshoot and add more details. I do think it's the deployment of Kubernetes resources into the cluster, which then creates AWS resources, that is making this hang.

terraform-aws-modules/eks/aws =>  v12.2.0
cluster_version => 1.17


➜ terraform --version
Terraform v0.12.21
+ provider.aws v2.70.0
+ provider.external v1.2.0
+ provider.helm v1.2.4
+ provider.kubernetes v1.12.0
+ provider.local v1.4.0
+ provider.null v2.1.2
+ provider.random v2.3.0
+ provider.template v2.1.2
+ provider.tls v2.2.0

Tokynet on 11 Aug 2020

I found my issue: I had a null_resource creating an IngressRoute, which in turn created more resources. Although I was running terraform destroy in the directory that created these resources, the null_resource was only for creating, so it had no way to destroy what it created.

Tokynet on 14 Aug 2020
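
One way to avoid the situation Tokynet describes is to pair the creation step with a destroy-time provisioner. This is a sketch under assumptions, not Tokynet's actual configuration: the manifest file name is hypothetical, kubectl is assumed to be installed and pointed at the cluster, and the cluster must still be reachable when the null_resource is destroyed.

```hcl
resource "null_resource" "ingress_route" {
  # Keep the manifest path in triggers so the destroy provisioner
  # can read it through self.
  triggers = {
    manifest = "${path.module}/ingressroute.yaml" # hypothetical file
  }

  # Runs once when the resource is created.
  provisioner "local-exec" {
    command = "kubectl apply -f ${self.triggers.manifest}"
  }

  # Runs when the resource is destroyed, removing what the create step added.
  provisioner "local-exec" {
    when    = destroy
    command = "kubectl delete -f ${self.triggers.manifest} --ignore-not-found"
  }
}
```

Reading the path through self.triggers keeps the destroy provisioner free of references to other resources.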

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 12 Nov 2020

This issue has been automatically closed because it has not had recent activity since being marked as stale.

stale[bot] on 12 Dec 2020
