Terraform-provider-aws: terraform attempts to destroy AWS ECS cluster before Deleting ECS Service

Created on 16 Jun 2018 · 21 comments · Source: hashicorp/terraform-provider-aws

_This issue was originally opened by @jaloren as hashicorp/terraform#18263. It was migrated here as a result of the provider split. The original body of the issue is below._


I am using the aws_cloudformation_stack resource to provision an AWS Elastic Container Service (ECS) cluster and one or more services in that cluster. I used terraform graph -type=plan-destroy to verify that I successfully set up a dependency relationship in Terraform between the TF resource that creates the service and the TF resource that creates the ECS cluster.

According to graphviz, the service is a child node of the ECS cluster node. Given that, I expect TF to delete the service first and then delete the cluster. However, these steps seem to happen out of order, which causes the deletion of the ECS cluster to fail, since you can't delete a cluster that still has services in it.

Terraform Version

Terraform v0.11.8

Expected Behavior

Terraform successfully deletes the AWS ECS cluster and its associated services.

Actual Behavior

Terraform successfully deleted the service in the ECS cluster but failed to delete the ECS cluster itself with the following error:

* aws_cloudformation_stack.ecs-cluster: DELETE_FAILED: ["The following resource(s) failed to delete: [ECSCluster]. " "The Cluster cannot be deleted while Services are active. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsServicesException; Request ID: 7bcbeae4-70ab-11e8-bd0b-3d3254c7f7d3)"]

Steps to Reproduce

Please list the full steps required to reproduce the issue, for example:

  1. terraform init
  2. terraform plan
  3. terraform apply
Labels: bug, service/ecs

Most helpful comment

+1 having the same issue here. Latest version on Terraform Cloud

All 21 comments

This should work if we stop the ECS services before trying to delete the ECS cluster.

@avengers009 you're right, but ideally Terraform should be able to schedule these actions accordingly where possible, or, if that's not possible, the user should be able to hint Terraform via depends_on. TL;DR: users shouldn't need to manually touch the infrastructure in order to run apply or destroy successfully.
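As a rough illustration of that depends_on hint (resource names here are hypothetical, written in the 0.11-style syntax used elsewhere in this thread; note that referencing the cluster's id from the service already creates an implicit dependency):

resource "aws_ecs_cluster" "example" {
  name = "example"
}

resource "aws_ecs_task_definition" "example" {
  family                = "example"
  container_definitions = <<DEFINITION
[
  {
    "name": "example",
    "image": "nginx:latest",
    "cpu": 64,
    "memory": 128,
    "essential": true
  }
]
DEFINITION
}

resource "aws_ecs_service" "example" {
  name            = "example"
  cluster         = "${aws_ecs_cluster.example.id}"
  task_definition = "${aws_ecs_task_definition.example.arn}"
  desired_count   = 1

  # Explicit hint so the service is destroyed before the cluster.
  depends_on = ["aws_ecs_cluster.example"]
}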

@jaloren Do you mind sharing the configs with us, to help us understand the relationships between the resources and allow us to reproduce the problem?

Thanks.

I am also seeing this issue:

Error: Error applying plan:

1 error(s) occurred:

* aws_ecs_cluster.ecs (destroy): 1 error(s) occurred:

* aws_ecs_cluster.ecs: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.
    status code: 400, request id: 30e1e812-854c-11e8-bec1-397064633d2b

Here is my configuration:

ecs_service

resource "aws_ecs_service" "authenticator" {
  name            = "authenticator"
  cluster         = "${aws_ecs_cluster.ecs.id}"
  task_definition = "${aws_ecs_task_definition.authenticator.arn}"
  desired_count   = 2

  load_balancer {
    target_group_arn = "${aws_lb_target_group.authenticator.arn}"
    container_name   = "authenticator"
    container_port   = 3030
  }
}

ecs_cluster

resource "aws_ecs_cluster" "ecs" {
  name = "${local.safe_name_prefix}"
}

@Kartstig is that error occurring for you after 10 minutes or so of trying?

Yes, it does. I usually attempt the destroy twice to account for any timeouts.

I'm seeing very similar behavior with Terraform 0.11.7 / AWS provider 1.19. I frequently (but not every time) see this:

00:12:27.512 aws_ecs_cluster.ecs_cluster: Still destroying... (ID: arn:aws:ecs:us-east-1:<MYACCOUNT>:cluster/my-service, 9m50s elapsed)
00:12:36.041 
00:12:36.042 Error: Error applying plan:
00:12:36.043 
00:12:36.044 1 error(s) occurred:
00:12:36.045 
00:12:36.045 * aws_ecs_cluster.ecs_cluster (destroy): 1 error(s) occurred:
00:12:36.046 
00:12:36.046 * aws_ecs_cluster.ecs_cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining.
00:12:36.047    status code: 400, request id: b920a9e3-8b45-11e8-8e1a-0751c6fe0d1a

@radeksimko I am not sure how much of the configs you would like to see. It's a little bit involved, but here's the key part of the main.tf in the root module.

Each module is nothing but a wrapper for a CloudFormation template. So by referring to the output of one module as an input to another, I am establishing a dependency between the two resources encapsulated in those modules. Ergo, on a destroy I am expecting the cluster to be deleted after the service, since the service depends on the cluster.

module "public_load_balancer" {
  source         = "../../modules/aws/network/load_balancer/alb"
  environment    = "${var.environment}"
  security_group = "${module.network_acls.load_balancer_security_group}"
  vpc            = "${module.network.vpc}"
  subnets        = "${module.network.public_one_subnet},${module.network.public_two_subnet}"
}

module "ecs_cluster" {
  source      = "../../modules/aws/ecs/cluster"
  environment = "${var.environment}"
}

module "log_group" {
  source        = "../../modules/aws/logs/log_group"
  environment   = "${var.environment}"
  log_retention = 3
}

module "ecs_application" {
  source                   = "../../modules/aws/ecs/services/ecsapp"
  subnets                  = "${module.network.ecs_traffic_one},${module.network.ecs_traffic_two}"
  target_group             = "${module.public_load_balancer.enrollment_api_target_group}"
  environment              = "${var.environment}"
  security_group           = "${module.network_acls.container_security_group}"
  vpc                      = "${module.network.vpc}"
  tag                      = "v1.0.0"
  log_group                = "${module.log_group.id}"
  cluster_name             = "${module.ecs_cluster.name}"
}

Any update on this issue? Is there a plan to fix it? Or at least to provide/output a machine-readable list of services to be destroyed before destroying the instances?

I think Terraform should stop/terminate the instances as part of the destroy process; right now you have to manually terminate the instances in order for the destroy action to finish.

Hey, we are trying to automate this destruction of instances instead of doing it manually. Is there a recommended way to automate this? Our application code is in Java.

One way to do this could be to parse the plan generated by the terraform destroy command. Can you help us find a way to parse the Terraform plan to identify which instances/clusters need to be destroyed?

You can prevent that situation by splitting your Terraform project into at least two projects; you can use remote_state for that. If you put the ECS cluster and the service creation in two different projects, then when you want to destroy, you first run the destroy process for the service, after which the ECS cluster can be destroyed without any problem.
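A rough sketch of that split, assuming the cluster project stores its state in S3 and exports a cluster_arn output; the bucket, key, region, and output names are placeholders:

data "terraform_remote_state" "ecs_cluster" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"            # placeholder
    key    = "ecs-cluster/terraform.tfstate" # placeholder
    region = "us-east-1"
  }
}

resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = data.terraform_remote_state.ecs_cluster.outputs.cluster_arn
  task_definition = aws_ecs_task_definition.app.arn # defined in this (service-side) project
  desired_count   = 1
}

With this layout, terraform destroy is run in the service project first and then in the cluster project, so the ordering problem never arises.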

Is there any solution here? Terraform was working great for me, and now I'm having the same error, "The Cluster cannot be deleted while Services are active", and I don't understand why I need to manually stop/terminate the instances...

I am seeing this with 0.12.7 in my company's production environment intermittently. Is there any way to specify a "depends_on" or "teardown_first" which works for teardown?

I am still seeing this on the latest version...

I'm here for the same issue - has anyone found a workaround? Or can anyone confirm that this _sometimes_ works (even after n retries)? Otherwise, it seems the aws_ecs_service resource is broken. The core promise is that terraform apply followed by terraform destroy will just work.

Hoping to better understand if this _never_ works or if it's just a retry/interim issue or an issue particular to a set of configs.

UPDATE: In my particular instance, I can confirm that upon retry, terraform destroy does not list the ECS cluster as something to be destroyed - meaning the destroy of the ECS service failed at some point but was logged as destroyed anyway. (Or conversely, I guess, it could have been created and not correctly confirmed as created.) I will post back here if I have additional test results.

+1 having the same issue here. Latest version on Terraform Cloud

I have the same issue with terraform 0.12.19

Hey everyone, I'm using AWS CloudFormation and I'm experiencing this issue as well. I'm currently suspecting that it's not an issue with either CloudFormation or Terraform, but possibly with the underlying EC2 AMI. I'm using the Amazon Linux 2 AMI, while an example I'm referencing is using Amazon Linux 1, and the latter deletes fine while my former does not (even with an explicit DependsOn and Refs sprinkled throughout). There were a good number of changes to Amazon Linux 2, which I'm guessing may have included a change to cfn-bootstrap which might impact /opt/aws/cfn-signal behaviors. I haven't tested this out though.

Not sure if this is the right place to complain, but probably the same issue here:

Error: Error draining autoscaling group: Group still has 1 instances

Error: Error deleting ECS cluster: ClusterContainsContainerInstancesException: The Cluster cannot be deleted while Container Instances are active or draining

Surprisingly, two observations:

  • I logged into the AWS console and noticed that the ECS instance was in the "Active" state, but I was able to remove the ECS cluster immediately, without any warning or error! That EC2 instance kept working until I terminated it manually.
  • somehow, sometimes, it worked before!

This Terraform v0.12.20 code is being used:

data "aws_ami" "amazon2_ecs_optimized" {}

resource "aws_launch_template" "this" {}

resource "aws_autoscaling_group" "this" {}

resource "aws_ecs_task_definition" "this" {}

resource "aws_ecs_service" "default" {
  #  ...
  depends_on = [
    # consider note at https://www.terraform.io/docs/providers/aws/r/ecs_service.html
    aws_iam_role_policy.ecs_service
  ]
  # ...
}

resource "aws_ecs_cluster" "application" {}

P.S. I will try to build a workaround with a null_resource and a local-exec provisioner using a when = destroy strategy, running the AWS CLI to find and deregister ECS EC2 instances... but it's sad in terms of "reliable" cloud services.
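A minimal sketch of that idea, assuming the aws_ecs_cluster.application resource from the snippet above and an AWS CLI configured on the machine running Terraform; the resource name and commands are illustrative, not a tested implementation:

resource "null_resource" "drain_ecs_instances" {
  triggers = {
    cluster_name = aws_ecs_cluster.application.name
  }

  # Destroy-time provisioners should only reference attributes via self,
  # so the cluster name is captured in triggers above. Because this resource
  # depends on the cluster, it is destroyed first and deregisters any
  # remaining container instances before the cluster delete is attempted.
  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      for arn in $(aws ecs list-container-instances --cluster "${self.triggers.cluster_name}" --query 'containerInstanceArns[]' --output text); do
        aws ecs deregister-container-instance --cluster "${self.triggers.cluster_name}" --container-instance "$arn" --force
      done
    EOT
  }
}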

I have also faced the exact same issue as raised by @mikalai-t.
@mikalai-t, would you like to share what steps you followed as a workaround?

I still haven't implemented a workaround, but... I noticed that sometimes even the termination process took a while, so I assumed our application becomes unresponsive and consumes too much CPU, and therefore the EC2 instance fails to respond in time.
I just configured t3a.small instead of t3a.micro and the issue hasn't appeared since then. Not sure if this is a final solution, but you can start by analyzing your application's behavior on a different instance type.
I would also recommend checking the instance's protect-from-scale-in setting (sketched below). I had a similar issue when I stopped using an ECS Capacity Provider and forgot to set this setting back to false.
BTW, even with a capacity provider configured in the cluster, I faced timeouts when destroying the ASG, but after a couple of repeated attempts it was always successful.
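For reference, a hedged sketch of where that scale-in protection setting lives; the resource name, sizes, and subnet variable are placeholders rather than values from the configuration above:

resource "aws_autoscaling_group" "ecs" {
  name                = "ecs-asg"
  min_size            = 1
  max_size            = 3
  vpc_zone_identifier = var.private_subnet_ids # placeholder

  # Should be false (the default) when no ECS capacity provider manages this
  # group; leaving it true can keep container instances registered and, in
  # turn, block cluster deletion.
  protect_from_scale_in = false

  launch_template {
    id      = aws_launch_template.this.id
    version = "$Latest"
  }
}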
