Terraform 0.11.8 with AWS provider 1.30
I realize these aren't the latest versions, but I don't think this resource has changed recently and I haven't had a chance to test on newer releases yet.
Affected resources: aws_emr_cluster, aws_emr_security_configuration. If I have Terraform manage my aws_emr_cluster and pass a terraform-managed aws_emr_security_configuration into that cluster, terraform destroy consistently fails to destroy the security configuration.
Expected behavior: terraform destroy successfully cleans up all the resources it created. Instead, the destroy fails with:
```
Error: Error applying plan:
1 error(s) occurred:
* aws_emr_security_configuration.main (destroy): 1 error(s) occurred:
* aws_emr_security_configuration.main: InvalidRequestException: Security configuration 'tf-emr-sc-20181022205505705400000001' cannot be deleted because it is in use by active clusters.
    status code: 400, request id: e6338f23-d936-12e8-ad83-7bea3842861a
```
Steps to reproduce: create an aws_emr_cluster and give it an aws_emr_security_configuration, then run terraform apply followed by terraform destroy. I think Terraform is just not waiting long enough before attempting to destroy the security configuration; if I try the destroy again a minute or two later, it works fine.
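The wiring is essentially the following (a minimal sketch: the names, roles, subnet, instance types and security configuration JSON are placeholders, not my real configuration):

```
resource "aws_emr_security_configuration" "main" {
  name          = "tf-emr-sc-example"
  configuration = "${file("security-configuration.json")}"
}

resource "aws_emr_cluster" "main" {
  name          = "tf-emr-cluster-example"
  release_label = "emr-5.17.0"
  applications  = ["Spark"]
  service_role  = "EMR_DefaultRole"

  # Passing the terraform-managed security configuration into the cluster
  # is what creates the dependency that trips up terraform destroy.
  security_configuration = "${aws_emr_security_configuration.main.name}"

  master_instance_type = "m4.large"
  core_instance_type   = "m4.large"
  core_instance_count  = 1

  ec2_attributes {
    subnet_id        = "subnet-00000000"
    instance_profile = "EMR_EC2_DefaultRole"
  }
}
```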
It doesn't look like the situation has changed, from a quick look at: https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_emr_security_configuration.go#L103
To fix, we can probably just add a resource.Retry() handler to the Delete function there that retries for a minute or two:
```
input := &emr.DeleteSecurityConfigurationInput{
    Name: aws.String(d.Id()),
}

err := resource.Retry(1*time.Minute, func() *resource.RetryError {
    _, err := conn.DeleteSecurityConfiguration(input)
    // Already deleted: nothing left to do.
    if isAWSErr(err, "InvalidRequestException", "does not exist") {
        return nil
    }
    // The cluster is still terminating on the EMR side: retry until the timeout expires.
    if isAWSErr(err, "InvalidRequestException", "cannot be deleted because it is in use by active clusters") {
        return resource.RetryableError(err)
    }
    if err != nil {
        return resource.NonRetryableError(err)
    }
    return nil
})
if err != nil {
    return fmt.Errorf("error deleting EMR Security Configuration (%s): %s", d.Id(), err)
}
```
Then, for an acceptance test, just create a test configuration that creates a security configuration and a cluster that uses it. 👍
I'm wondering if the problem is actually pretty inconsistent, though: TestAccAWSEMRCluster_security_config already does something like the test configuration I mention, and looking through the last 6 months of our daily acceptance testing I can't find a failure with the above error.
Hmm, I can try to reduce my setup to a proper regression test then. I assumed it had nothing to do with the rest of my cluster configuration, but maybe it does if you don't see the issue; it's definitely happening 100% of the time here. Given how long EMR clusters take to spin up and down, it'll probably take me a bit to find what's going wrong, but I'll try to post back with some actual Terraform configuration.
We notice the same behaviour; these are the versions we are using:
```
Initializing provider plugins...
- Checking for available provider plugins on https://releases.hashicorp.com...
- Downloading plugin for provider "null" (1.0.0)...
- Downloading plugin for provider "aws" (1.51.0)...
- Downloading plugin for provider "template" (1.0.0)...
```
We ended up using a sleep 100 to mitigate the issue, which is not ideal, and we would also like to see this fixed. 👍
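For illustration, such a delay can be wired in with a destroy-time local-exec provisioner on the security configuration; the sketch below shows the general idea rather than our exact setup:

```
resource "aws_emr_security_configuration" "main" {
  # ... existing arguments ...

  # Crude mitigation: give EMR time to finish terminating the cluster
  # before Terraform tries to delete this security configuration.
  provisioner "local-exec" {
    when    = "destroy"
    command = "sleep 100"
  }
}
```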
Would love to see this get fixed. Here is a non-sleep workaround for anyone who is interested; it requires the aws CLI and jq.
resource "aws_emr_cluster" "my_cluster" {
...
provisioner "local-exec" {
when = "destroy"
command = "echo ${aws_emr_cluster.my_cluster.id} > cluster_id.txt"
}
}
resource "aws_emr_security_configuration" "my_security" {
...
provisioner "local-exec" {
when = "destroy"
command = "while [ ! `aws emr describe-cluster --cluster-id $(cat cluster_id.txt) | jq 'any(.Cluster.Status.State; contains(\"TERMINATED\"))' | grep true` ]; do sleep 5; done"
}
}
I don't know if this is related or not, but we're observing "terraform destroy" jobs involving EMR clusters returning as "completed" while the cluster is still in the "Terminating" (as opposed to "Terminated") state.
Poking into this some more, I wonder if this is because the EMR cluster delete method only waits for the cluster to have zero running instances, not for AWS to report the cluster as terminated; I think that behaviour was introduced in f7405d0773e9ba50b5ed1072b7e35501058ab786. This likely causes Terraform to think the cluster has been terminated and that the security configuration can be deleted, when from the AWS side the cluster still exists and the security configuration cannot yet be deleted.
Any thoughts on changing the cluster deletion wait to wait for EMR to report the state as terminated, rather than for it to have zero running instances?
> Any thoughts on changing the cluster deletion wait to wait for EMR to report the state as terminated, rather than for it to have zero running instances?
Sounds like a great idea. 👍
FYI, we've been using a provider patched with the code in #12578 and it seems like it has fixed this issue for us.
We actually had to get around this issue by adding a create_before_destroy lifecycle policy to the aws_emr_security_configuration resource.
```
resource "aws_emr_security_configuration" "my_config" {
  ...

  lifecycle {
    create_before_destroy = true
  }
}
```
> We actually had to get around this issue by adding a create_before_destroy lifecycle policy to the aws_emr_security_configuration resource.
@ashkan3 this didn't seem to work for me? Still getting the dependency error when destroying....