Consider the configuration below. When running terraform apply on a variant of this configuration that requires the IAM role to be replaced (changing its name will suffice), the actions are carried out in the order: IAM role create, IAM role destroy, then Lambda update. However, if the configuration is changed so that the IAM role is declared in the same module as the Lambda function, the order is IAM role create, Lambda update, IAM role destroy, as expected. In the module-based case there is therefore a period during which the Lambda's reference to the IAM role is dangling. This is made more obvious by the race conditions in IAM that can cause the Lambda update to fail and only succeed later on retry; in the intervening period the Lambda is defunct, since it still points to a role that has already been destroyed.
Terraform v0.11.5
/main.tf
provider "aws" {
region = "ap-southeast-2"
}
module "lambda_iam_role" {
source = "/role_module"
}
data "archive_file" "lambda_content" {
source_file = "${path.module}/my_test_function.py"
output_path = "${path.module}/my_test_function.zip"
type = "zip"
}
resource "aws_lambda_function" "my_test_function" {
function_name = "my_test_function"
handler = "my_test_function.lambda_handler"
role = "${module.lambda_iam_role.role_arn}"
runtime = "python2.7"
filename = "${data.archive_file.lambda_content.output_path}"
source_code_hash = "${data.archive_file.lambda_content.output_base64sha256}"
publish = true
}
/role_module/main.tf
data "aws_iam_policy_document" "trust_relationship_policy" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}
resource "aws_iam_role" "role" {
assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
name = "test2"
lifecycle {
create_before_destroy = true
}
}
output "role_arn" {
value = "${aws_iam_role.role.arn}"
}
Please note that the IAM race condition which causes the Lambda update to fail is not the subject of this issue (I've raised it in #3972). It's only an example of an intermediate failure that halts the infrastructure modification at the point where the dependency reference is dangling.
Terraform will perform the following actions:
~ aws_lambda_function.my_test_function
role: "arn:aws:iam::901561229270:role/test2" => "${module.lambda_iam_role.role_arn}"
-/+ module.lambda_iam_role.aws_iam_role.role (new resource required)
id: "test2" => <computed> (forces new resource)
arn: "arn:aws:iam::901561229270:role/test2" => <computed>
assume_role_policy: "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"lambda.amazonaws.com\"},\"Action\":\"sts:AssumeRole\"}]}" => "{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Sid\": \"\",\n \"Effect\": \"Allow\",\n \"Action\": \"sts:AssumeRole\",\n \"Principal\": {\n \"Service\": \"lambda.amazonaws.com\"\n }\n }\n ]\n}"
create_date: "2018-03-29T23:45:03Z" => <computed>
force_detach_policies: "false" => "false"
name: "test2" => "test1" (forces new resource)
path: "/" => "/"
unique_id: "AROAJVRUREPXPP2X3GSCC" => <computed>
Plan: 1 to add, 1 to change, 1 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
module.lambda_iam_role.aws_iam_role.role: Creating...
arn: "" => "<computed>"
assume_role_policy: "" => "{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Sid\": \"\",\n \"Effect\": \"Allow\",\n \"Action\": \"sts:AssumeRole\",\n \"Principal\": {\n \"Service\": \"lambda.amazonaws.com\"\n }\n }\n ]\n}"
create_date: "" => "<computed>"
force_detach_policies: "" => "false"
name: "" => "test1"
path: "" => "/"
unique_id: "" => "<computed>"
module.lambda_iam_role.aws_iam_role.role: Creation complete after 2s (ID: test1)
module.lambda_iam_role.aws_iam_role.role.deposed: Destroying... (ID: test2)
aws_lambda_function.my_test_function: Modifying... (ID: my_test_function)
role: "arn:aws:iam::901561229270:role/test2" => "arn:aws:iam::901561229270:role/test1"
module.lambda_iam_role.aws_iam_role.role.deposed: Destruction complete after 2s
Error: Error applying plan:
1 error(s) occurred:
* aws_lambda_function.my_test_function: 1 error(s) occurred:
* aws_lambda_function.my_test_function: Error modifying Lambda Function Configuration my_test_function: InvalidParameterValueException: The role defined for the function cannot be assumed by Lambda.
status code: 400, request id: 6e1bb062-33ac-11e8-9cdc-0bb5c2732e67
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
Expected Behavior: Using create_before_destroy = true on a dependency of a resource causes a new copy of the dependency to be created before the old one is destroyed. The dependent resource's reference to it should be updated between the creation of the new dependency and the destruction of the old one.
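As a minimal illustration of that expected ordering (this sketch and its names are illustrative, using the null provider so it stands alone; it is not part of the reported configuration):

resource "null_resource" "dependency" {
  triggers = {
    version = "1"               # change this value to force replacement
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "null_resource" "dependent" {
  triggers = {
    dependency_id = "${null_resource.dependency.id}"
  }
}

# Expected apply order when null_resource.dependency is replaced:
#   1. create the new null_resource.dependency
#   2. update/recreate null_resource.dependent so it refers to the new id
#   3. destroy the old (deposed) null_resource.dependency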
Actual Behavior: The new dependency is created and the old one is destroyed, both before the dependent resource's reference to it is updated.
Hi @samjgalbraith,
Thanks for the complete configuration example to replicate the issue here!
If you run this and watch the logs, you can see that the order _is_ correct, and the aws_lambda_function is being updated with the correct arn in the POST data. The problem is that the arn is not immediately available in the remote API, which can be tested by adding a delay after the role creation.
This is an eventual consistency issue with the IAM role itself, and a workaround to help with the issue has already been merged into the aws provider.
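For example, one rough way to add such a delay for testing (this is only a sketch of the test, not the workaround that was merged into the provider) is a local-exec sleep on the role:

resource "aws_iam_role" "role" {
  assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
  name               = "test2"

  # Crude propagation delay so the new role is visible via the IAM API before
  # the dependent Lambda function is updated to reference it.
  provisioner "local-exec" {
    command = "sleep 15"
  }

  lifecycle {
    create_before_destroy = true
  }
}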
Thanks!
Thanks for looking at that, @jbardin. It does seem like you're right about the order when looking at the logs. One thing that still bothers me is that the old role is definitely destroyed even though the update fails. Surely the destruction of the dependency should only happen after the references to it have been successfully updated.
I'd have to think about whether it would be possible to move the destroy node for the deposed resource later in the graph without causing cycles in all cases. The problem, however, is that once one of the old dependencies fails and terraform errors out, it no longer references the old destroy node, and there would be no way to remove it in the same order in a subsequent run, so the benefit is minimal.
Also, removing it later wouldn't change the result here at all, because it's not the removal of the old resource that is causing the failure; it's the fact that the new one hasn't fully propagated across the remote API endpoint.
You're right that the IAM race condition is what's causing the Terraform apply failure here, but that's not the problem I care about in this issue. It probably got lost in the wall of text, but I mentioned it above the terraform output block. I understand the main point of create_before_destroy is to transform infrastructure such that at every step it remains valid and available. I regret using this example because it's a bit of a red herring: it's just one example of a failure that errors the apply partway through, revealing that the plan allows the infrastructure to be broken - and I mean actually broken, such that the Lambda function won't execute anymore until you intervene and re-apply, not just that the plan failed to complete. This is a problem bigger than the IAM race condition. The plan is bad.
I've just output the graphs for the plan and they're definitely consistent with my complaint. When the role is created in the same module as the Lambda, the role destruction waits for a successful Lambda update, exactly as I'd expect for zero downtime (even if a step fails): there's a single branch through the plan graph that goes role create, then Lambda update, then role destroy, and no failure in any step leaves the infrastructure in a broken state. Even though the apply still fails in this case because of the IAM race condition, crucially, the infrastructure is never broken at any point.
When the role is declared in a child module and its ARN exposed as a module output, destroying the old role does not block on a successful Lambda update - the role destroy sits in a plan branch parallel to the Lambda update. Something about being in a different module causes the plan graph to be different. The single-module configuration follows, then its plan graph, and then the plan graph for the module-based configuration.
provider "aws" {
region = "ap-southeast-2"
}
data "archive_file" "lambda_content" {
source_file = "${path.module}/my_test_function.py"
output_path = "${path.module}/my_test_function.zip"
type = "zip"
}
resource "aws_lambda_function" "my_test_function" {
function_name = "my_test_function"
handler = "my_test_function.lambda_handler"
role = "${aws_iam_role.role.arn}"
runtime = "python2.7"
filename = "${data.archive_file.lambda_content.output_path}"
source_code_hash = "${data.archive_file.lambda_content.output_base64sha256}"
publish = true
}
data "aws_iam_policy_document" "trust_relationship_policy" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}
resource "aws_iam_role" "role" {
assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
name = "test1"
lifecycle {
create_before_destroy = true
}
}
digraph {
  compound = "true"
  newrank = "true"

  subgraph "root" {
    "[root] aws_iam_role.role" [label = "create role", shape = "box"]
    "[root] aws_iam_role.role (destroy)" [label = "destroy role", shape = "box"]
    "[root] aws_lambda_function.my_test_function" [label = "modify lambda", shape = "box"]
    "[root] provider.aws" [label = "provider.aws", shape = "diamond"]
    "[root] aws_iam_role.role (destroy)" -> "[root] aws_lambda_function.my_test_function"
    "[root] aws_iam_role.role" -> "[root] provider.aws"
    "[root] aws_lambda_function.my_test_function" -> "[root] aws_iam_role.role"
    "[root] meta.count-boundary (count boundary fixup)" -> "[root] aws_iam_role.role (destroy)"
    "[root] provider.aws (close)" -> "[root] aws_iam_role.role (destroy)"
    "[root] root" -> "[root] meta.count-boundary (count boundary fixup)"
    "[root] root" -> "[root] provider.aws (close)"
  }
}
digraph {
  compound = "true"
  newrank = "true"

  subgraph "root" {
    "[root] aws_lambda_function.my_test_function" [label = "modify lambda", shape = "box"]
    "[root] module.lambda_iam_role.aws_iam_role.role" [label = "create role", shape = "box"]
    "[root] module.lambda_iam_role.aws_iam_role.role (destroy)" [label = "destroy role", shape = "box"]
    "[root] provider.aws" [label = "provider.aws", shape = "diamond"]
    "[root] aws_lambda_function.my_test_function" -> "[root] module.lambda_iam_role.output.role_arn"
    "[root] meta.count-boundary (count boundary fixup)" -> "[root] aws_lambda_function.my_test_function"
    "[root] meta.count-boundary (count boundary fixup)" -> "[root] module.lambda_iam_role.aws_iam_role.role (destroy)"
    "[root] module.lambda_iam_role.aws_iam_role.role (destroy)" -> "[root] module.lambda_iam_role.aws_iam_role.role"
    "[root] module.lambda_iam_role.aws_iam_role.role" -> "[root] provider.aws"
    "[root] module.lambda_iam_role.output.role_arn" -> "[root] module.lambda_iam_role.aws_iam_role.role"
    "[root] provider.aws (close)" -> "[root] aws_lambda_function.my_test_function"
    "[root] provider.aws (close)" -> "[root] module.lambda_iam_role.aws_iam_role.role (destroy)"
    "[root] root" -> "[root] meta.count-boundary (count boundary fixup)"
    "[root] root" -> "[root] provider.aws (close)"
  }
}
@samjgalbraith,
Thanks for the excellent analysis! I see exactly what you mean now, and since in most cases the effect is the same, it's gone unnoticed for quite a while.
Just recording a minimal example here for reference:
# main.tf
module "mod" {
  source = "./mod"
}

variable "exit" {
  default = 0
}

resource "null_resource" "b" {
  triggers = {
    "a" = "${module.mod.a_id}"
  }

  provisioner "local-exec" {
    command = "exit ${var.exit}"
  }
}

# ./mod/main.tf
output "a_id" {
  value = "${null_resource.a.id}"
}

resource "null_resource" "a" {
  lifecycle {
    create_before_destroy = true
  }
}
After the following commands, there should still be a deposed null_resource.a in the module state.
$ tf apply
$ tf taint -module mod null_resource.a
$ tf apply -var exit=1
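A rough way to confirm the leftover deposed instance afterwards (the commands and the state-file check are approximate, assuming the v3 state layout that 0.11 writes):

# A subsequent plan should still schedule the deposed instance for destruction,
$ tf plan
# and the raw state should still carry it under the resource's "deposed" list.
$ python -c "import json; print([n for m in json.load(open('terraform.tfstate'))['modules'] for n, r in m['resources'].items() if r.get('deposed')])"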
Since the CBD graph is not taking into account transitive dependencies through outputs and locals, the dependency inversion that CBD creates isn't extended up through modules.
@jbardin, any thoughts on this? Maybe something that'll be fixed in 0.12?
This is the root cause of https://github.com/terraform-providers/terraform-provider-google/issues/1448. Happy to provide debug logs and the like, though it seems like the problem is pretty well-defined at this point.
cc @brandoconnor since we were debugging this earlier today :)
Hi @danawillow,
Do I have thoughts!
This isn't intended to be covered at all by the changes in 0.12, but it's entirely possible that our graph dependency resolution changes will catch this as a side effect. However, I have a feeling it's still going to require a separate fix, and because of the scope of the pending changes it's not something I can take care of until after 0.12 lands.
Are there any plans to fix this? Looks like it still exists in 0.12
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.