Consider the configuration below. When running terraform apply on a variant of this configuration that requires the IAM role to be replaced (changing its name will suffice), the actions are carried out in the order: IAM role create, IAM role destroy, then Lambda update. However, if the configuration is changed so that the IAM role is declared in the same module as the Lambda function, the order is IAM role create, Lambda update, IAM role destroy, as expected. In the module-based case there is therefore a period during which the Lambda's reference to the IAM role is dangling. This is made more obvious by the race conditions in IAM that can cause the Lambda update to fail and only succeed later on retry; in the intervening period the Lambda is defunct, since it still points to a role that has already been destroyed.
Terraform v0.11.5
/main.tf
provider "aws" {
region = "ap-southeast-2"
}
module "lambda_iam_role" {
source = "/role_module"
}
data "archive_file" "lambda_content" {
source_file = "${path.module}/my_test_function.py"
output_path = "${path.module}/my_test_function.zip"
type = "zip"
}
resource "aws_lambda_function" "my_test_function" {
function_name = "my_test_function"
handler = "my_test_function.lambda_handler"
role = "${module.lambda_iam_role.role_arn}"
runtime = "python2.7"
filename = "${data.archive_file.lambda_content.output_path}"
source_code_hash = "${data.archive_file.lambda_content.output_base64sha256}"
publish = true
}
/role_module/main.tf
data "aws_iam_policy_document" "trust_relationship_policy" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}
resource "aws_iam_role" "role" {
assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
name = "test2"
lifecycle {
create_before_destroy = true
}
}
output "role_arn" {
value = "${aws_iam_role.role.arn}"
}
Please note that the IAM race condition which causes the Lambda update to fail is not the subject of this issue (I've raised it in #3972). It's only an example of an intermediate failure that halts the infrastructure modification at the point where the dependency reference is dangling.
Terraform will perform the following actions:
~ aws_lambda_function.my_test_function
role: "arn:aws:iam::901561229270:role/test2" => "${module.lambda_iam_role.role_arn}"
-/+ module.lambda_iam_role.aws_iam_role.role (new resource required)
id: "test2" => <computed> (forces new resource)
arn: "arn:aws:iam::901561229270:role/test2" => <computed>
assume_role_policy: "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"lambda.amazonaws.com\"},\"Action\":\"sts:AssumeRole\"}]}" => "{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Sid\": \"\",\n \"Effect\": \"Allow\",\n \"Action\": \"sts:AssumeRole\",\n \"Principal\": {\n \"Service\": \"lambda.amazonaws.com\"\n }\n }\n ]\n}"
create_date: "2018-03-29T23:45:03Z" => <computed>
force_detach_policies: "false" => "false"
name: "test2" => "test1" (forces new resource)
path: "/" => "/"
unique_id: "AROAJVRUREPXPP2X3GSCC" => <computed>
Plan: 1 to add, 1 to change, 1 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
module.lambda_iam_role.aws_iam_role.role: Creating...
arn: "" => "<computed>"
assume_role_policy: "" => "{\n \"Version\": \"2012-10-17\",\n \"Statement\": [\n {\n \"Sid\": \"\",\n \"Effect\": \"Allow\",\n \"Action\": \"sts:AssumeRole\",\n \"Principal\": {\n \"Service\": \"lambda.amazonaws.com\"\n }\n }\n ]\n}"
create_date: "" => "<computed>"
force_detach_policies: "" => "false"
name: "" => "test1"
path: "" => "/"
unique_id: "" => "<computed>"
module.lambda_iam_role.aws_iam_role.role: Creation complete after 2s (ID: test1)
module.lambda_iam_role.aws_iam_role.role.deposed: Destroying... (ID: test2)
aws_lambda_function.my_test_function: Modifying... (ID: my_test_function)
role: "arn:aws:iam::901561229270:role/test2" => "arn:aws:iam::901561229270:role/test1"
module.lambda_iam_role.aws_iam_role.role.deposed: Destruction complete after 2s
Error: Error applying plan:
1 error(s) occurred:
* aws_lambda_function.my_test_function: 1 error(s) occurred:
* aws_lambda_function.my_test_function: Error modifying Lambda Function Configuration my_test_function: InvalidParameterValueException: The role defined for the function cannot be assumed by Lambda.
status code: 400, request id: 6e1bb062-33ac-11e8-9cdc-0bb5c2732e67
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
Expected Behavior: Using create_before_destroy = true on a dependency of a resource causes a new copy of the dependency to be created before the old one is destroyed. The dependent resource's reference to it should be updated between the creation of the new dependency and the destruction of the old one.
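As a minimal illustration of that expected ordering (this sketch and its names are illustrative, using the null provider so it stands alone; it is not part of the reported configuration):

resource "null_resource" "dependency" {
  triggers = {
    version = "1"               # change this value to force replacement
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "null_resource" "dependent" {
  triggers = {
    dependency_id = "${null_resource.dependency.id}"
  }
}

# Expected apply order when null_resource.dependency is replaced:
#   1. create the new null_resource.dependency
#   2. update/recreate null_resource.dependent so it refers to the new id
#   3. destroy the old (deposed) null_resource.dependency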
Actual Behavior: The new dependency is created and the old one is destroyed, both before the dependent resource's reference to it is updated.
Hi @samjgalbraith,
Thanks for the complete configuration example to replicate the issue here!
If you run this and watch the logs, you can see that the order _is_ correct, and the aws_lambda_function is being updated with the correct arn in the POST data. The problem is that the arn is not immediately available in the remote API, which can be tested by adding a delay after the role creation.
This is an eventual consistency issue with the IAM role itself, and a workaround to help with the issue has already been merged into the aws provider.
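For example, one rough way to add such a delay for testing (this is only a sketch of the test, not the workaround that was merged into the provider) is a local-exec sleep on the role:

resource "aws_iam_role" "role" {
  assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
  name               = "test2"

  # Crude propagation delay so the new role is visible via the IAM API before
  # the dependent Lambda function is updated to reference it.
  provisioner "local-exec" {
    command = "sleep 15"
  }

  lifecycle {
    create_before_destroy = true
  }
}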
Thanks!
Thanks for looking at that, @jbardin. It does seem like you're right about the order when looking at the logs. One thing that still bothers me is that the old role is definitely destroyed even though the update fails. Surely the destruction of the dependency should only happen after the references to it have been successfully updated.
I'd have to think about whether it would be possible to move the destroy node for the deposed resource later in the graph without causing cycles in all cases. The problem, however, is that once one of the old dependencies fails and terraform errors out, it no longer references the old destroy node, and there would be no way to remove it in the same order in a subsequent run, so the benefit is minimal.
Also, removing it later wouldn't change the result here at all, because it's not the removal of the old resource that is causing the failure; it's the fact that the new one hasn't fully propagated across the remote API endpoint.
You're right that the IAM race condition is what's causing the Terraform apply failure here, but that's not the problem I care about in this issue. It probably got lost in the wall of text, but I mentioned it above the terraform output block. I understand the main point of create_before_destroy is to transform infrastructure such that at every step it remains valid and available. I regret using this example because it's a bit of a red herring: it's just one example of a failure that errors the apply partway through, revealing that the plan allows the infrastructure to be broken - and I mean actually broken, such that the Lambda function won't execute anymore until you intervene and re-apply, not just that the plan failed to complete. This is a problem bigger than the IAM race condition. The plan is bad.
I've just output the graphs for the plan and they're definitely consistent with my complaint. When the role is created in the same module as the Lambda, the role destruction waits for a successful Lambda update, exactly as I'd expect for zero downtime (even if a step fails): there's a single branch through the plan graph that goes role create, then Lambda update, then role destroy, and no failure in any step leaves the infrastructure in a broken state. Even though the apply still fails in this case because of the IAM race condition, crucially, the infrastructure is never broken at any point.
When the role is declared in a child module and its ARN exposed as a module output, destroying the old role does not block on a successful Lambda update - the role destroy sits in a plan branch parallel to the Lambda update. Something about being in a different module causes the plan graph to be different. The single-module configuration follows, then its plan graph, and then the plan graph for the module-based configuration.
provider "aws" {
region = "ap-southeast-2"
}
data "archive_file" "lambda_content" {
source_file = "${path.module}/my_test_function.py"
output_path = "${path.module}/my_test_function.zip"
type = "zip"
}
resource "aws_lambda_function" "my_test_function" {
function_name = "my_test_function"
handler = "my_test_function.lambda_handler"
role = "${aws_iam_role.role.arn}"
runtime = "python2.7"
filename = "${data.archive_file.lambda_content.output_path}"
source_code_hash = "${data.archive_file.lambda_content.output_base64sha256}"
publish = true
}
data "aws_iam_policy_document" "trust_relationship_policy" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["lambda.amazonaws.com"]
}
}
}
resource "aws_iam_role" "role" {
assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
name = "test1"
lifecycle {
create_before_destroy = true
}
}
digraph {
  compound = "true"
  newrank = "true"

  subgraph "root" {
    "[root] aws_iam_role.role" [label = "create role", shape = "box"]
    "[root] aws_iam_role.role (destroy)" [label = "destroy role", shape = "box"]
    "[root] aws_lambda_function.my_test_function" [label = "modify lambda", shape = "box"]
    "[root] provider.aws" [label = "provider.aws", shape = "diamond"]
    "[root] aws_iam_role.role (destroy)" -> "[root] aws_lambda_function.my_test_function"
    "[root] aws_iam_role.role" -> "[root] provider.aws"
    "[root] aws_lambda_function.my_test_function" -> "[root] aws_iam_role.role"
    "[root] meta.count-boundary (count boundary fixup)" -> "[root] aws_iam_role.role (destroy)"
    "[root] provider.aws (close)" -> "[root] aws_iam_role.role (destroy)"
    "[root] root" -> "[root] meta.count-boundary (count boundary fixup)"
    "[root] root" -> "[root] provider.aws (close)"
  }
}
digraph {
  compound = "true"
  newrank = "true"

  subgraph "root" {
    "[root] aws_lambda_function.my_test_function" [label = "modify lambda", shape = "box"]
    "[root] module.lambda_iam_role.aws_iam_role.role" [label = "create role", shape = "box"]
    "[root] module.lambda_iam_role.aws_iam_role.role (destroy)" [label = "destroy role", shape = "box"]
    "[root] provider.aws" [label = "provider.aws", shape = "diamond"]
    "[root] aws_lambda_function.my_test_function" -> "[root] module.lambda_iam_role.output.role_arn"
    "[root] meta.count-boundary (count boundary fixup)" -> "[root] aws_lambda_function.my_test_function"
    "[root] meta.count-boundary (count boundary fixup)" -> "[root] module.lambda_iam_role.aws_iam_role.role (destroy)"
    "[root] module.lambda_iam_role.aws_iam_role.role (destroy)" -> "[root] module.lambda_iam_role.aws_iam_role.role"
    "[root] module.lambda_iam_role.aws_iam_role.role" -> "[root] provider.aws"
    "[root] module.lambda_iam_role.output.role_arn" -> "[root] module.lambda_iam_role.aws_iam_role.role"
    "[root] provider.aws (close)" -> "[root] aws_lambda_function.my_test_function"
    "[root] provider.aws (close)" -> "[root] module.lambda_iam_role.aws_iam_role.role (destroy)"
    "[root] root" -> "[root] meta.count-boundary (count boundary fixup)"
    "[root] root" -> "[root] provider.aws (close)"
  }
}
@samjgalbraith,
Thanks for the excellent analysis! I see exactly what you mean now, and since in most cases the effect is the same, it's gone unnoticed for quite a while.
Just recording a minimal example here for reference:
# main.tf
module "mod" {
  source = "./mod"
}

variable "exit" {
  default = 0
}

resource "null_resource" "b" {
  triggers = {
    "a" = "${module.mod.a_id}"
  }

  provisioner "local-exec" {
    command = "exit ${var.exit}"
  }
}

# ./mod/main.tf
output "a_id" {
  value = "${null_resource.a.id}"
}

resource "null_resource" "a" {
  lifecycle {
    create_before_destroy = true
  }
}
After the following commands, there should still be a deposed null_resource.a in the module state.
$ tf apply
$ tf taint -module mod null_resource.a
$ tf apply -var exit=1
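A rough way to confirm the leftover deposed instance afterwards (the commands and the state-file check are approximate, assuming the v3 state layout that 0.11 writes):

# A subsequent plan should still schedule the deposed instance for destruction,
$ tf plan
# and the raw state should still carry it under the resource's "deposed" list.
$ python -c "import json; print([n for m in json.load(open('terraform.tfstate'))['modules'] for n, r in m['resources'].items() if r.get('deposed')])"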
Since the CBD graph is not taking into account transitive dependencies through outputs and locals, the dependency inversion that CBD creates isn't extended up through modules.
@jbardin, any thoughts on this? Maybe something that'll be fixed in 0.12?
This is the root cause of https://github.com/terraform-providers/terraform-provider-google/issues/1448. Happy to provide debug logs and the like, though it seems like the problem is pretty well-defined at this point.
cc @brandoconnor since we were debugging this earlier today :)
Hi @danawillow,
Do I have thoughts!
This isn't intended to be covered at all by the changes in 0.12, but it's entirely possible that our graph dependency resolution changes will catch this as a side effect. However, I have a feeling it's still going to require a separate fix, and because of the scope of the pending changes it's not something I can take care of until after 0.12 lands.
Are there any plans to fix this? Looks like it still exists in 0.12
I'm going to lock this issue because it has been closed for _30 days_ ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.