This is a bit of a follow-up to https://github.com/terraform-aws-modules/terraform-aws-eks/issues/140
@max-rocket-internet
@laverya
Two related issues stem from PR-137
https://github.com/terraform-aws-modules/terraform-aws-eks/pull/137
Issue 1: Direct use of uncreated resource in workers_group_defaults_defaults during destroy:
module.eks.local.workers_group_defaults_defaults: local.workers_group_defaults_defaults: Resource 'aws_iam_role.workers' does not have attribute 'id' for variable 'aws_iam_role.workers.id'
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/local.tf#L33
Example Fix: iam_role_id = "${element(concat(aws_iam_role.workers.*.id, list("")), 0)}" # Use the specified IAM role if set.
Issue 2: Race Condition
Fixing issue 1 with the element selector work around (or even empty string) allows a test kitchen run to proceed, but exposes a race condition with applying the aws auth config map, attempting to apply it before the cluster is actually created in certain scenarios because the templates no longer have a direct dependency on an existing aws_iam_role.workers or the cluster itself.
v1.6.0 implicitly ordered resource execution properly to ensure
aws_eks_cluster -> eks.aws_iam_role.workers -> local_file.kubeconfig -> module.eks.eks.null_resource.update_config_map_aws_auth
so that the local-exec in update_config_map_aws_auth has a running cluster against which to execute.
master will now result in executions like the following depending on the module variables:
module.eks.null_resource.update_config_map_aws_auth (local-exec): error: stat ./kubeconfig_tf-platform-test-kitchen: no such file or directory
module.eks.null_resource.update_config_map_aws_auth: Still creating... (10s elapsed)
module.eks.aws_eks_cluster.this: Still creating... (10s elapsed)
module.eks.null_resource.update_config_map_aws_auth: Still creating... (20s elapsed)
module.eks.aws_eks_cluster.this: Still creating... (20s elapsed)
which eventually fails after it hits the aws_eks_cluster provisioning timeout
Simplest way to reproduce is provisioning a new cluster with minimal default values, producing a cluster with no worker groups, but manage_aws_auth and write_kubeconfig on to allow cluster access.
module "eks" {
source = "github.com/terraform-aws-modules/terraform-aws-eks"
cluster_name = "eks-test-kitchen"
subnets = "<valid subnets>"
vpc_id = "<valid subnets>"
write_kubeconfig = true
manage_aws_auth = true
#No worker groups desired to expose issue
worker_group_count = 0
}
locals or workers_group_defaults_defaults should not have a direct dependency on a resource being available or created (e.g. locals cluster_security_group_id or worker_security_group_id already attempt to address this in their definition)
Regardless of variable values supplied, update_config_map_aws_auth should not execute until after a valid cluster is created or available.
Initial PR - https://github.com/terraform-aws-modules/terraform-aws-eks/pull/147
re: issue 1 - I'm not sure why we'd need iam_role_id = "${element(concat(aws_iam_role.workers.*.id, list("")), 0)}" - this is the aws_iam_role resource code I see on master, which shouldn't need (and shouldn't be able to use!) a splat operator as it lacks a count, correct?
resource "aws_iam_role" "workers" {
name_prefix = "${aws_eks_cluster.this.name}"
assume_role_policy = "${data.aws_iam_policy_document.workers_assume_role_policy.json}"
}
For comparison, the cluster and workers security group declarations both have count.
resource "aws_security_group" "cluster" {
name_prefix = "${var.cluster_name}"
description = "EKS cluster security group."
vpc_id = "${var.vpc_id}"
tags = "${merge(var.tags, map("Name", "${var.cluster_name}-eks_cluster_sg"))}"
count = "${var.cluster_security_group_id == "" ? 1 : 0}"
}
resource "aws_security_group" "workers" {
name_prefix = "${aws_eks_cluster.this.name}"
description = "Security group for all nodes in the cluster."
vpc_id = "${var.vpc_id}"
count = "${var.worker_security_group_id == "" ? 1 : 0}"
tags = "${merge(var.tags, map("Name", "${aws_eks_cluster.this.name}-eks_worker_sg", "kubernetes.io/cluster/${aws_eks_cluster.this.name}", "owned"
))}"
}
Of course, I might be misunderstanding how the splat operator works...
Sorry I didn't make this clear in the original report, but module.eks.local.workers_group_defaults_defaults: local.workers_group_defaults_defaults: Resource 'aws_iam_role.workers' does not have attribute 'id' for variable 'aws_iam_role.workers.id'
is on destroy after an empty destroy (e.g. test kitchen run), race condition issue occurs (timing out execution, and preventing creation of iam role) or something else prevented the resource being created. The error prevents the cleanup of the resources via destroy.
It looks to be related to these two terraform bug reports, each mentioning locals and outputs have similar behavior when resources don't exist (but are expected to)
https://github.com/hashicorp/terraform/issues/18026
https://github.com/hashicorp/terraform/issues/17691
The outputs in this module exhibit similar behavior on destroy when a resource doesn't exist, and so the element / concat pattern can be used on those as well, with no expectation that the resource has a count
* module.eks.output.cluster_version: Resource 'aws_eks_cluster.this' does not have attribute 'version' for variable 'aws_eks_cluster.this.version'
* module.eks.output.worker_iam_role_arn: Resource 'aws_iam_role.workers' does not have attribute 'arn' for variable 'aws_iam_role.workers.arn'
A good example is in the vpc module of this repo:
https://github.com/terraform-aws-modules/terraform-aws-vpc/blob/master/outputs.tf
You can typically get around the output errors using the existing TF_WARN_OUTPUT_ERRORS environment variable, but that has no effect on the locals, thus the need for the patch.
Small additional note - looks like the use of the ami_id = "${data.aws_ami.eks_worker.id}" inside the map is what is getting the attention of the graph walker during the destroy operation, leading to the it then noticing the non-existent iam role.
That makes a lot more sense, I hadn't quite grocked the 'tree walking during destroy' portion of things here!
PR Merged, thanks @laverya and @max-rocket-internet
Most helpful comment
Sorry I didn't make this clear in the original report, but
module.eks.local.workers_group_defaults_defaults: local.workers_group_defaults_defaults: Resource 'aws_iam_role.workers' does not have attribute 'id' for variable 'aws_iam_role.workers.id'is on destroy after an empty destroy (e.g. test kitchen run), race condition issue occurs (timing out execution, and preventing creation of iam role) or something else prevented the resource being created. The error prevents the cleanup of the resources via destroy.
It looks to be related to these two terraform bug reports, each mentioning
localsandoutputshave similar behavior when resources don't exist (but are expected to)https://github.com/hashicorp/terraform/issues/18026
https://github.com/hashicorp/terraform/issues/17691
The outputs in this module exhibit similar behavior on destroy when a resource doesn't exist, and so the element / concat pattern can be used on those as well, with no expectation that the resource has a
countA good example is in the vpc module of this repo:
https://github.com/terraform-aws-modules/terraform-aws-vpc/blob/master/outputs.tf
You can typically get around the output errors using the existing TF_WARN_OUTPUT_ERRORS environment variable, but that has no effect on the
locals, thus the need for the patch.