Terraform-aws-eks: Master Branch - commit b623bc - race condition and syntax issues with 'iam_role_id' in newly provisioned cluster

Created on 1 Oct 2018 · 5Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

This is a bit of a follow-up to https://github.com/terraform-aws-modules/terraform-aws-eks/issues/140

I'm submitting a...

[X] bug report
[ ] feature request
[ ] support request
[ ] kudos, thank you, warm fuzzy

What is the current behavior?

@max-rocket-internet
@laverya

Two related issues stem from PR-137
https://github.com/terraform-aws-modules/terraform-aws-eks/pull/137

Issue 1: Direct use of uncreated resource in workers_group_defaults_defaults during destroy:

module.eks.local.workers_group_defaults_defaults: local.workers_group_defaults_defaults: Resource 'aws_iam_role.workers' does not have attribute 'id' for variable 'aws_iam_role.workers.id'

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/local.tf#L33

Example Fix: iam_role_id = "${element(concat(aws_iam_role.workers.*.id, list("")), 0)}" # Use the specified IAM role if set.

Issue 2: Race Condition
Fixing issue 1 with the element selector work around (or even empty string) allows a test kitchen run to proceed, but exposes a race condition with applying the aws auth config map, attempting to apply it before the cluster is actually created in certain scenarios because the templates no longer have a direct dependency on an existing aws_iam_role.workers or the cluster itself.

v1.6.0 implicitly ordered resource execution properly to ensure
aws_eks_cluster -> eks.aws_iam_role.workers -> local_file.kubeconfig -> module.eks.eks.null_resource.update_config_map_aws_auth
so that the local-exec in update_config_map_aws_auth has a running cluster against which to execute.

master will now result in executions like the following depending on the module variables:

module.eks.null_resource.update_config_map_aws_auth (local-exec): error: stat ./kubeconfig_tf-platform-test-kitchen: no such file or directory
module.eks.null_resource.update_config_map_aws_auth: Still creating... (10s elapsed)
module.eks.aws_eks_cluster.this: Still creating... (10s elapsed)
module.eks.null_resource.update_config_map_aws_auth: Still creating... (20s elapsed)
module.eks.aws_eks_cluster.this: Still creating... (20s elapsed)

which eventually fails after it hits the aws_eks_cluster provisioning timeout

If this is a bug, how to reproduce? Please include a code sample if relevant.

Simplest way to reproduce is provisioning a new cluster with minimal default values, producing a cluster with no worker groups, but manage_aws_auth and write_kubeconfig on to allow cluster access.

module "eks" {
  source = "github.com/terraform-aws-modules/terraform-aws-eks"
  cluster_name = "eks-test-kitchen"
  subnets                                      = "<valid subnets>"
  vpc_id                                       = "<valid subnets>"
  write_kubeconfig = true
  manage_aws_auth = true
 #No worker groups desired to expose issue
 worker_group_count = 0
}

What's the expected behavior?

locals or workers_group_defaults_defaults should not have a direct dependency on a resource being available or created (e.g. locals cluster_security_group_id or worker_security_group_id already attempt to address this in their definition)

Regardless of variable values supplied, update_config_map_aws_auth should not execute until after a valid cluster is created or available.

Are you able to fix this problem and submit a PR? Link here if you have already.

Initial PR - https://github.com/terraform-aws-modules/terraform-aws-eks/pull/147

Environment details

Affected module version:
OS:
Terraform version:

Any other relevant info

Source

mmcaya

Most helpful comment

Sorry I didn't make this clear in the original report, but module.eks.local.workers_group_defaults_defaults: local.workers_group_defaults_defaults: Resource 'aws_iam_role.workers' does not have attribute 'id' for variable 'aws_iam_role.workers.id'
is on destroy after an empty destroy (e.g. test kitchen run), race condition issue occurs (timing out execution, and preventing creation of iam role) or something else prevented the resource being created. The error prevents the cleanup of the resources via destroy.

It looks to be related to these two terraform bug reports, each mentioning locals and outputs have similar behavior when resources don't exist (but are expected to)

https://github.com/hashicorp/terraform/issues/18026
https://github.com/hashicorp/terraform/issues/17691

The outputs in this module exhibit similar behavior on destroy when a resource doesn't exist, and so the element / concat pattern can be used on those as well, with no expectation that the resource has a count

       * module.eks.output.cluster_version: Resource 'aws_eks_cluster.this' does not have attribute 'version' for variable 'aws_eks_cluster.this.version'
       * module.eks.output.worker_iam_role_arn: Resource 'aws_iam_role.workers' does not have attribute 'arn' for variable 'aws_iam_role.workers.arn'

A good example is in the vpc module of this repo:
https://github.com/terraform-aws-modules/terraform-aws-vpc/blob/master/outputs.tf

You can typically get around the output errors using the existing TF_WARN_OUTPUT_ERRORS environment variable, but that has no effect on the locals, thus the need for the patch.

mmcaya on 2 Oct 2018

👍2

All 5 comments

re: issue 1 - I'm not sure why we'd need iam_role_id = "${element(concat(aws_iam_role.workers.*.id, list("")), 0)}" - this is the aws_iam_role resource code I see on master, which shouldn't need (and shouldn't be able to use!) a splat operator as it lacks a count, correct?

resource "aws_iam_role" "workers" {
  name_prefix        = "${aws_eks_cluster.this.name}"
  assume_role_policy = "${data.aws_iam_policy_document.workers_assume_role_policy.json}"
}

For comparison, the cluster and workers security group declarations both have count.

resource "aws_security_group" "cluster" {
  name_prefix = "${var.cluster_name}"
  description = "EKS cluster security group."
  vpc_id      = "${var.vpc_id}"
  tags        = "${merge(var.tags, map("Name", "${var.cluster_name}-eks_cluster_sg"))}"
  count       = "${var.cluster_security_group_id == "" ? 1 : 0}"
}

resource "aws_security_group" "workers" {
  name_prefix = "${aws_eks_cluster.this.name}"
  description = "Security group for all nodes in the cluster."
  vpc_id      = "${var.vpc_id}"
  count       = "${var.worker_security_group_id == "" ? 1 : 0}"
  tags        = "${merge(var.tags, map("Name", "${aws_eks_cluster.this.name}-eks_worker_sg", "kubernetes.io/cluster/${aws_eks_cluster.this.name}", "owned"
  ))}"
}

Of course, I might be misunderstanding how the splat operator works...

laverya on 1 Oct 2018

It looks to be related to these two terraform bug reports, each mentioning locals and outputs have similar behavior when resources don't exist (but are expected to)

https://github.com/hashicorp/terraform/issues/18026
https://github.com/hashicorp/terraform/issues/17691

       * module.eks.output.cluster_version: Resource 'aws_eks_cluster.this' does not have attribute 'version' for variable 'aws_eks_cluster.this.version'
       * module.eks.output.worker_iam_role_arn: Resource 'aws_iam_role.workers' does not have attribute 'arn' for variable 'aws_iam_role.workers.arn'

A good example is in the vpc module of this repo:
https://github.com/terraform-aws-modules/terraform-aws-vpc/blob/master/outputs.tf

You can typically get around the output errors using the existing TF_WARN_OUTPUT_ERRORS environment variable, but that has no effect on the locals, thus the need for the patch.

mmcaya on 2 Oct 2018

👍2

Small additional note - looks like the use of the ami_id = "${data.aws_ami.eks_worker.id}" inside the map is what is getting the attention of the graph walker during the destroy operation, leading to the it then noticing the non-existent iam role.

mmcaya on 2 Oct 2018

That makes a lot more sense, I hadn't quite grocked the 'tree walking during destroy' portion of things here!

laverya on 2 Oct 2018

PR Merged, thanks @laverya and @max-rocket-internet

mmcaya on 9 Oct 2018

🎉1

Was this page helpful?

0 / 5 - 0 ratings