Terraform-aws-eks: Managed Node Groups attempting unnecessary asset replacement

Created on 7 Dec 2019 · 6 Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

Error: error creating EKS Node Group ${EKS_CLUSTER}:${EKS_CLUSTER}-${NODE_NAME}-${RANDOM_PET}: ResourceInUseException: NodeGroup already exists with name ${EKS_CLUSTER}-${NODE_NAME}-${RANDOM_PET} and cluster name ${EKS_CLUSTER}
    status code: 409, request id: 722abab9-21b6-418a-99c8-8c974adbf16a

  on ../../terraform-aws-eks/node_groups.tf line 69, in resource "aws_eks_node_group" "workers":
  69: resource "aws_eks_node_group" "workers" {

I'm submitting a...

[x] bug report
[ ] feature request
[ ] support request - read the FAQ first!
[x] kudos, thank you, warm fuzzy

What is the current behavior?

The module should replace managed node groups only when instance_type, ec2_ssh_key, source_security_group_ids, or node_group_name changes, but it is attempting a replacement on any change. The replacement then causes a name collision, something that random_pet should be preventing(?).

If this is a bug, how to reproduce? Please include a code sample if relevant.

1. terraform apply creates the VPC, EKS cluster, and managed worker node group.
2. A second terraform apply attempts to re-create the managed worker node group.
3. The apply fails due to the duplicate name.

What's the expected behavior?

The second terraform apply should not attempt to replace the managed node group, since nothing has changed.

Are you able to fix this problem and submit a PR? Link here if you have already.

Environment details

Affected module version: Master
OS: OSX
Terraform version: 0.12.10

Any other relevant info

It is worth noting that if any of the keepers for random_pet do change, the expected create_before_destroy behavior is respected.
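The interaction between random_pet keepers and create_before_destroy can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the variable names, the "workers" name segment, and the single instance_type keeper are assumptions made for the example.

```hcl
# Hypothetical inputs for this sketch (not the module's real variables).
variable "cluster_name" { type = string }
variable "node_role_arn" { type = string }
variable "subnet_ids" { type = list(string) }

variable "instance_type" {
  type    = string
  default = "m5a.large"
}

# A new pet name is generated only when one of the keeper values changes.
resource "random_pet" "node_group" {
  separator = "-"
  length    = 2

  keepers = {
    instance_type = var.instance_type
  }
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = var.cluster_name
  node_group_name = "${var.cluster_name}-workers-${random_pet.node_group.id}"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  # Reading the value through the keeper ties the name to the instance type.
  instance_types = [random_pet.node_group.keepers.instance_type]

  scaling_config {
    desired_size = 1
    max_size     = 10
    min_size     = 1
  }

  # The replacement is created first, so it needs a name that is not in use.
  lifecycle {
    create_before_destroy = true
  }
}
```

Under these assumptions, a change to instance_type regenerates the pet name, so the replacement node group gets a fresh name and can exist alongside the old one. A replacement forced by anything that is not a keeper reuses the current name, which would explain the 409 ResourceInUseException above.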

TF Output(Slightly Redacted)

Terraform will perform the following actions:

  # module.${CLUSTER_NAME}-eks.aws_eks_node_group.workers["${NODE_NAME}"] must be replaced
+/- resource "aws_eks_node_group" "workers" {
      ~ ami_type        = "AL2_x86_64" -> (known after apply)
      ~ arn             = "arn:aws:eks:us-east-1:${AWS_ACCOUNT}:nodegroup/${CLUSTER_NAME}/${CLUSTER_NAME}-${NODE_NAME}-${RANDOM_PET}/a8b770ba-3a3b-bead-1e7d-868c50634d14" -> (known after apply)
        cluster_name    = "${CLUSTER_NAME}"
      ~ disk_size       = 20 -> (known after apply)
      ~ id              = "${CLUSTER_NAME}:${CLUSTER_NAME}-${NODE_NAME}-${RANDOM_PET}" -> (known after apply)
        instance_types  = [
            "m5a.large",
        ]
        labels          = {
            "NodeGroupType" = "Managed"
        }
        node_group_name = "${CLUSTER_NAME}-${NODE_NAME}-${RANDOM_PET}"
        node_role_arn   = "arn:aws:iam::${AWS_ACCOUNT}:role/${CLUSTER_NAME}-managed-node-groups"
      ~ release_version = "1.14.7-20190927" -> (known after apply)
      ~ resources       = [
          - {
              - autoscaling_groups              = [
                  - {
                      - name = "eks-REDACT"
                    },
                ]
              - remote_access_security_group_id = ""
            },
        ] -> (known after apply)
      ~ status          = "ACTIVE" -> (known after apply)
        subnet_ids      = [
            "subnet-REDACT",
            "subnet-REDACT",
            "subnet-REDACT",
        ]
        tags            = {
            "NodeGroupType" = "Managed"
        }
        version         = "1.14"

      + remote_access { # forces replacement}

        scaling_config {
            desired_size = 1
            max_size     = 10
            min_size     = 1
        }
    }

kamirendawkins

👍3

All 6 comments

This is probably related to the note left in node_groups.tf:

  # This sometimes breaks idempotency as described in https://github.com/terraform-providers/terraform-provider-aws/issues/11063
  remote_access {
    ec2_ssh_key               = lookup(each.value, "key_name", "") != "" ? each.value["key_name"] : null
    source_security_group_ids = lookup(each.value, "key_name", "") != "" ? lookup(each.value, "source_security_group_ids", []) : null
  }

After setting at least ec2_ssh_key, I get the expected behavior.

kamirendawkins on 7 Dec 2019

I was having the same issue with the default configuration in the managed_node_group example. I think it can be fixed by making the remote_access block dynamic. Something like:

  dynamic "remote_access" {
    # Build a one-element list and drop it when both values are null,
    # so the block is omitted entirely when no key_name is set.
    for_each = [for s in [{
      ec2_ssh_key               = lookup(each.value, "key_name", "") != "" ? each.value["key_name"] : null
      source_security_group_ids = lookup(each.value, "key_name", "") != "" ? lookup(each.value, "source_security_group_ids", []) : null
    }] : s if s["ec2_ssh_key"] != null || s["source_security_group_ids"] != null]

    content {
      ec2_ssh_key               = remote_access.value["ec2_ssh_key"]
      source_security_group_ids = remote_access.value["source_security_group_ids"]
    }
  }

I'll open a PR with these changes if they resolve the issue.

jeffmhastings on 9 Dec 2019

👍3

@jeffmhastings I tried your snippet and it worked well for me.
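For readers who want to try the idea outside the module, here is a shorter variant of the same approach as a standalone sketch. It is not the code from the eventual PR: the variables are hypothetical stand-ins for the values the module reads from each.value, and the for_each is simplified to a plain conditional on the key name.

```hcl
# Hypothetical inputs for this sketch; the module reads these from each.value instead.
variable "cluster_name" { type = string }
variable "node_role_arn" { type = string }
variable "subnet_ids" { type = list(string) }

variable "key_name" {
  type    = string
  default = "" # empty string means "no SSH access"
}

variable "source_security_group_ids" {
  type    = list(string)
  default = []
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = var.cluster_name
  node_group_name = "${var.cluster_name}-workers"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = 1
    max_size     = 10
    min_size     = 1
  }

  # Emit the block once when a key is set, and not at all otherwise,
  # so an empty remote_access block never appears in the configuration.
  dynamic "remote_access" {
    for_each = var.key_name != "" ? [var.key_name] : []

    content {
      ec2_ssh_key               = remote_access.value
      source_security_group_ids = length(var.source_security_group_ids) > 0 ? var.source_security_group_ids : null
    }
  }
}
```

With key_name left empty, the plan should contain no remote_access block at all, which is the condition that avoided the forced replacement in this thread. Running terraform plan twice in a row is a reasonable way to check that the second run reports no changes.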