Terraform-aws-eks: tf destroy fails to remove aws_auth: unauthorized

Created on 25 Dec 2020 · 12 Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

I'm submitting a...

  • [x] bug report
  • [ ] feature request
  • [ ] support request - read the FAQ first!
  • [ ] kudos, thank you, warm fuzzy

What is the current behavior?

When I run a destroy operation, I receive:

Error: Unauthorized

The only remaining piece of state is the module's aws_auth config map:

 # module.kubernetes_cluster.module.eks.kubernetes_config_map.aws_auth[0] will be destroyed

Environment details

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name     = var.cluster_name
  cluster_version  = "1.18"
  subnets          = module.vpc.private_subnets
  iam_path         = "/eks/"
  write_kubeconfig = false

  kubeconfig_aws_authenticator_command = "aws"
  kubeconfig_aws_authenticator_command_args = [
    "--region",
    var.region,
    "eks",
    "get-token",
    "--cluster-name",
    var.cluster_name,
  ]

  cluster_encryption_config = [
    {
      provider_key_arn = aws_kms_key.eks.arn
      resources        = ["secrets"]
    }
  ]

  tags = merge(
    var.tags
  )

  vpc_id = module.vpc.vpc_id

  worker_groups = [
    {
      name                 = "worker"
      instance_type        = var.worker_instance_type
      asg_desired_capacity = var.number_of_workers
      asg_max_size         = 5
      iam_role_id          = "k8s-node"
    }
  ]

  workers_additional_policies = [
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    aws_iam_policy.nodes-ec2-policy.arn
  ]
}
  • Affected module version: 3.2.1
  • OS: osx 10.15
  • Terraform version: 0.14.3

All 12 comments

I'm also getting this error... using a bare-bones install:

data "aws_eks_cluster" "cluster" {
  name = module.my-cluster.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.my-cluster.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.cluster.token
  load_config_file       = false
  version                = "~> 1.9"
}

module "my-cluster" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = var.cluster_name
  cluster_version = "1.18"
  subnets         = var.subnet_ids
  vpc_id          = data.aws_vpc.default.id
  worker_additional_security_group_ids = [ var.worker_security_group_id ]

  worker_groups = [
    {
      instance_type = var.worker_instance_type
      asg_max_size  = 5
    }
  ]
}

My terraform destroy stops with an Unauthorized error:

...module.my-cluster.aws_security_group_rule.workers_egress_internet[0]: Destruction complete after 2s
module.my-cluster.aws_security_group_rule.workers_ingress_cluster[0]: Destruction complete after 3s
module.my-cluster.aws_security_group_rule.workers_ingress_cluster_https[0]: Destruction complete after 4s

Error: Unauthorized


Releasing state lock. This may take a few moments...
[terragrunt] 2020/12/28 13:45:54 Hit multiple errors:
exit status 1

With TF_LOG=TRACE, I can see the following:

2020/12/28 14:15:22 [TRACE] dag/walk: visiting "provider[\"registry.terraform.io/hashicorp/aws\"] (close)"
2020/12/28 14:15:22 [TRACE] dag/walk: upstream of "meta.count-boundary (EachMode fixup)" errored, so skipping
2020/12/28 14:15:22 [TRACE] vertex "provider[\"registry.terraform.io/hashicorp/aws\"] (close)": starting visit (*terraform.graphNodeCloseProvider)
2020/12/28 14:15:22 [TRACE] GRPCProvider: Close
2020-12-28T14:15:22.924-0800 [WARN]  plugin.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = transport is closing"
2020-12-28T14:15:22.928-0800 [DEBUG] plugin: plugin process exited: path=.terraform/providers/registry.terraform.io/hashicorp/aws/3.22.0/darwin_amd64/terraform-provider-aws_v3.22.0_x5 pid=5933
2020-12-28T14:15:22.928-0800 [DEBUG] plugin: plugin exited
2020/12/28 14:15:22 [TRACE] vertex "provider[\"registry.terraform.io/hashicorp/aws\"] (close)": visit complete
2020/12/28 14:15:22 [TRACE] dag/walk: upstream of "root" errored, so skipping

Error: Delete "http://localhost/api/v1/namespaces/kube-system/configmaps/aws-auth": dial tcp [::1]:80: connect: connection refused

Terraform version: v0.14.2

It is also strange that it is querying localhost. There also seems to be an order-of-operations issue here: the cluster is gone, but the Terraform state still shows a ConfigMap remaining.

Workaround: remove the resource from the state manually.

terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]

This did the trick for now, until the bug is resolved.

Same error here. We need to change depends_on for kubernetes_config_map.aws_auth in the aws-auth.tf file. Creating the cluster (terraform apply) works fine because null_resource.wait_for_cluster[0] curls the cluster first. On destroy, however, aws_auth is destroyed after the cluster, which is impossible because the cluster is no longer there. The same applies if you use the kubernetes or helm provider to install, for example, a chart. Without
resource "helm_release" "etwas" { depends_on = [module.eks.kubernetes_config_map.aws_auth[0]] }
the destroy command will leave helm_release.etwas and module.eks hanging (error).

I experienced this same issue. After retrying the terraform destroy command, it often deletes the EKS cluster while the kubernetes and helm resources are left behind in the state. This also leaves behind the AWS volumes and load balancers managed by Kubernetes. It really seems like the cluster is being destroyed before the resources. Versions before Terraform v0.14 seemed to implicitly add a dependency so that the Kubernetes resources were destroyed before the EKS cluster. The issue arose after upgrading Terraform and the modules. I also tried building the Terraform development branch (commit 44aeaa59e70f416d582ed3ceccad7f7945f03688) from source and using this module's GitHub master branch, but the issue is still present.

Error: Kubernetes cluster unreachable: the server has asked for the client to provide credentials
Error: Failed to delete Ingress default/my-application-load-balancer because: Unauthorized
Error: Unauthorized

@spaziran wrote: "It is also strange that it is querying localhost."

In my case, I also noticed that Terraform is trying to connect to Kubernetes at localhost, while it should connect to EKS.

@SirBarksALot Good that you noticed that module.eks.kubernetes_config_map.aws_auth[0]: Destroying... [id=kube-system/aws-auth] is run before the Kubernetes resources are destroyed. It explains why the authentication details are no longer available afterward.

Your workaround with depends_on doesn't seem to work for me.

I am running into the same and related issues when destroying this module with the terraform:light Docker image.

I believe this is related to #978, but I have not found any workaround that is suitable for automation.

@TjeuKayim I am running into the same thing. I believe the comment from @SirBarksALot was really about a change inside the module. For your kubernetes_ingress resource, try adding depends_on = [module.eks]. Hopefully your ingress is then destroyed before the cluster, but this will not fix the aws_auth config map dependency issue that @SirBarksALot was referring to.
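
A minimal sketch of that suggestion, assuming the kubernetes provider 1.x schema for kubernetes_ingress. The resource label, the backend Service name, and the port are hypothetical placeholders; only the Ingress name is taken from the error message above. As noted later in the thread, this alone did not reliably fix the destroy order for everyone.

```hcl
# Hypothetical Ingress; the Service name and port are illustrative assumptions.
resource "kubernetes_ingress" "app" {
  metadata {
    name      = "my-application-load-balancer"
    namespace = "default"
  }

  spec {
    backend {
      service_name = "my-application" # assumed Service name
      service_port = 80
    }
  }

  # Depend on the whole module so that, on destroy, Terraform plans to
  # remove the Ingress before any resource inside module.eks.
  depends_on = [module.eks]
}
```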

I have tried many things over the last week and came to the conclusion that it is best to create the auth config map yourself, without the help of this module, while setting manage_aws_auth=false. In my case I have helm and kubernetes resources with depends_on=[module.eks] (so the whole module), and even then they are sometimes destroyed after EKS. I assume it is a problem with the providers and/or Terraform itself.

I have one more idea to try out. I use IRSA (enable_irsa=true), so the problem might be that the IRSA resources are connected to EKS in the wrong manner (I know they are connected, because IRSA resources require the OIDC issuer, module.eks.cluster_oidc_issuer_url). I will keep you updated if I find a reasonable workaround. By the way, in my case it is totally random whether terraform destroy works like a charm or stumbles on the EKS connection (i.e. the auth config map).
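
A rough sketch of managing the config map outside the module, as described above. It assumes the module version in use exposes the manage_aws_auth input and a worker_iam_role_arn output, and that a single worker role mapping is enough; check both against the module's documentation. The variable and module references mirror the original report.

```hcl
module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = var.cluster_name
  cluster_version = "1.18"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id

  # Tell the module not to create kube-system/aws-auth itself.
  manage_aws_auth = false
}

# The config map is now an ordinary root-level resource, so its
# dependencies can be stated explicitly.
resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # Assumed mapping: lets worker nodes using the module's default
    # worker role join the cluster.
    mapRoles = yamlencode([
      {
        rolearn  = module.eks.worker_iam_role_arn
        username = "system:node:{{EC2PrivateDNSName}}"
        groups   = ["system:bootstrappers", "system:nodes"]
      }
    ])
  }

  depends_on = [module.eks]
}
```

Other Kubernetes or Helm resources can then reference kubernetes_config_map.aws_auth directly in their own depends_on, which is not possible for a resource that lives inside the module.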

@TjeuKayim Do not worry about the load balancer (and probably the security group) that, I assume, the ingress-nginx installation creates. If we solve the dependency/auth config map problem so that the helm/k8s resources are deleted before EKS, the LB and SG will be destroyed too. Just keep in mind that destroying the LB and SG takes a few seconds, during which we should not destroy EKS. For that I have created a null_resource that waits for the LB (I just have to copy it and reverse it for destruction xD).
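
One way such a destroy-time wait might look, as a sketch rather than the commenter's actual resource. It assumes the hashicorp/null provider, a fixed 60-second pause instead of real polling, and the public ingress-nginx chart; all resource labels are hypothetical.

```hcl
# Destroy order implied by depends_on:
#   helm_release.ingress_nginx -> null_resource.wait_for_lb_cleanup -> module.eks
resource "null_resource" "wait_for_lb_cleanup" {
  depends_on = [module.eks]

  # Runs only on destroy, after the Helm release is gone and before
  # anything in module.eks is removed.
  provisioner "local-exec" {
    when    = destroy
    command = "sleep 60" # assumed to be long enough for LB and SG cleanup
  }
}

resource "helm_release" "ingress_nginx" {
  name       = "ingress-nginx"
  repository = "https://kubernetes.github.io/ingress-nginx"
  chart      = "ingress-nginx"
  namespace  = "kube-system"

  depends_on = [null_resource.wait_for_lb_cleanup]
}
```

A fixed sleep is the simplest choice; polling the AWS API for the load balancer would be more precise but needs credentials and tooling on the machine running Terraform.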

Just wanted to add that our team is also experiencing this issue with the basic Terraform example. On roughly 40% of destroys we get the "Unauthorized" error; the rest of the time it works perfectly. Unfortunately this makes a CI/CD process very difficult, so we are very eager to hear about any solutions.

For those of you who are using
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]
as a workaround: are you doing this in CI/CD, and if so, do you need to run destroy multiple times? For example:

terraform destroy (fails on unauthorized)
terraform destroy (fails on unauthorized)
terraform destroy (fails on localhost refused)
terragrunt state rm module.my-cluster.kubernetes_config_map.aws_auth[0]

My concern, of course, is whether you find that it cleans up all the Terraform-created resources. I often had to go back into AWS and clean up manually unless I ran destroy multiple times.

Thanks!

@JohnPolansky Same here, sometimes it works and sometimes it doesn't. I have a feeling that if you create a cluster and immediately destroy it, then it works, but if you wait a bit, the auth config stops working. I have read somewhere that this config map might have a timer. Do you experience the same thing, John?

We can try to track down exactly which commit to the Terraform repository caused the regression. I know for sure that v0.14.3 is affected and that v0.13.5 is not affected by this issue. I didn't test the versions in between, and @spaziran was using v0.14.2. Has anyone here experienced the issue with other Terraform versions? Are v0.14.{0,1} and v0.13.6 affected?
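
While the regression is being narrowed down, a configuration can be pinned to the one version reported as unaffected in this thread. This is only a sketch based on that single report (v0.13.5), not a confirmed fix.

```hcl
terraform {
  # Refuse to run with any other Terraform version; 0.13.5 is the version
  # reported above as not showing the destroy-order problem.
  required_version = "= 0.13.5"
}
```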

@SirBarksALot I've seen the issue both when I created the cluster and destroyed it within ~1 min, and when I created the cluster and destroyed it ~2 hours later. I've also seen it succeed in both cases. It's very weird. I did also read somewhere that the Terraform auth is only good for 15 mins, but I don't think that applies here, as my destroys fail after ~5-7 mins.
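
If token lifetime does turn out to matter, one option is to let the kubernetes provider fetch a fresh token on demand instead of reading it once from the aws_eks_cluster_auth data source. The sketch below assumes a kubernetes provider release that supports the exec block, the AWS CLI available on the machine running Terraform, and the v1alpha1 exec API version current at the time. Whether it helps with the destroy-order problem discussed here is not established in this thread.

```hcl
provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  load_config_file       = false

  # Request a token whenever the provider needs one, rather than reusing
  # a token captured at the start of the run.
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
  }
}
```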

@TjeuKayim My co-worker is on 0.14.4 and I'm on 0.14.3, and we've both experienced the "unauthorized/configmap/aws-auth" issue. We are both very eager to resolve this, so if you are looking for testers when the time comes, count us in.
