Terraform-provider-kubernetes: v2.0.1 Authentication failures with token retrieved via aws_eks_cluster_auth

Created on 26 Jan 2021 · 15 Comments · Source: hashicorp/terraform-provider-kubernetes

Terraform Version, Provider Version and Kubernetes Version

Terraform version: 0.12.24
Kubernetes provider version: 2.0.1
Kubernetes version: v1.16.15-eks-ad4801

Affected Resource(s)

Terraform Configuration Files

data "aws_eks_cluster" "c" {
  name = var.k8s_name
}

data "aws_eks_cluster_auth" "c" {
  name = var.k8s_name
}

provider "kubernetes" {
  host = data.aws_eks_cluster.c.endpoint

  cluster_ca_certificate = base64decode(data.aws_eks_cluster.c.certificate_authority.0.data)

  token = data.aws_eks_cluster_auth.c.token
}

Debug Output

Panic Output

Steps to Reproduce

Expected Behavior

What should have happened?
Resources should have been created/modified/deleted.

Actual Behavior

What actually happened?

Error: the server has asked for the client to provide credentials
Error: Failed to update daemonset: Unauthorized
Error: Failed to update deployment: Unauthorized
Error: Failed to update deployment: Unauthorized
Error: Failed to update service account: Unauthorized
Error: Failed to update service account: Unauthorized
Error: Failed to delete Job! API error: Unauthorized
Error: Failed to update service account: Unauthorized
Error: the server has asked for the client to provide credentials
Error: the server has asked for the client to provide credentials
Error: Failed to update deployment: Unauthorized
Error: Failed to update service account: Unauthorized
Error: the server has asked for the client to provide credentials
Error: Failed to delete Job! API error: Unauthorized
Error: Failed to update daemonset: Unauthorized

Important Factoids

No, we're just using EKS.

References

GH-1234

Community Note

Please vote on this issue by adding a +1 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

bug themauth

Source

tomaspinho

👍9 👀1

Most helpful comment

Using exec is not a viable solution when running in Terraform Cloud with remote execution. Our current thinking is to implement a workaround that essentially taints the aws_eks_cluster_auth data source so it gets refreshed on every plan. Ideally, the Kubernetes provider would natively support getting and refreshing authentication tokens / credentials for managed Kubernetes services, in order to support environments in which the only guaranteed tooling is Terraform itself.

albertrdixon on 14 Apr 2021

👍4

All 15 comments

Hi, same problem here with Terraform v0.14.5, but a different error message:

Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

The configuration is the same as with the previous provider version.

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks.token
}

angelabad on 27 Jan 2021

Can you try running terraform refresh to see if that pulls in a new token? The token generated by aws_eks_cluster_auth is only valid for 15 minutes. For this reason, we recommend using an exec plugin to keep the token up to date automatically. Here's an example of that configuration:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

Alternatively, running the Kubernetes provider in a separate terraform apply from the EKS cluster creation should work every time. (I'm not sure offhand if your EKS cluster is being created in the same apply, but just guessing since it's a common configuration.)

There's also a working EKS example you can compare with your configs. There are some improvements coming soon for the example, since we're working on related authentication issues.
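
The separate-apply approach can be sketched as a second root module that manages only Kubernetes resources and looks the cluster up by name. This is a minimal sketch, assuming the cluster already exists from an earlier apply; the variable `cluster_name`, the data source label `this`, and the `example` namespace are hypothetical names chosen for illustration.

```hcl
# Second root module: applied only after the EKS cluster already exists.
variable "cluster_name" {
  type        = string
  description = "Name of an already-provisioned EKS cluster (hypothetical input)."
}

data "aws_eks_cluster" "this" {
  name = var.cluster_name
}

data "aws_eks_cluster_auth" "this" {
  name = var.cluster_name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}

# Any Kubernetes resource works here; a namespace keeps the sketch small.
resource "kubernetes_namespace" "example" {
  metadata {
    name = "example"
  }
}
```

Because the cluster is not created in this configuration, both data sources can be read during planning, so the provider should receive real values when it initializes. The token is still short-lived, so a very long apply may still be better served by the exec block shown above.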

dak1n1 on 10 Feb 2021

👍3 👎1

@dak1n1 I am considering this as a temporary workaround.

nikitazernov on 10 Feb 2021

Not sure about the 15mins issue, as we've been using this provider for almost a year now and the token validity has never been a problem. In fact, downgrading the provider to <2.0 works as expected.

I'll try force refreshing the token and report back the results.
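
Pinning the provider below 2.0 while the issue is investigated might look like the following. This is a sketch that assumes the Terraform 0.13+ `required_providers` syntax; on 0.12 the `source` argument is not available and the constraint would have to be expressed differently.

```hcl
terraform {
  required_providers {
    kubernetes = {
      source = "hashicorp/kubernetes"
      # Stay on the 1.x series until the 2.x authentication behaviour is understood.
      version = "< 2.0.0"
    }
  }
}
```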

tomaspinho on 10 Feb 2021

Thanks! And about the downgrade fixing this -- that makes sense. Depending on your provider configuration, prior to 2.0, the Kubernetes provider may have actually been reading the KUBECONFIG environment variable (despite your valid configuration which includes a token and does not reference the kubeconfig file). This was a source of confusion that we were aiming to alleviate. The authentication workflow still needs some work though.
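
For comparison, in provider 2.x a kubeconfig file has to be requested explicitly, as far as the 2.0 upgrade notes describe it. A minimal sketch, assuming a kubeconfig exists at the path shown; the path and the context name are hypothetical.

```hcl
provider "kubernetes" {
  # Opt in to kubeconfig-based authentication explicitly (provider >= 2.0).
  config_path = "~/.kube/config"

  # Hypothetical context name; omit to use the kubeconfig's current context.
  config_context = "my-eks-context"
}
```

A configuration that sets host, cluster_ca_certificate and token (or exec) instead should not need either argument.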

dak1n1 on 10 Feb 2021

The KUBECONFIG issue is not present in our environment as we run Terraform in GitLab CI and never use that file to authenticate to clusters from it.

tomaspinho on 10 Feb 2021

Terraform version: 0.14.5
Kubernetes provider version: 2.0.2
Kubernetes version: v1.18.9

I tried an apply with a clean state using exec instead of the token in the kubernetes provider, on the initial run when the EKS cluster is created. I get the same Error: Unauthorized results with both methods when trying to apply my Kubernetes resources.

Using the exec

```
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.26.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0.2"
    }
  }
}

provider "aws" {
  region = var.region
}

data "aws_eks_cluster_auth" "cluster_token" {
  name = module.eks.name
}

provider "kubernetes" {
  host                   = module.eks.endpoint
  cluster_ca_certificate = base64decode(module.eks.certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}
```

As stated in the comments above, the Kubernetes resources are created correctly on a retry of the pipeline, using either the token or the exec method.

loungerider on 10 Feb 2021

@loungerider Thanks for testing this. I believe the issue in your case has to do with certain parameters passed into the Kubernetes provider which are unknown at the time of the provider initialization. I'm guessing module.eks.endpoint is unknown at plan time, but also the data source is probably being read too soon.

In the data source, the value of name = module.eks.name is likely known before the cluster is ready. So the data source will read the cluster too early, and pass invalid credentials into the Kubernetes provider. I'll show you an example that will make the data source wait until the cluster is ready:

data "aws_eks_cluster" "default" {
  name = module.eks.cluster_id
}

# This data source is only needed if you're passing the token into the provider using `token =`.
data "aws_eks_cluster_auth" "default" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  # This defers provider initialization until the cluster is ready
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)

  # This keeps the token up-to-date during subsequent applies, even if they run longer than the token TTL.
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}

I'm assuming you're using the EKS module here, which has an output that waits for the cluster API to be ready (cluster_id). That's why the data source needs to know about cluster_id. Another option would be to add a depends_on explicitly to wait for this field (depends_on = [module.eks.cluster_id])

I also added a data source to read the cluster's hostname and CA cert data, so it will be able to read the new hostname and certs, if those ever change, such as on the first apply, or during cluster replacement.

Although a single apply scenario like this is less reliable than running apply twice, it is possible to do; it just has these gotchas to be aware of.

dak1n1 on 17 Feb 2021

@dak1n1 I'm getting the same errors with the following:

Terraform version: 0.14.6
Kubernetes provider version: 2.0.2
EKS version: v1.18.9 -> v1.19.6

As you can see, the only change I'm attempting is to upgrade EKS from 1.18 to 1.19. Without posting all the code, here are the relevant portions:

resource "null_resource" "wait_for_cluster" {
  depends_on = [aws_eks_cluster.cluster]

  provisioner "local-exec" {
    command     = "for i in `seq 1 60`; do if `command -v wget > /dev/null`; then wget --no-check-certificate -O - -q $ENDPOINT/healthz >/dev/null && exit 0 || true; else curl -k -s $ENDPOINT/healthz >/dev/null && exit 0 || true;fi; sleep 5; done; echo TIMEOUT && exit 1"
    interpreter = ["/bin/sh", "-c"]
    environment = {
      ENDPOINT = aws_eks_cluster.cluster.endpoint
    }
  }
}

data "aws_eks_cluster" "eks_cluster" {
  name       = aws_eks_cluster.cluster.name
  depends_on = [null_resource.wait_for_cluster]
}

data "aws_eks_cluster_auth" "eks_cluster" {
  name       = aws_eks_cluster.cluster.name
  depends_on = [null_resource.wait_for_cluster]
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks_cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.eks_cluster.token
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.eks_cluster.endpoint
    token                  = data.aws_eks_cluster_auth.eks_cluster.token
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster.certificate_authority.0.data)
  }
}

My module follows the same conventions as the module you mentioned above except that I'm using the token instead of the exec method. We use Terraform Cloud for our workflow and I don't believe the AWS CLI is installed on those workers. The docs also warn against trying to install extra software on workers, and even if you decide to ignore that advice, doing so is kinda hacky to say the least. So IMO using the AWS CLI to generate creds should not be a solution to this issue.

I've tried running this multiple times and always get errors like these:

Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
Error: Get "http://localhost/apis/rbac.authorization.k8s.io/v1/namespaces/default/rolebindings/edit": dial tcp 127.0.0.1:80: connect: connection refused

My first question would be, is the token being stored somewhere in the state? I would assume the data source would be refreshed every run in case something changed (in this case I assume the token would be new with every run) therefore the 15 minute expiration should only be an issue on initial cluster creation where the token is created before the cluster. In the case above I would assume that should never happen due to the dependency chain of aws_eks_cluster -> null_resource -> aws_eks_cluster_auth.

If the token is refreshed every time, then why am I seeing this error when specifying an upgrade to an already provisioned cluster? The upgrade is not changing the cluster name; it should change in place. The existing cluster should be there, so the token should be created and the provider should be able to read the cluster state and make an appropriate plan. I also find it very curious that I don't see any errors like this related to resources provisioned by the helm provider. I don't know if maybe that's because the errors in the kubernetes provider are ending the plan before it gets to helm or if there is something different in how Helm is doing things that dodges this issue.

I may try downgrading my provider to < 2.0 to see if this works there. If that's the case it's not a hidden KUBECONFIG file issue as you mentioned above because we run this on TFC and don't generate a KUBECONFIG file in our TF code for clusters. If I do try this I will try to remember to post results here.

jw-maynard on 21 Feb 2021

Did some further digging and we may be barking in the wrong place: https://github.com/hashicorp/terraform-provider-aws/issues/10269#issuecomment-777906069

jw-maynard on 21 Feb 2021

👍2

@jw-maynard I'm glad you found that other issue! It sounds like the EKS cluster could be getting replaced rather than updated in-place. Could you do a terraform plan to confirm this? (There should be a line that tells you if a change "forces replacement").

What I saw in your configuration is what we call a "single apply" scenario, that is, a configuration which contains both the EKS cluster (aws_eks_cluster.cluster) and the Kubernetes resources that will live on that cluster. In a single apply scenario, any replacement of an underlying Kubernetes cluster will cause the Kubernetes provider to fail to initialize, unless you do a specific workaround that I'll mention below.

This is a known limitation in Terraform core, which I recently saw described well in this comment. It's a problem any time you have a provider that depends on a resource (in this case, the Kubernetes provider is dependent on information from aws_eks_cluster.cluster, which is read from the data source... but that information is not available when the provider is initialized, because, presumably, the cluster is getting replaced).

If an underlying Kubernetes cluster is going to be replaced, and you already have Kubernetes resources provisioned using the Kubernetes provider, you'll have to work around this issue by doing a terraform state rm on the module containing all the Kubernetes resources (there's an example here). That way the Kubernetes resources will be recreated on the new cluster, and the terraform plan will succeed. Otherwise, the provider tries to initialize using an empty credentials block, since it does not yet know the credentials associated with the cluster being replaced.

This workaround is only needed in single-apply scenarios where you have the cluster and the Kubernetes resources sharing a single state. In general, it's more reliable to keep the Kubernetes resources in a separate state from the EKS cluster resource (for example, a different workspace in TFC, or a different root module). Two applies will work every time, but a single apply involves some work-arounds, depending on the scenario.
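
The separate-state layout for Terraform Cloud can be sketched as a second workspace that reads the cluster name from the first one. This is only an illustration: the organization `example-org`, the workspace `eks-cluster`, and the output `cluster_name` are hypothetical, and the sketch assumes the cluster workspace actually exports such an output.

```hcl
# Workspace 2: Kubernetes resources only. The cluster lives in another workspace.
data "terraform_remote_state" "eks" {
  backend = "remote"

  config = {
    organization = "example-org" # hypothetical TFC organization
    workspaces = {
      name = "eks-cluster" # hypothetical workspace that manages aws_eks_cluster
    }
  }
}

data "aws_eks_cluster" "this" {
  name = data.terraform_remote_state.eks.outputs.cluster_name
}

data "aws_eks_cluster_auth" "this" {
  name = data.terraform_remote_state.eks.outputs.cluster_name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}
```

With this split, a change to the cluster is applied in the first workspace, and the Kubernetes workspace only ever plans against a cluster that already exists. The token method is used here because it does not depend on the AWS CLI being present on the worker.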

dak1n1 on 23 Feb 2021

@dak1n1 It never gets that far because the plan errors but I know that version upgrades in EKS are an update in place scenario for sure. I guess they could have introduced a bug in the aws provider but I don't think so.

I did a lot of digging around in logs at the TRACE level for this plan and found some differences in how a successful plan handles the two data sources compared to how it handles them in a plan where I try to upgrade the version. Unfortunately I'm not familiar enough with the inner workings of TF and its providers to know if this is fixable in the provider or not. I'm happy to share my findings privately with anyone at HashiCorp who's willing to listen. Single apply scenarios seem to be something that a fair number of people would like to be able to do when working with Kubernetes on cloud providers.

I can share what I think is the difference in the two runs. The failed one ends up in here for both EKS data sources (I'm just sharing aws_eks_cluster_auth but aws_eks_cluster has the same log line):

2021/02/21 20:39:29 [TRACE] evalReadDataPlan: module.kubernetes_cluster.data.aws_eks_cluster_auth.eks_cluster configuration is fully known, but we're forcing a read plan to be created

This appears to be coming from here https://github.com/hashicorp/terraform/blob/618a3edcd13f5231a77a699b7ba2a3fba352b7a3/terraform/eval_read_data_plan.go#L65 which tells me that n.forcePlanRead(ctx) is true. Since the successful runs hit a log that comes from L107 (linked below), it seems to point to the failures running into something inside the if block from L63 to L103 and falling apart there.

In a working run where the version is not updated, I don't see the above at all, but I see this:

2021/02/21 20:37:10 [TRACE] EvalReadData: module.kubernetes_cluster.data.aws_eks_cluster_auth.eks_cluster configuration is complete, so reading from provider
2021/02/21 20:37:10 [TRACE] GRPCProvider: ReadDataSource
2021-02-21T20:37:10.945Z [INFO]  plugin.terraform-provider-aws_v3.29.0_x5: 2021/02/21 20:37:10 [DEBUG] Reading EKS Cluster: {
  Name: "kubernetes01"
}: timestamp=2021-02-21T20:37:10.943Z
2021/02/21 20:37:10 [WARN] Provider "registry.terraform.io/hashicorp/aws" produced an unexpected new value for module.kubernetes_cluster.data.aws_eks_cluster_auth.eks_cluster.
      - .token: inconsistent values for sensitive attribute

Then a call to eks/DescribeCluster. This EvalReadData appears to be logged inside the readDataSource here https://github.com/hashicorp/terraform/blob/618a3edcd13f5231a77a699b7ba2a3fba352b7a3/terraform/eval_read_data_plan.go#L107

So in the failed state it seems like the data source is not even updating for some reason. Odd considering the cluster would be updated in place. The fact that there's no read of the data source in the failure when something is changing just makes me feel like there's a logical bug somewhere maybe in core, but I don't feel knowledgeable enough to articulate it in an issue over there.

All that being said, I am aware of the pitfalls with single apply scenarios and this certainly may be one of those issues. The unfortunate part is that, like they do with the EKS module you posted above, there are some things in EKS that require managing resources inside the cluster (aws-auth being a notable one) and it seems clunky to have to use two modules to fully provision one resource (EKS) to our specs.

jw-maynard on 24 Feb 2021

@dak1n1 This config worked for me. Thanks!

```
data "aws_eks_cluster" "default" {
  name       = module.eks.name
  depends_on = [module.eks.name]
}

data "aws_eks_cluster_auth" "default" {
  name = module.eks.name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}
```

loungerider on 26 Feb 2021

🚀1 🎉1

We faced the same issue when running destroy (introduced in Terraform 0.14). Multiple providers are affected: helm, kubernetes, kubernetes-alpha. In 0.14, data sources are no longer refreshed on destroy, which causes provider issues. This was implemented as part of:
https://github.com/hashicorp/terraform/issues/15386

A related issue (now closed):
https://github.com/hashicorp/terraform/issues/27172

For example, any provider using the aws_eks_cluster_auth data source will fail on destroy:

data "aws_eks_cluster_auth" "cluster" {
  name = var.cluster_name
}

The proposed workaround is to run plan or refresh (which may not be the best solution for every team).

vitali-shcharbin on 30 Apr 2021
