The k8s provider does not seem to reliably work when load_config_file = false, as it would be when using this module to create a new cluster. I frequently see Unauthorized and/or attempts to call the endpoint on localhost. In one execution, I actually found that this module deleted the aws-auth config map from a cluster that was defined as the default context in my kubeconfig but was not in any way related to my terraform run.
Without defining your cluster in .kubeconfig, the provider cannot reliably be configured. This seems to make sense based on a bunch of bugs in the terraform-provider-kubernetes project, most specifically:
https://github.com/terraform-providers/terraform-provider-kubernetes/issues/521
Others seem to confirm tons of bugs when setting load_config_file = false. This one seemed most relevant since it also points toward long-standing issues with terraform itself when interpolated values are used to configure a provider:
https://github.com/hashicorp/terraform/issues/4149
In the execution where it operated on a different cluster, the deletion of the config map succeeded, but it then attempted to apply the config map back to an endpoint on localhost. In other executions, the deletion attempt itself ran against localhost. This suggests a timing issue: terraform appears to still be deferring the provider configuration while this module is already trying to manage the config map.
I could not create a cluster using the provided example unless, after cluster creation, I added the cluster config to my local kubeconfig.
Regardless of any upstream issues with terraform and the k8s provider, this module should _never_ operate on a cluster it didn't define. Defining a local kubeconfig file and pointing the provider to that may be the best option.
I'm not sure what the best fix is here. Given the various bugs in GitHub for this, I personally feel like the only workaround here is to set manage_aws_auth = false always until the upstream provider addresses these issues.
attempts to call the endpoint on localhost
I've seen this also.
@cdaniluk
KUBERNETES_xxx) ?v1.10.0_x4$ set |grep KUBE
$
I think this would address one of the cases I've seen (Unauthorized). But not the calls to the localhost endpoint. I see Unauthorized on subsequent calls after the cluster is provisioned but calls to localhost when provisioning a new cluster.
Here's my provider config, taken straight from the example:
provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.cluster.token
  load_config_file       = false
  version                = "~> 1.10"
}
Let me know what debug output you would like to see.
Also fwiw I figured out why the provider was talking to another cluster entirely (sort of). I imported the config map from my cluster with the provider config above. It used my default kubeconfig context, which at the time was another cluster. Thus when this module went to delete the config map to recreate it (which in and of itself is scary!), it deleted it from the original context, then attempted to recreate in the undefined / localhost context.
tbh given the handful of bugs open in the k8s provider, I think the ideal fix would be to support exporting the config map to a file as in previous releases for those of us who are scared at the thought of directly managing a resource that can permanently revoke your access to the cluster. I'm running manage_aws_auth = false right now, but that means I have to hand generate the config maps for new clusters. I'd be happy to submit a PR along those lines.
I think this would address one of the cases I've seen (Unauthorized). But not the calls to the localhost endpoint. I see Unauthorized on subsequent calls after the cluster is provisioned but calls to localhost when provisioning a new cluster.
Did you try it ?
Let me know what debug output you would like to see.
The kubernetes provider output.
@cdaniluk (cc @max-rocket-internet) I just posted on the thread you linked to. I was running into a similar problem: either Terraform would complain about a missing kubeconfig file or I would accidentally trigger updates on other clusters (because my KUBECONFIG environment variable was being used despite explicitly setting up a Terraform kubernetes provider).
I found out later that it was actually the helm provider that I had not explicitly set up that was causing all the problems. Because I didn't set up my helm provider with the appropriate Kubernetes settings, helm would complain that it couldn't load the default ~/.kube/config file, and when I happen to have KUBECONFIG set up, it would use that to spin up new pods.
If you are also using helm, you might want to give that a shot.
Best of luck!
Using helm.. but not with tf and have not set up a helm provider. This chart doesn't seem to do so implicitly.
@max-rocket-internet I need to set up an environment I can safely test this in and haven't had a chance to do so yet. Will try to over the weekend.
I have the same issue, no KUBE_ env variables.
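For anyone wanting to run the same check, here is a minimal sketch that lists every exported variable whose name starts with KUBE. As far as I understand, both kubectl-style tooling (KUBECONFIG) and the provider (variables such as KUBE_HOST) can pick these up. The function name list_kube_env is made up for this example.

```shell
# Print the names of exported variables that start with KUBE
# (for example KUBECONFIG or KUBE_HOST), one per line, sorted.
# list_kube_env is a hypothetical helper name, not part of any tool.
list_kube_env() {
  env | awk -F= '/^KUBE/ { print $1 }' | sort
}

# Warn before a terraform run if anything is set.
if [ -n "$(list_kube_env)" ]; then
  echo "KUBE-related variables are set in this shell:" >&2
  list_kube_env >&2
fi
```

An empty result only rules out the environment; a default context in ~/.kube/config can still be picked up if the provider ends up loading the config file.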
@cdaniluk if you have the EKS cluster resource being created or updated in the same apply operation as the Kubernetes provider, things won't work as you expect. This is due to an issue in Terraform itself.
Please see here https://www.terraform.io/docs/providers/kubernetes/index.html#stacking-with-managed-kubernetes-cluster-resources and the TF docs link in that paragraph.
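One way to keep the two in separate apply operations is a targeted apply for the cluster followed by a normal apply. This is only a sketch: the function name two_stage_apply and the resource address in the example are assumptions, so use the address of the cluster resource in your own configuration. Terraform itself warns that -target is meant for exceptional use.

```shell
# Sketch: create the cluster first, then run a second apply for everything
# that depends on the Kubernetes provider.
# two_stage_apply is a hypothetical helper; the target address is supplied
# by the caller and must match your own configuration.
two_stage_apply() {
  cluster_target="$1"
  # Stage 1: only the cluster resource and whatever it depends on.
  terraform apply -target="$cluster_target" || return 1
  # Stage 2: everything else, once the cluster endpoint and token exist.
  terraform apply
}

# Example (hypothetical address):
# two_stage_apply module.eks.aws_eks_cluster.this
```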
@cdaniluk if you have the EKS cluster resource being created or updated in the same apply operation as the Kubernetes provider, things won't work as you expect. This is due to an issue in Terraform itself.
I'm aware of the limitation, which is why it's all the more confusing that this module is basically making it impossible to bootstrap a cluster, all in the name of loading a simple configmap that is easily loaded by hand. In previous versions, you could use a null provider to script injecting the config map, and it all worked just fine. Now, not only does the k8s provider behave inconsistently (as indicated in like 50 open issues in that repo, some of which are due to tf and some of which are due to the provider itself), but we can't bootstrap a new cluster. The old version of the module allowed this.
I really think adding a flag to dump the config map to filesystem instead of trying to load it via provider would be ideal. I don't think the tf issues are going to go away anytime soon, as any issues open for dynamic provider configs, interpolation, etc have been open forever and aren't on any roadmap.
In case it's of use to anyone, I'm using the following to generate the aws-auth configmap in conjunction with manage_aws_auth = false:
# Read the worker IAM role ARN from the terraform state.
worker_iam_role="$(terraform state pull | jq -r '.resources[] | select(.type=="aws_iam_role") | select(.name=="workers") | .instances[0].attributes.arn')"
# Write the aws-auth ConfigMap; $worker_iam_role is expanded by the shell.
cat << YAML > aws-auth-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: $worker_iam_role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
YAML
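To apply the generated file without touching whatever the default kubeconfig context happens to be, something like the following may work. It is a sketch, not a tested recipe: apply_aws_auth is a made-up helper name, and the cluster name and kubeconfig path are placeholders you supply yourself.

```shell
# Write a dedicated kubeconfig for one cluster and apply the manifest
# through it, so the default kubeconfig context is never used.
# apply_aws_auth is a hypothetical helper; all three arguments are
# supplied by the caller.
apply_aws_auth() {
  cluster_name="$1"   # name of the EKS cluster
  kubeconfig="$2"     # path of the kubeconfig file to write and use
  manifest="$3"       # e.g. aws-auth-configmap.yaml
  aws eks update-kubeconfig --name "$cluster_name" --kubeconfig "$kubeconfig" || return 1
  kubectl --kubeconfig "$kubeconfig" apply -f "$manifest"
}

# Example (placeholder values):
# apply_aws_auth my-cluster ./kubeconfig_my-cluster aws-auth-configmap.yaml
```

Passing --kubeconfig explicitly to both commands is the point here: neither step depends on KUBECONFIG or on the current context.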
@cdaniluk @cmrust it sounds like version 1.11.1 solves this problem. Can you confirm, please?
See also https://github.com/terraform-aws-modules/terraform-aws-eks/pull/784