Terraform-aws-eks: Null Resource Update Config Map AWS Auth

Created on 29 Nov 2018 · 25 Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

I'm submitting a...

[x] bug report

What is the current behavior?

Every time I create a cluster, I get this error message:

* module.eks.null_resource.update_config_map_aws_auth: Error running command
  'kubectl apply -f /k8s/config-map-aws-auth_cluster.yaml --kubeconfig /k8s/kubeconfig_cluster': exit status 1.
  Output: Unable to connect to the server: dial tcp xxx.20.110.151:443: i/o timeout

If I run terraform apply again, it works without any error.

If this is a bug, how to reproduce? Please include a code sample if relevant.

Create a cluster using the module.

What's the expected behavior?

No error message while creating the cluster.

Are you able to fix this problem and submit a PR? Link here if you have already.

I do not know what causes the issue.

Environment details

Affected module version: 1.7.0
OS: Ubuntu
Terraform version:

Terraform v0.11.10
+ provider.aws v1.48.0
+ provider.local v1.1.0
+ provider.null v1.0.0
+ provider.template v1.0.0

Source

Jeeppler

All 25 comments

OK, it looks like a problem that https://github.com/terraform-aws-modules/terraform-aws-eks/pull/187 will solve.

max-rocket-internet on 30 Nov 2018

👍1

I upgraded to 1.8.0, but then it just silently fails:

module.eks.null_resource.update_config_map_aws_auth: Provisioning with 'local-exec'...
module.eks.null_resource.update_config_map_aws_auth (local-exec): Executing: ["/bin/sh" "-c" "for i in {1..5}; do kubectl apply -f /k8s/config-map-aws-auth_cluster.yaml --kubeconfig /k8s/kubeconfig_cluster && break || sleep 10; done"]
module.eks.aws_iam_role_policy_attachment.workers_autoscaling: Creation complete after 1s (ID: cluster20181207234911216400000003-2018120723491235770000000a)

The effect is that my worker nodes do not join the cluster:

$ kubectl get nodes
No resources found.

In version 1.7.0 it throws an error:

Releasing state lock. This may take a few moments...

Error: Error applying plan:

1 error(s) occurred:

* module.eks.null_resource.update_config_map_aws_auth: Error running command 'kubectl apply -f /k8s/config-map-aws-auth_cluster.yaml --kubeconfig /k8s/kubeconfig_cluster': exit status 1. Output: Unable to connect to the server: dial tcp 23.20.165.97:443: i/o timeout

If I run terraform apply again, it works, and in the end all nodes join the cluster.

Jeeppler on 8 Dec 2018

I still don't understand how so many people have this error 😆 Are you on some unreliable wifi or something?

max-rocket-internet on 11 Dec 2018

😄1

No, I am connected via ethernet and fiber.

Maybe using the Kubernetes Provider instead of the null resource can help with this issue.

Jeeppler on 11 Dec 2018

@max-rocket-internet Another thing I realized is that the issues started the moment I upgraded from Terraform 0.11.8 to 0.11.10. This could be a coincidence, because I changed other things as well.

Jeeppler on 11 Dec 2018

Maybe using the Kubernetes Provider instead of the null resource can help with this issue.

Does it have some retry or timeout logic? Unfortunately, I don't think kubectl itself has any logic like this that we could use.

started the moment I upgraded from Terraform 0.11.8 to 0.11.10

Could be, but I think the error message i/o timeout comes directly from kubectl?

I'm open to new solutions. Basically, we need the loop to exit with a non-zero status if it reaches the end.
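
A minimal sketch of a loop with that property, written as a plain POSIX sh function. The helper name `retry` and its argument order are hypothetical and not part of the module. It also avoids a weakness of the `cmd && break || sleep 10` form: when the last attempt fails, the last command executed is `sleep`, so the loop as a whole exits with status 0 and Terraform reports success.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to ATTEMPTS times and sleeps
# DELAY_SECONDS between failed attempts. Returns 0 on the first success
# and 1 if every attempt failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  # A counter is used instead of {1..N}, which is not POSIX sh syntax.
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $n/$attempts failed: $*" >&2
    # No need to wait after the final attempt.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}

# Example call, using the paths from the error message in this issue:
# retry 5 10 kubectl apply -f /k8s/config-map-aws-auth_cluster.yaml \
#   --kubeconfig /k8s/kubeconfig_cluster
```

With local-exec, the function and the call would have to live in the same command string or in a script file; which of the two fits the module better is left open here.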

max-rocket-internet on 12 Dec 2018

I did not have the issue with module version 2.0.

Jeeppler on 24 Dec 2018

Still an issue @Jeeppler ?

max-rocket-internet on 2 Jan 2019

@max-rocket-internet it does not seem to be an issue any more. You can close this issue.

Jeeppler on 3 Jan 2019

🎉1

OK great!

max-rocket-internet on 4 Jan 2019

👍1

@max-rocket-internet thanks for checking back.

Jeeppler on 4 Jan 2019

Hi, I'm facing the same issue. Slightly different timeout message, but essentially the same issue.

Here's the exact error message:

module.eks.null_resource.update_config_map_aws_auth (local-exec): error: unable to recognize "./config-map-aws-<redacted>.yaml": Get https://<redacted>.yl4.eu-west-1.eks.amazonaws.com/api?timeout=32s: net/http: TLS handshake timeout

It's definitely not related to a network issue (wifi or otherwise) as I've experienced it when connected to different networks.

To be honest, I haven't tried this on any machine other than my laptop.

I execute terraform within a container.

OS: mac os x 10.13.6
Terraform: 0.11.11
Terraform AWS provider plugin: 1.54.0
Docker Desktop version: Community 2.0.0.0-mac81 (29211)

I am currently trying to test increasing the number of retries (e.g. from 5 to 50) in the aforementioned loop, i.e. for i in {1..5}; do kubectl apply [...]

Further information:
If I taint the resource (terraform taint -module=eks null_resource.update_config_map_aws_auth) and run terraform apply again, the operation succeeds almost immediately.

Update: setting the retries to 50 didn't help; the timeout happens exactly as before.

marcelloromani on 8 Jan 2019

👍1

setting the retries to 50 didn't help: the timeout happens exactly as before.

Interesting. What could possibly be the problem then?
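
One possibility worth ruling out, offered as an assumption rather than a diagnosis: the provisioner runs the loop through /bin/sh, and brace ranges such as {1..5} are not part of POSIX sh. Shells like dash (the default /bin/sh on Ubuntu) leave the text unexpanded, so the loop body runs once no matter how large the upper bound is. The check below counts the iterations in whatever shell executes it.

```shell
# Count how many times a brace-range loop actually iterates.
# In a shell without brace expansion, i is the literal string "{1..5}"
# and the body runs exactly once.
count=0
for i in {1..5}; do
  count=$((count + 1))
done
echo "iterations: $count"
```

bash and zsh report 5 iterations; dash reports 1. Running the snippet with the same interpreter the provisioner uses shows which case applies.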

max-rocket-internet on 8 Jan 2019

Just like @marcelloromani, I run Terraform in a container. However, I run it in Docker on Linux Mint and use Ubuntu 18.04 LTS (Bionic) as the container base image.

After building up and tearing down EKS clusters yesterday, I can confirm @marcelloromani's issue. What I observed is that it sometimes works out of the box and sometimes fails. It is like flipping a coin.

This is what I receive if it does not work:

$ kubectl get nodes
No resources found.

This is what I see in the Terraform output:

module.eks.null_resource.update_config_map_aws_auth: Still creating... (30s elapsed)
module.eks.null_resource.update_config_map_aws_auth (local-exec): Unable to connect to the server: dial tcp 18.215.4.135:443: i/o timeout
...
module.eks.null_resource.update_config_map_aws_auth: Creation complete after 41s (ID: 6879543812975939879)

Furthermore, it seems like the module.eks.null_resource.update_config_map_aws_auth is running before the worker is available:

module.eks.null_resource.update_config_map_aws_auth: Still creating... (10s elapsed)
module.eks.aws_launch_configuration.workers: Still creating... (10s elapsed)
module.module.eks.aws_launch_configuration.workers: Creation complete after 11s (ID: mycluster-worker_group_mycluster2019010816241860230000000d)
module.module.eks.aws_autoscaling_group.workers: Creating...
  arn:                            "" => "<computed>"
  ...
module.eks.null_resource.update_config_map_aws_auth: Still creating... (20s elapsed)
module.eks.aws_autoscaling_group.workers: Still creating... (10s elapsed)
module.eks.null_resource.update_config_map_aws_auth: Still creating... (30s elapsed)
module.module.eks.null_resource.update_config_map_aws_auth (local-exec): Unable to connect to the server: dial tcp 18.215.4.135:443: i/o timeout
module.eks.aws_autoscaling_group.workers: Still creating... (20s elapsed)
module.eks.null_resource.update_config_map_aws_auth: Still creating... (40s elapsed)
module.module.eks.null_resource.update_config_map_aws_auth: Creation complete after 41s (ID: 6879543812975939879)
module.eks.aws_autoscaling_group.workers: Still creating... (30s elapsed)
module.eks.aws_autoscaling_group.workers: Still creating... (40s elapsed)
module.module.eks.aws_autoscaling_group.workers: Creation complete after 40s (ID: mycluster-worker_group_mycluster2019010816242854200000000e)

I am not sure if it is supposed to be like that. My mental model is that the auto scaling group and the workers have to be up and running before we can run the config map null resource.

In addition, is there any reason this module uses a null provider and a local exec resource instead of the kubernetes provider resource for the config map?

Jeeppler on 8 Jan 2019

@max-rocket-internet please reopen this issue. I thought it was fixed, but more testing revealed that it is not.

Jeeppler on 8 Jan 2019

Yes sir!

max-rocket-internet on 10 Jan 2019

👍1

@Jeeppler Are you using zsh by any chance?

Have you had a look at my PR https://github.com/terraform-aws-modules/terraform-aws-eks/pull/245

marcelloromani on 10 Jan 2019

In addition, is there any reason this module uses a null provider and a local exec resource instead of the kubernetes provider resource for the config map?

@Jeeppler I had the same question.

marcelloromani on 10 Jan 2019

@marcelloromani I do not use ZSH. I use Bash. However, we both have in common that we run Terraform in Docker containers. Maybe there is some networking issue.

Jeeppler on 10 Jan 2019

@Jeeppler In order to try and debug the cause of the nodes not joining the cluster, I read the boot log of the (2 in my case) EC2 instances, and found a lot of "Unauthorized" entries.

In my particular case, the issue was that the aws auth config map wasn't applied to the cluster, which doesn't seem to be the case for you.
Thought I'd mention it anyway in the hope that it helps.

marcelloromani on 10 Jan 2019

Closing after no update in a long time. Feel free to reopen 🙂

max-rocket-internet on 7 May 2019

@max-rocket-internet I think I am running into a similar issue

Using:
Terraform version 0.11.13
terraform-aws-modules EKS version 4.0.2
terraform-aws-modules VPC version 1.64.0

When I run Terraform apply I see this output (Terraform returns a success btw, no error code).

module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (30s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): Unable to connect to the server: dial tcp XXX.XXX.238.240:443: i/o timeout
module.eks.eks.aws_autoscaling_group.workers_launch_template: Still creating... (30s elapsed)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (40s elapsed)
module.eks.eks.aws_autoscaling_group.workers_launch_template: Still creating... (40s elapsed)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (50s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.aws_autoscaling_group.workers_launch_template: Still creating... (50s elapsed)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (1m0s elapsed)
module.eks.module.eks.aws_autoscaling_group.workers_launch_template: Creation complete after 56s (ID: xxx-dev-eks-02019050919521734110000000f)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (1m10s elapsed)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (1m20s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (1m30s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (1m40s elapsed)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (1m50s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (2m0s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (2m10s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (2m20s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (2m30s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (2m40s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth (local-exec): error: You must be logged in to the server (the server has asked for the client to provide credentials)
module.eks.eks.null_resource.update_config_map_aws_auth: Still creating... (2m50s elapsed)
module.eks.module.eks.null_resource.update_config_map_aws_auth: Creation complete after 2m58s (ID: 4109496625117494020)

Running kubectl get nodes returns No resources found.

Not sure how to solve this, but I think it's a related issue?

bishtawi on 9 May 2019

@bishtawi Are you sure you are providing the correct KUBECONFIG?

marcelloromani on 10 May 2019

~~Pretty sure?~~

Right after I run terraform apply I can run

aws --profile xxx-dev eks update-kubeconfig --name xxx-dev-eks
kubectl apply -f config-map-aws-auth_xxx-dev-eks.yaml

with no errors. And that gets the worker nodes to connect to the cluster.

And terraform is using the same aws profile

EDIT: ok, no. Running kubectl get nodes --kubeconfig xxx-dev-eks gets me the You must be logged in to the server (Unauthorized) error.

Comparing the kubeconfig written by aws eks update-kubeconfig with Terraform's kubeconfig, it looks like I am missing:

      env:
      - name: AWS_PROFILE
        value: xxx-dev

And of course there's a Terraform option for that: kubeconfig_aws_authenticator_env_variables.
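
A quick way to confirm the difference is a text-level check of the generated kubeconfig. The sketch below assumes the env entry is laid out as in the snippet above. The function name `kubeconfig_has_env` is hypothetical, and the grep does not parse YAML, so it would also match any other list item of the form `- name: AWS_PROFILE`.

```shell
#!/bin/sh
# kubeconfig_has_env FILE NAME
# Hypothetical helper: succeeds if FILE contains a list item "- name: NAME",
# which is how the env entry for the authenticator appears in a kubeconfig.
kubeconfig_has_env() {
  pattern="^[[:space:]]*-[[:space:]]*name:[[:space:]]*${2}[[:space:]]*\$"
  grep -Eq "$pattern" "$1"
}

# Example call, using the kubeconfig path from the original error message:
# kubeconfig_has_env /k8s/kubeconfig_cluster AWS_PROFILE \
#   || echo "AWS_PROFILE is not set in the kubeconfig" >&2
```

If the check fails for the Terraform-generated file but passes for the one written by aws eks update-kubeconfig, the missing profile is a likely cause of the Unauthorized errors.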

bishtawi on 10 May 2019

🎉1 👍1

@Jeeppler I saw this issue as well when I ran terraform from within a container using https://github.com/toolbox-cli/toolbox, even though it had been working for weeks. I switched to a local install of terraform v0.11.13 until I debug the toolbox container configuration.

The only recent change is that we nested our terraform code one module level deeper, if that makes sense to you. Perhaps that, in combination with the container, is the issue. As I mentioned above, this setup was working for me until we introduced more nested modules. Perhaps it's a shell issue within my container.

This issue seems related to https://github.com/terraform-aws-modules/terraform-aws-eks/issues/341, though