Terraform-aws-eks: Incomplete documentation: Can't create LoadBalancer (EXTERNAL-IP <pending>)

Created on 9 Feb 2019 · 13 Comments · Source: terraform-aws-modules/terraform-aws-eks

I have issues

Incomplete documentation

I'm submitting a...

  • [X] bug report
  • [ ] feature request
  • [ ] support request
  • [X] kudos, thank you, warm fuzzy

What is the current behavior?

Applying a standard Terraform set of VPC and EKS definitions erases the VPC and subnet tags added by AWS, which leaves Kubernetes unable to create load balancers.

If this is a bug, how to reproduce? Please include a code sample if relevant.

Consider a standard VPC and EKS definition like this:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "foo-vpc"
  cidr = "10.0.0.0/16"

  azs = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.64.0/19", "10.0.96.0/19", "10.0.128.0/19"]
  public_subnets = ["10.0.160.0/19", "10.0.192.0/19", "10.0.224.0/19"]

  enable_nat_gateway = true
  enable_vpn_gateway = true
  enable_dns_hostnames = true
}

module "foo-cluster" {
  cluster_name = "foo-cluster"
  source = "terraform-aws-modules/eks/aws"
  subnets = ["${module.vpc.public_subnets}"]
  vpc_id = "${module.vpc.vpc_id}"
  worker_groups = [
    {
      asg_max_size = 2
      instance_type = "t3.small"
    }
  ]
}

Run terraform apply

Run terraform apply again. This time it wants to make these changes:

  ~ module.foo.module.vpc.aws_subnet.public[0]
      tags.%:                                    "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""

  ~ module.foo.module.vpc.aws_subnet.public[1]
      tags.%:                                    "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""

  ~ module.foo.module.vpc.aws_subnet.public[2]
      tags.%:                                    "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""

  ~ module.foo.module.vpc.aws_vpc.this
      tags.%:                                    "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""

Removal of these tags results in EKS being unable to create new load balancers.

Example:

$ kubectl create -f rest-api-deployment.yaml 
deployment.extensions/rest-api created
service/rest-api created

$ kubectl get services
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)        AGE
kubernetes   ClusterIP      1.2.3.4          <none>                                                                    443/TCP        1h
rest-api     LoadBalancer   5.6.7.8          ae957d7b32bf611e9b5180a7f93f502e-1107504528.eu-west-1.elb.amazonaws.com   80:31137/TCP   3s

Then, after Terraform removes the tags:

$ kubectl delete -f rest-api-deployment.yaml 
deployment.extensions "rest-api" deleted
service "rest-api" deleted

$ kubectl create -f rest-api-deployment.yaml 
deployment.extensions/rest-api created
service/rest-api created

$ kubectl get services
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      1.2.3.4         <none>        443/TCP        1h
rest-api     LoadBalancer   5.6.7.8         <pending>     80:32672/TCP   4m

The LoadBalancer now stays in pending state indefinitely.

What's the expected behavior?

Should work out of the box.

Any other relevant info

After many hours fiddling around and wasting time comparing resources I realized what was happening. I then found this comment in another issue that describes a workaround: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/183#issuecomment-444658609

While I realize that the removal of these tags is triggered by the VPC module and not this one, I expect EKS clusters provisioned with this module to work out of the box, but LoadBalancers do not work without further configuration. I think this needs to be highlighted in the documentation.

Stellar module otherwise! Just wish I hadn't lost this much time on it.

All 13 comments

For reference, here is AWS's documentation on all required tags: https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html

Relevant discussion: https://github.com/terraform-aws-modules/terraform-aws-vpc/issues/188

Related, non-generic workaround in a test fixture in this repo that requires also configuring local variables:
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/cddac92757eb41a2069ba6fc02267cbf622e31a4/examples/eks_test_fixture/main.tf#L126

Since the above is too specific, I am now doing this as a workaround:

locals {
  eks_cluster_name = "foo-cluster"
}

Then in the VPC definition:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
  }
  public_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
  }
  vpc_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
  }

  // ...
}
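
One way to keep the tag key and the cluster name from drifting apart is to feed the same local into the EKS module. This is only a sketch: it reuses the module definition from the original report, in the same Terraform 0.11-style interpolation syntax, and assumes the `locals` block and `module "vpc"` shown above.

```hcl
module "foo-cluster" {
  source = "terraform-aws-modules/eks/aws"

  # Same local that builds the kubernetes.io/cluster/<name> tag keys in the
  # VPC definition, so the tags always match the cluster's actual name.
  cluster_name = "${local.eks_cluster_name}"

  subnets = ["${module.vpc.public_subnets}"]
  vpc_id  = "${module.vpc.vpc_id}"

  worker_groups = [
    {
      asg_max_size  = 2
      instance_type = "t3.small"
    }
  ]
}
```

With this, renaming the cluster means changing a single value rather than four.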

this should be documented somewhere in README

> After many hours fiddling around and wasting time comparing resources I realized what was happening.

Sorry to hear!

> this should be documented somewhere in README

Sure, feel free to make a PR.

Subnet tagging on AWS for Kubernetes (not specific to this module or EKS) for ELB subnets is well known but definitely poorly documented.

Even the AWS doc stating kubernetes.io/role/elb is new to me.

I would advise against @fubar's suggestion as it doesn't distinguish between public and private subnet tags. I think k8s could mix them up with this configuration? I have this:

module "vpc1" {
  source                       = "git@github.com:terraform-aws-modules/terraform-aws-vpc.git?ref=v1.37.0"
...
  private_subnets              = ["${local.private_cidr_blocks}"]
  public_subnets               = ["${local.public_cidr_blocks}"]

  public_subnet_tags = {
    "kubernetes.io/cluster/xxx-xxx01" = "shared"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/xxx-xxx01" = "shared"
    "kubernetes.io/role/internal-elb"    = "true"
  }
}

> I would advise against @fubar's suggestion as it doesn't distinguish between public and private subnet tags. I think k8s could mix them up with this configuration? I have this:

I'm not following. AWS sets the "shared" tag on all subnets that are associated with the EKS cluster, private and public. The solution I posted merely maintains that.

The problem is not with the terraform-aws-eks module. It is the AWS EKS service, not a module, that adds these tags to the selected VPC.

> While I realize that the removal of these tags is triggered by the VPC module and not this one, I expect EKS clusters provisioned with this module to work out of the box

Again, I don't think it's the responsibility of this module to take care of your VPC. I do think it should be documented in this module.

Weird, I've never seen tags placed on VPC resources by EKS service itself.

I just checked my VPCs for 3 clusters and there's no tag for "kubernetes.io/cluster/xxx" = "shared" at all. Also no tags on private subnets either. I only tag the public subnets in TF so that ELBs created by EKS have the right subnets.

@fubar

> I'm not following. AWS sets the "shared" tag on all subnets that are associated with the EKS cluster, private and public. The solution I posted merely maintains that.

Does AWS add tags to distinguish between public and private subnets though? I've never seen this. What I mean is that if you have the "kubernetes.io/cluster/xxx" = "shared" tag on both public and private subnets, then an ELB created by k8s could have mixed public/private subnets associated with it.

EKS adds tags to VPC and the subnets as these are required for the Kubernetes AWS provider to function correctly. Without tags you cannot get the most out of the LoadBalancer Service type.

The logic for figuring out in which subnets to create an ELB lives in the Kubernetes AWS cloud provider.

Basically:

  • Find valid subnets based on tags. If no validly tagged subnets are found, it uses the current subnet. This may not be what you want, and one day it will be an error.
  • For public load balancers: filter out subnets that do not have public routing rules (no igw).
  • If there are multiple subnets in the same AZ (this is invalid for ELBs):
    • Pick the subnet with the tag indicating use for the ELB type: kubernetes.io/role/internal-elb for internal and kubernetes.io/role/elb for public.
    • If the tags do not exist or do not resolve the issue: pick the first ID lexicographically.

Public ELBs always get created in subnets with an igw route defined.
Internal ELBs may end up in public or private subnets depending on what tags you have and the lexical sorting order of the subnets' IDs per AZ.
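
To make that selection depend less on subnet ID ordering, the role tags could be declared next to the cluster tag. A minimal sketch, assuming the VPC definition from the original report and the `local.eks_cluster_name` workaround posted earlier. The tag value `1` follows the AWS documentation linked above (an earlier comment in this thread uses `true`); treat it as an assumption, not something verified here.

```hcl
locals {
  eks_cluster_name = "foo-cluster"
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "foo-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.64.0/19", "10.0.96.0/19", "10.0.128.0/19"]
  public_subnets  = ["10.0.160.0/19", "10.0.192.0/19", "10.0.224.0/19"]

  enable_nat_gateway   = true
  enable_dns_hostnames = true

  # Public subnets: cluster tag plus the role tag for internet-facing ELBs.
  public_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
    "kubernetes.io/role/elb"                          = "1"
  }

  # Private subnets: cluster tag plus the role tag for internal ELBs.
  private_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"                 = "1"
  }
}
```

Declaring the tags in Terraform also means a later terraform apply no longer plans to remove them from the subnets.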

All,

I've run into this same issue, but for completely different reasons, and I think there's something else going on. My subnets are all tagged properly, as is my VPC. I actually get the error from having an additional security group set via the worker_additional_security_group_ids variable. I am curious whether anyone else sees the same issue.

I have it set to add the default VPC security group; I use this security group for VPN access. It's set to the following.

worker_additional_security_group_ids = ["${data.terraform_remote_state.vpc.default_security_group_id}"]

After spinning up a fresh cluster with this EKS module, I install a test service like the following:

---
apiVersion: v1
kind: Service
metadata:
  name: nginxhello
  namespace: default
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  selector:
    app: nginxhello
  ports:
  - protocol: TCP
    port: 80
    name: http
    targetPort: 80
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginxhello
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginxhello
    spec:
      containers:
      - image: nginxdemos/hello
        name: nginxhello
        ports:
        - containerPort: 80
          name: http
        livenessProbe:
          httpGet:
            path: /
            port: http
          initialDelaySeconds: 3
          periodSeconds: 3

Afterwards, I get the <pending> status from running kubectl get svc, even though AWS creates the load balancer just fine. Running a kubectl cluster-info dump, I search for the string CreatingLoadBalancerFailed, and see the following in the dump:

{
    "metadata": {
        "name": "nginxhello.158edc3d1f176193",
        "namespace": "default",
        "selfLink": "/api/v1/namespaces/default/events/nginxhello.158edc3d1f176193",
        "uid": "5e0e6cc7-4e1a-11e9-91f8-0a68d9cea174",
        "resourceVersion": "6326",
        "creationTimestamp": "2019-03-24T09:51:16Z"
    },
    "involvedObject": {
        "kind": "Service",
        "namespace": "default",
        "name": "nginxhello",
        "uid": "59ae1002-4e1a-11e9-91f8-0a68d9cea174",
        "apiVersion": "v1",
        "resourceVersion": "6290"
    },
    "reason": "CreatingLoadBalancerFailed",
    "message": "Error creating load balancer (will retry): failed to ensure load balancer for service default/nginxhello: Multiple tagged security groups found for instance i-028eb27d4d43b1ecf; ensure only the k8s security group is tagged; the tagged groups were sg-0bbee8c0e6cf2ed5c(my-eks-cluster20190324084411297900000007) sg-07773f214e065ee54(default) ",
    "source": {
        "component": "service-controller"
    },
    "firstTimestamp": "2019-03-24T09:51:16Z",
    "lastTimestamp": "2019-03-24T09:51:16Z",
    "count": 1,
    "type": "Warning",
    "eventTime": null,
    "reportingComponent": "",
    "reportingInstance": ""
}

If I comment out the worker_additional_security_group_ids, run terraform apply again, destroy all of the old nodes, let the new nodes spin up, and then delete and re-create the YAML manifest from above, it works fine.

The default security group was tagged with kubernetes.io/cluster/my-eks-cluster => shared and the worker security group created by this module was tagged with kubernetes.io/cluster/my-eks-cluster => owned. Removing the kubernetes.io/cluster/my-eks-cluster => shared tag from the default security group seems to do the trick.

Can anyone else recreate this?

> Error creating load balancer (will retry): failed to ensure load balancer for service default/nginxhello: Multiple tagged security groups found for instance

Interesting. Never seen this error before! Thanks for the info.

@max-rocket-internet I found the problem. The error above was correct, but it came from the official VPC module. EKS automatically tags your VPC with the name of the cluster, but I was also adding the tag to vpc_tags in that module. Putting it there _also_ tags all of your subnets, and the default security group. Posting it here in case anyone runs into the same problem.

Closing after no update in a long time. Feel free to reopen 🙂

For anyone else who has this problem, just try deleting and recreating the service. Worked the second time. :0
