Incomplete documentation
Applying a standard Terraform set of VPC and EKS definitions erases the VPC and subnet tags added by AWS, leaving Kubernetes unable to create load balancers.
Consider a standard VPC and EKS definition like this:
module "vpc" {
  source               = "terraform-aws-modules/vpc/aws"
  name                 = "foo-vpc"
  cidr                 = "10.0.0.0/16"
  azs                  = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets      = ["10.0.64.0/19", "10.0.96.0/19", "10.0.128.0/19"]
  public_subnets       = ["10.0.160.0/19", "10.0.192.0/19", "10.0.224.0/19"]
  enable_nat_gateway   = true
  enable_vpn_gateway   = true
  enable_dns_hostnames = true
}

module "foo-cluster" {
  cluster_name = "foo-cluster"
  source       = "terraform-aws-modules/eks/aws"
  subnets      = ["${module.vpc.public_subnets}"]
  vpc_id       = "${module.vpc.vpc_id}"

  worker_groups = [
    {
      asg_max_size  = 2
      instance_type = "t3.small"
    },
  ]
}
Run terraform apply
Run terraform apply again. This time it wants to make these changes:
  ~ module.foo.module.vpc.aws_subnet.public[0]
      tags.%: "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""
  ~ module.foo.module.vpc.aws_subnet.public[1]
      tags.%: "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""
  ~ module.foo.module.vpc.aws_subnet.public[2]
      tags.%: "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""
  ~ module.foo.module.vpc.aws_vpc.this
      tags.%: "2" => "1"
      tags.kubernetes.io/cluster/foo-cluster: "shared" => ""
Removal of these tags results in EKS being unable to create new load balancers.
Example:
$ kubectl create -f rest-api-deployment.yaml
deployment.extensions/rest-api created
service/rest-api created
$ kubectl get services
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP                                                               PORT(S)        AGE
kubernetes   ClusterIP      1.2.3.4      <none>                                                                    443/TCP        1h
rest-api     LoadBalancer   5.6.7.8      ae957d7b32bf611e9b5180a7f93f502e-1107504528.eu-west-1.elb.amazonaws.com   80:31137/TCP   3s
Then, after Terraform removes the tags:
$ kubectl delete -f rest-api-deployment.yaml
deployment.extensions "rest-api" deleted
service "rest-api" deleted
$ kubectl create -f rest-api-deployment.yaml
deployment.extensions/rest-api created
service/rest-api created
$ kubectl get services
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      1.2.3.4      <none>        443/TCP        1h
rest-api     LoadBalancer   5.6.7.8      <pending>     80:32672/TCP   4m
The LoadBalancer now stays in pending state indefinitely.
Should work out of the box.
After many hours fiddling around and wasting time comparing resources I realized what was happening. I then found this comment in another issue that describes a workaround: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/183#issuecomment-444658609
While I realize that the removal of these tags is triggered by the VPC module and not this one, I expect EKS clusters provisioned with this module to work out of the box, but LoadBalancers do not work without further configuration. I think this needs to be highlighted in the documentation.
Stellar module otherwise! Just wish I hadn't lost this much time on it.
For reference, here is AWS's documentation on all required tags: https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html
Relevant discussion: https://github.com/terraform-aws-modules/terraform-aws-vpc/issues/188
Related, non-generic workaround in a test fixture in this repo that requires also configuring local variables:
https://github.com/terraform-aws-modules/terraform-aws-eks/blob/cddac92757eb41a2069ba6fc02267cbf622e31a4/examples/eks_test_fixture/main.tf#L126
Since the above is too specific, I am now doing this as a workaround:
locals {
  eks_cluster_name = "foo-cluster"
}
Then in the VPC definition:
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
  }

  vpc_tags = {
    "kubernetes.io/cluster/${local.eks_cluster_name}" = "shared"
  }

  // ...
}
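One way to keep the tag keys and the cluster name from drifting apart might be to feed the same local into the EKS module too. A minimal sketch, assuming the `eks_cluster_name` local defined above and the module arguments from the original definition (Terraform 0.11 interpolation syntax); worker groups are left out for brevity:

```hcl
module "foo-cluster" {
  source = "terraform-aws-modules/eks/aws"

  # Reuse the local that the VPC module interpolates into its
  # "kubernetes.io/cluster/<name>" tag keys, so both always agree.
  cluster_name = "${local.eks_cluster_name}"

  subnets = ["${module.vpc.public_subnets}"]
  vpc_id  = "${module.vpc.vpc_id}"
}
```

With this, renaming the cluster only requires changing the local in one place.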
this should be documented somewhere in README
> After many hours fiddling around and wasting time comparing resources I realized what was happening.
Sorry to hear!
> this should be documented somewhere in README
Sure, feel free to make a PR.
Subnet tagging on AWS for Kubernetes (not specific to this module or EKS) for ELB subnets is well known but definitely poorly documented.
Even the AWS doc stating kubernetes.io/role/elb is new to me.
I would advise against @fubar's suggestion as it doesn't distinguish between public and private subnet tags. I think k8s could mix them up with this configuration? I have this:
module "vpc1" {
  source = "git@github.com:terraform-aws-modules/terraform-aws-vpc.git?ref=v1.37.0"
  ...
  private_subnets = ["${local.private_cidr_blocks}"]
  public_subnets  = ["${local.public_cidr_blocks}"]

  public_subnet_tags = {
    "kubernetes.io/cluster/xxx-xxx01" = "shared"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/xxx-xxx01" = "shared"
    "kubernetes.io/role/internal-elb" = "true"
  }
}
> I would advise against @fubar's suggestion as it doesn't distinguish between public and private subnet tags. I think k8s could mix them up with this configuration? I have this:
I'm not following. AWS sets the "shared" tag on all subnets that are associated with the EKS cluster, private and public. The solution I posted merely maintains that.
The problem is not with the terraform-aws-eks module. It's the AWS EKS service that adds these tags to the selected VPC, not a module.
> While I realize that the removal of these tags is triggered by the VPC module and not this one, I expect EKS clusters provisioned with this module to work out of the box
Again, I don't think it's the responsibility of this module to take care of your VPC. I do think it should be documented in this module.
Weird, I've never seen tags placed on VPC resources by the EKS service itself.
I just checked my VPCs for 3 clusters and there's no tag for "kubernetes.io/cluster/xxx" = "shared" at all. Also no tags on private subnets either. I only tag the public subnets in TF so that ELBs created by EKS have the right subnets.
@fubar
> I'm not following. AWS sets the "shared" tag on all subnets that are associated with the EKS cluster, private and public. The solution I posted merely maintains that.
Does AWS add tags to distinguish between public and private subnets though? I've never seen this. What I mean is that if you have the "kubernetes.io/cluster/xxx" = "shared" tag on both public and private subnets, then an ELB created by k8s could have mixed public/private subnets associated with it.
EKS adds tags to VPC and the subnets as these are required for the Kubernetes AWS provider to function correctly. Without tags you cannot get the most out of the LoadBalancer Service type.
Logic for figuring out in which subnets to create an ELB can be found here.
Basically:
igw)
kubernetes.io/role/internal-elb for internal and kubernetes.io/role/elb for public
Public ELBs always get created in subnets with an igw route defined.
Internal ELBs may end up in public or private subnets depending on what tags you have and the lexical sorting order of the subnets' IDs per AZ.
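To make that choice explicit instead of leaving it to sort order, the role tags could be set per subnet type in the VPC module. This is only a sketch: it assumes the `public_subnet_tags` and `private_subnet_tags` inputs used earlier in this thread and a cluster named `foo-cluster`. The value `"1"` follows the AWS documentation linked above, while another post in this thread uses `"true"`.

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  # name, cidr, azs and subnet CIDRs as in the original definition

  public_subnet_tags = {
    "kubernetes.io/cluster/foo-cluster" = "shared"

    # Marks these subnets as candidates for internet-facing ELBs.
    "kubernetes.io/role/elb" = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/foo-cluster" = "shared"

    # Marks these subnets as candidates for internal ELBs.
    "kubernetes.io/role/internal-elb" = "1"
  }
}
```

Tagging each subnet type with its own role tag should keep internal ELBs out of the public subnets even though both sets carry the same cluster tag.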
All,
I've run into this same issue, but for completely different reasons, and I think there's something else going on. My subnets are all tagged properly, as is my VPC. I actually get the error from having an additional security group set via the worker_additional_security_group_ids variable. I am curious whether anyone else sees this same issue.
I have it set to add the default VPC security group; I use this security group for VPN access. It's set to the following.
worker_additional_security_group_ids = ["${data.terraform_remote_state.vpc.default_security_group_id}"]
After spinning up a fresh cluster with this EKS module, I install a test service like the following:
---
apiVersion: v1
kind: Service
metadata:
  name: nginxhello
  namespace: default
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  selector:
    app: nginxhello
  ports:
    - protocol: TCP
      port: 80
      name: http
      targetPort: 80
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginxhello
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginxhello
    spec:
      containers:
        - image: nginxdemos/hello
          name: nginxhello
          ports:
            - containerPort: 80
              name: http
          livenessProbe:
            httpGet:
              path: /
              port: http
            initialDelaySeconds: 3
            periodSeconds: 3
Afterwards, I get the <pending> status from running kubectl get svc, even though AWS creates the load balancer just fine. Running a kubectl cluster-info dump, I search for the string CreatingLoadBalancerFailed, and see the following in the dump:
{
  "metadata": {
    "name": "nginxhello.158edc3d1f176193",
    "namespace": "default",
    "selfLink": "/api/v1/namespaces/default/events/nginxhello.158edc3d1f176193",
    "uid": "5e0e6cc7-4e1a-11e9-91f8-0a68d9cea174",
    "resourceVersion": "6326",
    "creationTimestamp": "2019-03-24T09:51:16Z"
  },
  "involvedObject": {
    "kind": "Service",
    "namespace": "default",
    "name": "nginxhello",
    "uid": "59ae1002-4e1a-11e9-91f8-0a68d9cea174",
    "apiVersion": "v1",
    "resourceVersion": "6290"
  },
  "reason": "CreatingLoadBalancerFailed",
  "message": "Error creating load balancer (will retry): failed to ensure load balancer for service default/nginxhello: Multiple tagged security groups found for instance i-028eb27d4d43b1ecf; ensure only the k8s security group is tagged; the tagged groups were sg-0bbee8c0e6cf2ed5c(my-eks-cluster20190324084411297900000007) sg-07773f214e065ee54(default) ",
  "source": {
    "component": "service-controller"
  },
  "firstTimestamp": "2019-03-24T09:51:16Z",
  "lastTimestamp": "2019-03-24T09:51:16Z",
  "count": 1,
  "type": "Warning",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": ""
}
If I comment out the worker_additional_security_group_ids, run terraform apply again, destroy all of the old nodes, let the new nodes spin up, and then delete and re-create the YAML manifest from above, it works fine.
The default security group was tagged with kubernetes.io/cluster/my-eks-cluster => shared and the worker security group created by this module was tagged with kubernetes.io/cluster/my-eks-cluster => owned. Removing the kubernetes.io/cluster/my-eks-cluster => shared tag from the default security group seems to do the trick.
Can anyone else recreate this?
> Error creating load balancer (will retry): failed to ensure load balancer for service default/nginxhello: Multiple tagged security groups found for instance
Interesting. Never seen this error before! Thanks for the info.
@max-rocket-internet I found the problem. The error above was correct, but it came from the official VPC module. EKS automatically tags your VPC with the name of the cluster, but I was also adding the tag to vpc_tags in that module. Putting it there _also_ tags all of your subnets, and the default security group. Posting it here in case anyone runs into the same problem.
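Going by that report, a variant of the earlier workaround would keep the cluster tag on the subnets only and leave it out of `vpc_tags`. The sketch below rests on that assumption and has not been checked against every version of the VPC module. Note that Terraform may then plan to remove the VPC-level tag that EKS adds, as in the plan output at the top of this issue.

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  # name, cidr, azs and subnet CIDRs as in the original definition

  # Cluster tag on the subnets only, so ELB subnet discovery still works.
  public_subnet_tags = {
    "kubernetes.io/cluster/my-eks-cluster" = "shared"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/my-eks-cluster" = "shared"
  }

  # vpc_tags deliberately omitted: per the comment above, setting the
  # cluster tag there also tagged the default security group, which led to
  # the "Multiple tagged security groups" error.
}
```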
Closing after no update in a long time. Feel free to reopen 🙂
For anyone else who has this problem, just try deleting and recreating the service. It worked the second time. :0