How is cluster DNS supposed to work? I have not been able to get pods to resolve any cluster addresses (including kubernetes.default) on EKS. I suspect it's a function of how the AWS VPC CNI works (or doesn't), and I figured other people using this module must be running into the same problem. However, I can't seem to find much on the internet about this in EKS.
locals {
  worker_groups = "${list(
    map(
      "name", "k8s-worker",
      "ami_id", "ami-73a6e20b",
      "asg_desired_capacity", "5",
      "asg_max_size", "8",
      "asg_min_size", "5",
      "instance_type","m4.large",
      "key_name", "${aws_key_pair.infra-deployer.key_name}"
    ),
  )}"
  tags = "${map("Environment", "${terraform.workspace}")}"
}
data "aws_vpc" "vpc" {
  filter {
    name = "tag:env"
    values = ["${terraform.workspace}"]
  }
  filter {
    name = "tag:Name"
    values = ["${terraform.workspace}-us-west-2"]
  }
}
data "aws_subnet_ids" "eks_subnets" {
  vpc_id = "${data.aws_vpc.vpc.id}"
  tags {
    env = "${terraform.workspace}"
    Name = "${terraform.workspace}-eks*"
  }
}
module "eks" {
  source = "terraform-aws-modules/eks/aws"
  cluster_name = "${terraform.workspace}"
  subnets = "${data.aws_subnet_ids.eks_subnets.ids}"
  vpc_id = "${data.aws_vpc.vpc.id}"
  kubeconfig_aws_authenticator_env_variables = "${map("AWS_PROFILE", "infra-deployer" )}"
  map_accounts = ["${lookup(var.aws_account_ids, "prod")}"]
  worker_groups = "${local.worker_groups}"
  tags = "${local.tags}"
}
Trying DNS on a brand new cluster:
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 172.20.0.10
Address: 172.20.0.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
$ kubectl exec -ti busybox -- cat /etc/resolv.conf
nameserver 172.20.0.10
search default.svc.cluster.local svc.cluster.local cluster.local staging.thinklumo.com us-west-2.compute.internal
options ndots:5
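For readers unfamiliar with how the `search` and `ndots` lines interact, here is a rough sketch of the candidate names a conventional (glibc-style) resolver would try for a short name. The function name `expand_name` is made up for illustration, and busybox's own nslookup may not follow these rules, so treat this as a model of the expected behavior rather than a description of what the busybox pod actually does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names a glibc-style resolver would try, in order.
# Usage: expand_name NAME NDOTS SEARCH_DOMAIN...
expand_name() {
  local name=$1 ndots=$2
  shift 2

  # A trailing dot marks the name as fully qualified: the search list is skipped.
  if [[ $name == *. ]]; then
    echo "$name"
    return
  fi

  # Count the dots in the name by deleting every non-dot character.
  local only_dots=${name//[^.]/}
  local dots=${#only_dots}

  # Names with at least `ndots` dots are tried as-is before the search list.
  if (( dots >= ndots )); then
    echo "$name."
  fi

  # Each search domain is appended in the order listed in resolv.conf.
  local domain
  for domain in "$@"; do
    echo "$name.$domain."
  done

  # Names with fewer than `ndots` dots are tried as-is only as a last resort.
  if (( dots < ndots )); then
    echo "$name."
  fi
}
```

With the resolv.conf above, `expand_name kubernetes.default 5 default.svc.cluster.local svc.cluster.local cluster.local` lists `kubernetes.default.svc.cluster.local.` as the second candidate, which is the name that should resolve.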
Terraform v0.11.7
+ provider.aws v1.25.0
Hey @hobbsh. Thanks for the detailed issue.
I haven't actually tried this myself nor do I suspect the module here is the root cause. I would bet this is beyond the scope of any AWS docs as well.
@ozbillwang and @max-rocket-internet - have either of you done DNS within your EKS cluster yet?
Hey @brandoconnor - I am 99% positive it's not related to the module either but the severe lack of documentation/support for EKS right now is disappointing and made me resort to posting here. I also find it really hard to believe that I'm the only one who has run into this with EKS. Would appreciate anything at this point.
Edit: Maybe you or someone here can shed some light on where the 172.20.x.x addresses are coming from? The issue seems to be rooted there. Even if the userdata.sh used the 10.100.x.x range, there would still be a problem. Is this an auto-assigned cluster/pod CIDR from AWS? Given that the pods/ENIs pull from the same subnet as the nodes, shouldn't the ClusterIPs as well?
First of all, probably nothing to do with this module. Second, if kube-dns is not working, then almost nothing will work.
I tried a test using busybox and get inconsistent results:
$ kubectl run --rm -i --tty --image=busybox temp --restart=Never -- sh
If you don't see a command prompt, try pressing enter.
/ # nslookup kubernetes.default.svc.cluster.local
Server: 172.20.0.10
Address: 172.20.0.10:53
*** Can't find kubernetes.default.svc.cluster.local: No answer
/ #
/ # nslookup kubernetes.default.svc.cluster.local
Server: 172.20.0.10
Address: 172.20.0.10:53
Non-authoritative answer:
Name: kubernetes.default.svc.cluster.local
Address: 172.20.0.1
^C
/ # nslookup ingress1-nginx-ingress-default-backend
Server: 172.20.0.10
Address: 172.20.0.10:53
** server can't find ingress1-nginx-ingress-default-backend: NXDOMAIN
^C
/ # nslookup ingress1-nginx-ingress-default-backend
Server: 172.20.0.10
Address: 172.20.0.10:53
** server can't find ingress1-nginx-ingress-default-backend: NXDOMAIN
^C
/ # nslookup ingress1-nginx-ingress-defa^C
/ # nslookup ingress1-nginx-ingress-default-backend.default.svc.cluster.local
Server: 172.20.0.10
Address: 172.20.0.10:53
Name: ingress1-nginx-ingress-default-backend.default.svc.cluster.local
Address: 172.20.126.10
See it resolves sometimes but not others? Strange. I don't know why.
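One way to put a number on "sometimes" is to repeat the same lookup and count the failures. The sketch below is only an illustration: `count_failures` is a made-up helper, it sticks to plain POSIX sh so it could be pasted into a busybox shell, and it assumes the lookup command exits non-zero on failure, which is worth checking for your nslookup build.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print how many runs failed.
# Usage: count_failures ATTEMPTS COMMAND [ARGS...]
count_failures() {
  attempts=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures"
}
```

For example, `count_failures 20 nslookup kubernetes.default.svc.cluster.local` would print how many of 20 lookups failed, under the exit-status assumption above.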
Anyway, with an ubuntu image, it works fine:
$ kubectl run --rm -i --tty --image=ubuntu temp --restart=Never -- bash
If you don't see a command prompt, try pressing enter.
root@temp:/# getent hosts kubernetes
172.20.0.1 kubernetes.default.svc.cluster.local
root@temp:/# getent hosts ingress1-nginx-ingress-default-backend
172.20.126.10 ingress1-nginx-ingress-default-backend.default.svc.cluster.local
root@temp:/#
@max-rocket-internet First of all thanks for replying and sorry to taint this repo with an unrelated thread - I appreciate the support! This all started with my grafana pod (prometheus-operator) not being able to find the datasource via the prometheus cluster hostname.
I woke up this morning and of course now it's working. However something is still up with busybox. Now it's even more frustrating that there's not a concrete answer for this...
$ kubectl exec -it busybox -- nslookup kubernetes.default
Server: 172.20.0.10
Address: 172.20.0.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
root@kube-prometheus-grafana-7c44bfbb84-5ghc4:/# nslookup kube-prometheus.monitoring
Server: 172.20.0.10
Address: 172.20.0.10#53
Name: kube-prometheus.monitoring.svc.cluster.local
Address: 172.20.49.204
Edit:
I take that back about nothing being changed - I had changed the docker.service file on all the nodes to use the cluster DNS - rebuilding again to verify:
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target docker.socket
Wants=docker.socket
[Service]
Type=notify
Environment=GOTRACEBACK=crash
ExecReload=/bin/kill -s HUP $MAINPID
Delegate=yes
KillMode=process
ExecStart=/usr/bin/dockerd \
--dns 172.20.0.10 \
--dns-search default.svc.cluster.local --dns-search svc.cluster.local --dns-search staging.thinklumo.com \
--dns-opt ndots:3 --dns-opt timeout:2 --dns-opt attempts:2
TasksMax=infinity
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=1min
# restart the docker process if it exits prematurely
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
Sure enough, it works now. Can't explain it, but I'll get off your lawn now. Thanks again.
Thanks for the assist @max-rocket-internet . I think we all lament the need to get support through channels like this... All I ask is we make the most of the situation and share here what was done to resolve the problems when they're resolved, which you've done in spades @hobbsh . Thanks for the wrap up!
@hobbsh - Can the module do anything to lessen the pain/confusion? Should we look to do more in userdata to alter the configurations as you did?
@brandoconnor I have some things I've run into that I will hopefully have time to PR in the next few days - a lot of my customizations have either been custom AMIs or running ansible after the fact and haven't really taken a step back yet. Also, at this point it looks like my docker customizations did not help so :shrug:
So as of right now, thank you to all the contributors for at least making cluster creation painless!
Just a follow-up - I think I was hitting this issue: https://github.com/docker/libnetwork/issues/2187.
FWIW, the DNS service at 172.20.0.10 stems from this file: /etc/eks/bootstrap.sh
.
.
.
INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
DNS_CLUSTER_IP=10.100.0.10
if [[ $INTERNAL_IP == 10.* ]] ; then
  DNS_CLUSTER_IP=172.20.0.10;
fi
.
.
.
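To make the branch quoted above easier to check against your own nodes, here is the same decision pulled out into a standalone function. The name `pick_dns_cluster_ip` is invented for this sketch and is not part of bootstrap.sh; the two addresses and the `10.*` test are taken directly from the excerpt. The instance IP is passed as an argument instead of being fetched from the metadata endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the logic quoted from /etc/eks/bootstrap.sh.
# Usage: pick_dns_cluster_ip NODE_INTERNAL_IP
pick_dns_cluster_ip() {
  local internal_ip=$1
  # Default cluster DNS address, as in the excerpt.
  local dns_cluster_ip=10.100.0.10
  # Nodes whose own address starts with "10." get the 172.20.0.10 address instead.
  if [[ $internal_ip == 10.* ]]; then
    dns_cluster_ip=172.20.0.10
  fi
  echo "$dns_cluster_ip"
}
```

So a node in a 10.x VPC ends up with 172.20.0.10 as its cluster DNS address, which matches the `nameserver` line seen in the pods earlier in the thread. The switch presumably exists to keep the service range from overlapping a 10.x VPC, but the excerpt itself does not say so.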