What happened?
Cluster DNS resolution isn't working on a nodegroup:
I deployed an EKS cluster as follows:
eksctl create cluster --name foo --tags "key=val" --region us-east-1 --zones us-east-1a,us-east-1b --nodegroup-name foo --node-type m5.large --nodes-min 2 --nodes-max 4 --ssh-access --ssh-public-key=foo --node-ami auto --node-private-networking --node-labels "partition=foo" --asg-access --cfn-role-arn arn:aws:iam::xyz
Then deployed a nodegroup as follows:
eksctl create nodegroup --cluster foo --region us-east-1 --name foo-bar --node-type m5.large --nodes 1 --nodes-min 1 --nodes-max 2 --ssh-access --ssh-public-key=foo --node-ami auto --node-private-networking --node-labels "partition=bar" --asg-access --cfn-role-arn arn:aws:iam::xyz
Then ran kubectl create -f with this yaml.
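Roughly, the pod spec is a plain busybox sleeper along these lines (an illustrative sketch; the nodeSelector on the partition label is what pins it to one nodegroup):

apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  nodeSelector:
    partition: foo        # switched to "bar" for the failing case below
  containers:
  - name: busybox
    image: busybox:1.28   # pin a tag with a working nslookup
    command: ["sleep", "3600"]
  restartPolicy: Always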
Then ran kubectl exec -ti busybox -- nslookup kubernetes.default to test DNS on a partition: foo node and the output is OK:
Server: 10.100.0.10
Address 1: 10.100.0.10 ip-10-100-0-10.ec2.internal
Name: kubernetes.default
Address 1: 10.100.0.1 ip-10-100-0-1.ec2.internal
But after modifying the pod yaml to run on a partition: bar node, the above command fails:
Server: 10.100.0.10
Address 1: 10.100.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
What you expected to happen?
I expected the pod on the "bar" node to be able to resolve cluster DNS.
How to reproduce it?
See above
Anything else we need to know?
eksctl 0.1.17 (latest) installed via homebrew on OSX Mojave.
Versions
Please paste in the output of these commands:
$ eksctl version
[ℹ] version.Info{BuiltAt:"", GitCommit:"", GitTag:"0.1.17"}
$ uname -a
Darwin dschott-mbp.local 18.2.0 Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64 x86_64
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T19:44:19Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-eks-6bad6d", GitCommit:"6bad6d9c768dc0864dab48a11653aa53b5a47043", GitTreeState:"clean", BuildDate:"2018-12-06T23:13:14Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Also include your version of heptio-authenticator-aws
weaveworks/tap/eksctl-aws-iam-authenticator: stable 0.3.0
A tool to use AWS IAM credentials to authenticate to a Kubernetes cluster
https://github.com/kubernetes-sigs/aws-iam-authenticator
/usr/local/Cellar/eksctl-aws-iam-authenticator/0.3.0 (3 files, 17.5MB) *
Built from source on 2018-12-26 at 09:10:51
From: https://github.com/weaveworks/homebrew-tap/blob/master/Formula/eksctl-aws-iam-authenticator.rb
Logs
See above
For the future, I wouldn't recommend using busybox or Alpine for any DNS tests, as there are differences in glibc vs musl behaviour; I would recommend testing with e.g. Ubuntu instead (see the one-liner below). Did you try not using --node-ami auto and just using the default AMI? I can see there was an AMI update, and I will certainly open a PR to add those new AMIs.
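For instance, something along these lines (just an illustrative one-liner; dnsutils provides dig):

kubectl run dns-test --rm -ti --restart=Never --image=ubuntu -- \
  bash -c "apt-get update -qq && apt-get install -y -qq dnsutils && dig +short kubernetes.default.svc.cluster.local"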
I really don't see any difference between the two nodegroups you have. I'm going to try reproducing it, but I cannot see how this could happen, unless it's a random flake that has nothing to do with the fact that there are two nodegroups. Also, for the future, we have --config-file now, and there is a multi-nodegroup example you might want to look at; a rough sketch is below.
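Something like this (field names per the ClusterConfig schema; adjust the apiVersion to whatever your eksctl version expects):

apiVersion: eksctl.io/v1alpha5   # may differ depending on eksctl version
kind: ClusterConfig
metadata:
  name: foo
  region: us-east-1
nodeGroups:
  - name: foo
    instanceType: m5.large
    minSize: 2
    maxSize: 4
    privateNetworking: true
    labels:
      partition: foo
  - name: foo-bar
    instanceType: m5.large
    desiredCapacity: 1
    minSize: 1
    maxSize: 2
    privateNetworking: true
    labels:
      partition: bar

and then: eksctl create cluster --config-file=cluster.yaml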
I can confirm that there is an issue. I just ran an Ubuntu image as a DaemonSet (a rough sketch of the manifest is below, after the output), and this is what I'm seeing.
[0] >> kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
dns-test-2jwl5 1/1 Running 0 1m 192.168.123.159 ip-192-168-110-65.ec2.internal <none>
dns-test-kjmzm 1/1 Running 0 1m 192.168.83.26 ip-192-168-88-246.ec2.internal <none>
dns-test-mcktz 1/1 Running 0 1m 192.168.95.23 ip-192-168-76-4.ec2.internal <none>
[0] >> for i in dns-test-2jwl5 dns-test-kjmzm dns-test-mcktz ; do kubectl exec -ti $i dig -- +short kubernetes.default.svc.cluster.local ; done
10.100.0.1
10.100.0.1
;; connection timed out; no servers could be reached
command terminated with exit code 9
[9] >>
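For anyone who wants to reproduce this, a DaemonSet along these lines should do (roughly what I used; installing dnsutils is just one way to get dig into an Ubuntu image):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dns-test
spec:
  selector:
    matchLabels:
      app: dns-test
  template:
    metadata:
      labels:
        app: dns-test
    spec:
      containers:
      - name: dns-test
        image: ubuntu:18.04
        # install dig, then stay alive so we can kubectl exec into each pod
        command: ["bash", "-c", "apt-get update && apt-get install -y dnsutils && sleep infinity"]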
I believe this is due to the fact that nodegroups are isolated in separate security groups, so pods on one nodegroup cannot reach the kube-dns pods running on the other; we need to discuss and consider a few relatively significant changes. Thanks a lot for reporting this. We might also want to cover this in the integration tests.
I will try to come up with a fix as soon as I can, as this implies that any cluster with more than one nodegroup has broken DNS.
If anyone needs to fix a running cluster in the meantime, you can patch the security groups of each of the nodegroups to allow ingress on TCP & UDP port 53 (for example, with something like the commands below). Alternatively, you can use only one nodegroup (no matter which one), or convert the DNS deployment into a daemonset.
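Roughly, with the AWS CLI (the security group IDs here are placeholders; open the ports into the SG of the nodegroup hosting the kube-dns pods, from the SG of the other nodegroup, and repeat for each pair as needed):

aws ec2 authorize-security-group-ingress --group-id sg-aaaa1111 --protocol tcp --port 53 --source-group sg-bbbb2222
aws ec2 authorize-security-group-ingress --group-id sg-aaaa1111 --protocol udp --port 53 --source-group sg-bbbb2222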
That was quick! Thanks so much @errordeveloper for looking into it.
Aiming to cut the release tomorrow.
0.1.18 is out now.