What kops version are you running? The command kops version will display this information.
$ kops version
Version 1.9.1
What version of Kubernetes are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:47Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
What cloud provider are you using?
AWS
What commands did you run? What is the simplest way to reproduce this issue?
Created a cluster within an existing, empty AWS default VPC:
export VPC_ID=xxx
export SUBNET_IDS=subnet-xxx
export KOPS_STATE_STORE=s3://xxxx.k8s.local
export NAME=xxxx.k8s.local
kops create cluster --state=${KOPS_STATE_STORE} --cloud=aws --zones=us-east-1a --node-count=2 --node-size=t2.medium --master-size=t2.small ${NAME} --subnets=${SUBNET_IDS}
After waiting for the cluster to come up, I got the master and two worker nodes as expected:
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-172-31-82-164.ec2.internal Ready master 22m v1.9.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.small,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1a,kops.k8s.io/instancegroup=master-us-east-1a,kubernetes.io/hostname=ip-172-31-82-164.ec2.internal,kubernetes.io/role=master,node-role.kubernetes.io/master=
ip-172-31-91-241.ec2.internal Ready node 21m v1.9.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1a,kops.k8s.io/instancegroup=nodes,kubernetes.io/hostname=ip-172-31-91-241.ec2.internal,kubernetes.io/role=node,node-role.kubernetes.io/node=
ip-172-31-94-234.ec2.internal Ready node 22m v1.9.6 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t2.medium,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-1,failure-domain.beta.kubernetes.io/zone=us-east-1a,kops.k8s.io/instancegroup=nodes,kubernetes.io/hostname=ip-172-31-94-234.ec2.internal,kubernetes.io/role=node,node-role.kubernetes.io/node=
However, it seems like DNS isn't working: nslookup in busybox fails on both nodes:
$ kubectl delete po busybox; kubectl run -i --tty busybox --image=busybox --restart=Never --overrides='{ "apiVersion": "v1", "spec": { "nodeSelector": {"kubernetes.io/hostname":"ip-172-31-91-241.ec2.internal"} } }' -- nslookup kubernetes.default
pod "busybox" deleted
If you don't see a command prompt, try pressing enter.
Address 1: 100.64.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
pod default/busybox terminated (Error)
$ kubectl delete po busybox; kubectl run -i --tty busybox --image=busybox --restart=Never --overrides='{ "apiVersion": "v1", "spec": { "nodeSelector": {"kubernetes.io/hostname":"ip-172-31-94-234.ec2.internal"} } }' -- nslookup kubernetes.default
pod "busybox" deleted
If you don't see a command prompt, try pressing enter.
Address 1: 100.64.0.10
nslookup: can't resolve 'kubernetes.default'
pod default/busybox terminated (Error)
Sporadically the command succeeds:
Name: kubernetes.default
Address 1: 100.64.0.1 kubernetes.default.svc.cluster.local
What happened after the commands executed?
See above; DNS does not seem to be working. Only sporadically does a lookup succeed.
What did you expect to happen?
nslookup should work
Please provide your cluster manifest. Execute
kops get --name my.example.com -oyaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-05-31T20:53:25Z
  name: xxx.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://xxx/xxx.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-us-east-1a
      name: a
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.9.6
  masterPublicName: xxx.k8s.local
  networkCIDR: 172.31.0.0/16
  networkID: xxx
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.31.80.0/20
    id: xxx
    name: us-east-1a
    type: Public
    zone: us-east-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-05-31T20:53:25Z
  labels:
    kops.k8s.io/cluster: xxx.local
  name: master-us-east-1a
spec:
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  machineType: t2.small
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-1a
  role: Master
  subnets:
  - us-east-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-05-31T20:53:25Z
  labels:
    kops.k8s.io/cluster: xxx.k8s.local
  name: nodes
spec:
  image: kope.io/k8s-1.9-debian-jessie-amd64-hvm-ebs-2018-03-11
  machineType: t2.medium
  maxSize: 2
  minSize: 2
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - us-east-1a
☝️ As a side note (which should probably be a separate issue): I think kops validate should catch these DNS failures.
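(For reference, the validation being referred to would be run roughly as sketched below; this assumes the NAME and KOPS_STATE_STORE variables exported earlier, and uses a one-off busybox pod as the kind of DNS check that validate does not perform today.)
# Sketch: structural validation only - checks instance groups and node readiness,
# not in-cluster DNS resolution.
$ kops validate cluster --name ${NAME} --state ${KOPS_STATE_STORE}
# A manual DNS smoke test (one-off busybox pod, removed on exit):
$ kubectl run -i --tty dnstest --image=busybox --restart=Never --rm -- nslookup kubernetes.default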
I'm running into a similar problem. Have you been able to figure this one out?
@jueast08 unfortunately not - it was fine in my case to create a new VPC instead, so I settled for that solution.
I am having the same issue.
=>> kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-28T15:20:58Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
=>> kops version
Version 1.10.0
Also having the same issue
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-28T15:20:58Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.6", GitCommit:"a21fdbd78dde8f5447f5f6c331f7eb6f80bd684e", GitTreeState:"clean", BuildDate:"2018-07-26T10:04:08Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Specifically, it looks like a newly-created pod does not have a route to the kube-dns service via the default gateway.
Using a netshoot pod: kubectl run -i --tty netshoot --image=nicolaka/netshoot --restart=Never -- sh
kube-dns service route
traceroute to 100.64.0.10 (100.64.0.10), 30 hops max, 46 byte packets
1 100.96.1.1 (100.96.1.1) 0.004 ms 0.006 ms 0.002 ms
2 * * *
3 * * *
4
but can route to google.com:
/ # traceroute 216.58.194.174
traceroute to 216.58.194.174 (216.58.194.174), 30 hops max, 46 byte packets
1 100.96.1.1 (100.96.1.1) 0.005 ms 0.006 ms 0.002 ms
2 50.112.0.94 (50.112.0.94) 5.286 ms 50.112.0.116 (50.112.0.116) 21.533 ms^C
routing table
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 100.96.1.1 0.0.0.0 UG 0 0 0 eth0
100.96.1.0 * 255.255.255.0 U 0 0 0 eth0
/etc/resolv.conf
nameserver 100.64.0.10
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
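(Worth noting: with kubenet networking, as in the manifest above, the controller manager is expected to program one route per node - e.g. 100.96.1.0/24 pointing at that node's instance - into the VPC route table; if those routes are missing, packets from a pod to the kube-dns pod on another node die after the first hop, exactly like the traceroute above. A rough check, assuming the AWS CLI and the cluster's VPC ID in VPC_ID:)
# Sketch: list the routes in the cluster VPC's route table(s).
# Expect one 100.96.x.0/24 entry per node, targeting that node's instance.
$ aws ec2 describe-route-tables \
    --filters "Name=vpc-id,Values=${VPC_ID}" \
    --query "RouteTables[].Routes[].[DestinationCidrBlock,InstanceId]"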
The kube-dns containers themselves appear to be healthy:
I1101 21:29:21.707872 1 main.go:51] Version v1.14.8.3
I1101 21:29:21.707977 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1} {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})
I1101 21:29:21.708065 1 dnsprobe.go:75] Starting dnsProbe {Label:kubedns Server:127.0.0.1:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
I1101 21:29:21.708732 1 dnsprobe.go:75] Starting dnsProbe {Label:dnsmasq Server:127.0.0.1:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:1}
I1101 21:29:19.846201 1 main.go:74] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I1101 21:29:19.846524 1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053]
I1101 21:29:20.118142 1 nanny.go:116] dnsmasq[9]: started, version 2.78 cachesize 1000
I1101 21:29:20.118171 1 nanny.go:116] dnsmasq[9]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I1101 21:29:20.118177 1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I1101 21:29:20.118234 1 nanny.go:119]
W1101 21:29:20.118245 1 nanny.go:120] Got EOF from stdout
I1101 21:29:20.118182 1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I1101 21:29:20.118266 1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain cluster.local
I1101 21:29:20.118274 1 nanny.go:116] dnsmasq[9]: reading /etc/resolv.conf
I1101 21:29:20.118285 1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I1101 21:29:20.118289 1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I1101 21:29:20.118293 1 nanny.go:116] dnsmasq[9]: using nameserver 127.0.0.1#10053 for domain cluster.local
I1101 21:29:20.118298 1 nanny.go:116] dnsmasq[9]: using nameserver 172.31.0.2#53
I1101 21:29:20.118351 1 nanny.go:116] dnsmasq[9]: read /etc/hosts - 7 addresses
I1101 21:29:17.694858 1 dns.go:48] version: 1.14.10
I1101 21:29:17.695547 1 server.go:69] Using configuration read from directory: /kube-dns-config with period 10s
I1101 21:29:17.695624 1 server.go:121] FLAG: --alsologtostderr="false"
I1101 21:29:17.695639 1 server.go:121] FLAG: --config-dir="/kube-dns-config"
I1101 21:29:17.695645 1 server.go:121] FLAG: --config-map=""
I1101 21:29:17.695649 1 server.go:121] FLAG: --config-map-namespace="kube-system"
I1101 21:29:17.695653 1 server.go:121] FLAG: --config-period="10s"
I1101 21:29:17.695692 1 server.go:121] FLAG: --dns-bind-address="0.0.0.0"
I1101 21:29:17.695696 1 server.go:121] FLAG: --dns-port="10053"
I1101 21:29:17.695717 1 server.go:121] FLAG: --domain="cluster.local."
I1101 21:29:17.695726 1 server.go:121] FLAG: --federations=""
I1101 21:29:17.695731 1 server.go:121] FLAG: --healthz-port="8081"
I1101 21:29:17.695736 1 server.go:121] FLAG: --initial-sync-timeout="1m0s"
I1101 21:29:17.695740 1 server.go:121] FLAG: --kube-master-url=""
I1101 21:29:17.695745 1 server.go:121] FLAG: --kubecfg-file=""
I1101 21:29:17.695749 1 server.go:121] FLAG: --log-backtrace-at=":0"
I1101 21:29:17.695755 1 server.go:121] FLAG: --log-dir=""
I1101 21:29:17.695759 1 server.go:121] FLAG: --log-flush-frequency="5s"
I1101 21:29:17.695763 1 server.go:121] FLAG: --logtostderr="true"
I1101 21:29:17.695767 1 server.go:121] FLAG: --nameservers=""
I1101 21:29:17.695770 1 server.go:121] FLAG: --stderrthreshold="2"
I1101 21:29:17.695774 1 server.go:121] FLAG: --v="2"
I1101 21:29:17.695778 1 server.go:121] FLAG: --version="false"
I1101 21:29:17.695784 1 server.go:121] FLAG: --vmodule=""
I1101 21:29:17.695869 1 server.go:169] Starting SkyDNS server (0.0.0.0:10053)
I1101 21:29:17.696238 1 server.go:179] Skydns metrics enabled (/metrics:10055)
I1101 21:29:17.696252 1 dns.go:188] Starting endpointsController
I1101 21:29:17.696256 1 dns.go:191] Starting serviceController
I1101 21:29:17.696331 1 dns.go:184] Configuration updated: {TypeMeta:{Kind: APIVersion:} Federations:map[] StubDomains:map[] UpstreamNameservers:[]}
I1101 21:29:17.696396 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I1101 21:29:17.696408 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I1101 21:29:18.196571 1 dns.go:222] Initialized services and endpoints from apiserver
I1101 21:29:18.196596 1 server.go:137] Setting up Healthz Handler (/readiness)
I1101 21:29:18.196608 1 server.go:142] Setting up cache handler (/cache)
I1101 21:29:18.196615 1 server.go:128] Status HTTP port 8081
I was struggling with this exact same issue and finally cracked it. In my case, Kubernetes was unable to determine which AWS route table to interact with to create routes on-the-fly.
Check the output of the kube-controller-manager-ip-... pod in the cluster:
kubectl get pods --namespace=kube-system
Locate the correct pod and check the logs:
kubectl logs kube-controller-manager-ip-... --namespace=kube-system
I had this error being generated over and over:
E0111 17:35:15.121422 1 route_controller.go:117] Couldn't reconcile node routes: error listing routes: found multiple matching AWS route tables for AWS cluster: kubernetes.example.com
For me, it was because I had multiple route tables defined in AWS with the tag KubernetesCluster and value kubernetes.example.com (one route table for the public subnets and one for private subnets). As soon as I ensured that my master and node instances were in the public subnets only and removed the KubernetesCluster tag from the private subnet, the problem immediately went away.
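(A rough way to find and clean up the duplicate tag, assuming the AWS CLI; rtb-0private below is a placeholder for whichever route table Kubernetes should not manage:)
# List every route table carrying the cluster's KubernetesCluster tag;
# the route controller expects exactly one match.
$ aws ec2 describe-route-tables \
    --filters "Name=tag:KubernetesCluster,Values=kubernetes.example.com" \
    --query "RouteTables[].RouteTableId"
# Remove the tag from the table(s) Kubernetes should NOT manage.
$ aws ec2 delete-tags --resources rtb-0private --tags Key=KubernetesCluster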
Hope that helps.
Good catch @danielgrant! I resolved my issue by actually not going the existing-VPC route and instead peering the newly-created VPC with my existing VPC, which actually feels like a cleaner solution IMO. But this is still invaluable info. 👏
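(For completeness, the peering approach amounts to something like the following; all IDs and CIDRs here are placeholders, not the exact commands used:)
# Peer the kops-created VPC with the pre-existing VPC (hypothetical IDs).
$ aws ec2 create-vpc-peering-connection --vpc-id vpc-kops111 --peer-vpc-id vpc-exist222
$ aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-3333
# Route each VPC's traffic for the other side's CIDR over the peering connection.
$ aws ec2 create-route --route-table-id rtb-kops111 --destination-cidr-block 172.31.0.0/16 --vpc-peering-connection-id pcx-3333
$ aws ec2 create-route --route-table-id rtb-exist222 --destination-cidr-block 10.0.0.0/16 --vpc-peering-connection-id pcx-3333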
We faced a very similar issue when deploying into an existing VPC with two route tables. In our case, the route tables were already existing when we created the cluster with kops, and kops did not apply the tag to either of them. In this case our error looked like
E0311 14:22:50.957633 1 route_controller.go:120] Couldn't reconcile node routes: error listing routes: unable to find route table for AWS cluster: foo.k8s.local
Manually applying the tag to the right routing table caused the cluster to recover with no more intervention.
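(In other words, the manual fix was essentially one tagging call; rtb-0abc123 is a placeholder, and the tag value must match the kops cluster name:)
# Tag the route table the cluster should manage so the route controller can find it.
$ aws ec2 create-tags --resources rtb-0abc123 --tags Key=KubernetesCluster,Value=foo.k8s.local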
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.