I have a k3s cluster that has been running fine for some time but suddenly started having problems with DNS and/or networking. Unfortunately I haven't been able to determine what caused it or even what exactly the problem is.
This issue seems related, but according to it, changing the coredns ConfigMap should be enough, and that should already be fixed in this release of k3s.
The first sign of trouble was that metrics-server didn't report metrics for nodes. I found out that it was because it couldn't fully scrape metrics and timed out. Further investigation led me to believe that it wasn't able to resolve the nodes' hostnames.
To work around the first problem, I added the flags --kubelet-insecure-tls and --kubelet-preferred-address-types=InternalIP. It works, but I don't like it; everything was working fine before without these flags.
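For reference, this is roughly how I added them (assuming the stock metrics-server deployment in kube-system):
$ kubectl -n kube-system edit deployment metrics-server
# the args in the metrics-server container spec end up looking something like:
#   - --kubelet-insecure-tls
#   - --kubelet-preferred-address-types=InternalIP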
After this, I realized that the problem was not isolated to metrics-server. Other pods in the cluster are also unable to resolve any hostnames (cluster services or public). I haven't been able to find a pattern to it. The cert-manager pod can resolve everything correctly, but my test pods cannot resolve anything no matter which node they run on, same as metrics-server.
It is probably also relevant to note that the nodes themselves can reach the internet just fine and can look up any public domain name directly.
I have also tried changing the coredns ConfigMap to use 8.8.8.8 instead of /etc/resolv.conf.
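Concretely, that change was just swapping the upstream line in the Corefile (the full Corefile is shown further down):
$ kubectl -n kube-system edit configmap coredns
# changed:
#   proxy . /etc/resolv.conf
# to:
#   proxy . 8.8.8.8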
System description
The cluster consists of 3 Raspberry Pis running Fedora IoT.
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
fili Ready master 104d v1.14.1-k3s.4 10.0.0.13 <none> Fedora 29.20190606.0 (IoT Edition) 5.1.6-200.fc29.aarch64 containerd://1.2.5+unknown
kili Ready <none> 97d v1.14.1-k3s.4 10.0.0.15 <none> Fedora 29.20190606.0 (IoT Edition) 5.1.6-200.fc29.aarch64 containerd://1.2.5+unknown
pippin Ready <none> 41d v1.14.1-k3s.4 10.0.0.2 <none> Fedora 29.20190606.0 (IoT Edition) 5.1.6-200.fc29.aarch64 containerd://1.2.5+unknown
Relevant logs
CoreDNS logs messages like the following when one of the pods is trying to reach a service in another namespace (gitea):
2019-06-14T16:36:16.234Z [ERROR] plugin/errors: 2 gitea.gitea. AAAA: unreachable backend: read udp 10.42.4.93:49037->10.0.0.1:53: i/o timeout
2019-06-14T16:36:16.234Z [ERROR] plugin/errors: 2 gitea.gitea. A: unreachable backend: read udp 10.42.4.93:59310->10.0.0.1:53: i/o timeout
This is from the start of the CoreDNS logs:
$ kubectl -n kube-system logs coredns-695688789-lm947
.:53
2019-06-12T19:01:15.388Z [INFO] CoreDNS-1.3.0
2019-06-12T19:01:15.389Z [INFO] linux/arm64, go1.11.4, c8f0e94
CoreDNS-1.3.0
linux/arm64, go1.11.4, c8f0e94
2019-06-12T19:01:15.389Z [INFO] plugin/reload: Running configuration MD5 = ef347efee19aa82f09972f89f92da1cf
2019-06-12T19:01:36.395Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:60396->10.0.0.1:53: i/o timeout
2019-06-12T19:01:39.397Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:56286->10.0.0.1:53: i/o timeout
2019-06-12T19:01:42.397Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:38791->10.0.0.1:53: i/o timeout
2019-06-12T19:01:45.399Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39417->10.0.0.1:53: i/o timeout
2019-06-12T19:01:48.401Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39276->10.0.0.1:53: i/o timeout
2019-06-12T19:01:51.401Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:36239->10.0.0.1:53: i/o timeout
2019-06-12T19:01:54.403Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:47541->10.0.0.1:53: i/o timeout
2019-06-12T19:01:57.404Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:39486->10.0.0.1:53: i/o timeout
2019-06-12T19:02:00.405Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:53211->10.0.0.1:53: i/o timeout
2019-06-12T19:02:03.405Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:53654->10.0.0.1:53: i/o timeout
2019-06-12T20:03:31.063Z [ERROR] plugin/errors: 2 update.containous.cloud. AAAA: unreachable backend: read udp 10.42.4.93:38504->10.0.0.1:53: i/o timeout
2019-06-12T20:03:36.064Z [ERROR] plugin/errors: 2 update.containous.cloud. AAAA: unreachable backend: read udp 10.42.4.93:38491->10.0.0.1:53: i/o timeout
2019-06-12T20:03:41.570Z [ERROR] plugin/errors: 2 api.github.com. AAAA: unreachable backend: read udp 10.42.4.93:56122->10.0.0.1:53: i/o timeout
2019-06-12T20:03:46.572Z [ERROR] plugin/errors: 2 api.github.com. AAAA: unreachable backend: read udp 10.42.4.93:39048->10.0.0.1:53: i/o timeout
2019-06-13T00:00:50.170Z [ERROR] plugin/errors: 2 stats.drone.ci. AAAA: unreachable backend: read udp 10.42.4.93:38093->10.0.0.1:53: i/o timeout
Cert-manager pod has working DNS:
$ kubectl exec -it -n utils cert-manager-66bc958d96-b6b7k -- nslookup gitea.gitea
nslookup: can't resolve '(null)': Name does not resolve
Name: gitea.gitea
Address 1: 10.43.111.72 gitea.gitea.svc.cluster.local
[lennart@legolas ~]$ kubectl exec -it -n utils cert-manager-66bc958d96-b6b7k -- nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve
Name: www.google.com
Address 1: 216.58.207.228 arn09s19-in-f4.1e100.net
Address 2: 2a00:1450:400f:80c::2004 arn09s19-in-x04.1e100.net
Debugging DNS with busybox pods:
[lennart@legolas ~]$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
busybox 1/1 Running 47 2d 10.42.4.90 pippin <none> <none>
busybox-fili 1/1 Running 26 25h 10.42.0.132 fili <none> <none>
busybox-kili 1/1 Running 1 116m 10.42.2.167 kili <none> <none>
[lennart@legolas ~]$ kubectl exec -it busybox -- nslookup www.google.com
;; connection timed out; no servers could be reached
command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it busybox -- nslookup gitea.gitea
;; connection timed out; no servers could be reached
command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it busybox-fili -- nslookup www.google.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: www.google.com
Address: 2a00:1450:400f:80a::2004
*** Can't find www.google.com: No answer
[lennart@legolas ~]$ kubectl exec -it busybox-fili -- nslookup gitea.gitea
;; connection timed out; no servers could be reached
command terminated with exit code 1
[lennart@legolas ~]$ kubectl exec -it busybox-kili -- nslookup www.google.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: www.google.com
Address: 2a00:1450:400f:807::2004
*** Can't find www.google.com: No answer
[lennart@legolas ~]$ kubectl exec -it busybox-kili -- nslookup gitea.gitea
;; connection timed out; no servers could be reached
command terminated with exit code 1
Description of coredns ConfigMap:
====
Corefile:
----
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    hosts /etc/coredns/NodeHosts {
        reload 1s
        fallthrough
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
NodeHosts:
----
10.0.0.13 fili
10.0.0.2 pippin
10.0.0.15 kili
Some IP-related dumps:
fili-ip-route.txt
fili-iptables-save.txt
kili-ip-route.txt
kili-iptables-save.txt
pippin-ip-route.txt
pippin-iptables-save.txt
If you made it through all that, kudos to you! Sorry for the long description.
Thanks for reporting this issue and all of the info @lentzi90 !
I think a good clue is that metrics-server needs kubelet-insecure-tls; that might indicate a cert issue. I am curious if there is some time drift between servers which may be causing an issue. If you aren't already syncing with NTP periodically, that might be a good thing to test and set up. There may be additional information in the k3s server or agent logs which would prove useful.
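For example, enabling time sync on Fedora IoT (assuming chrony is available) would look roughly like:
$ sudo systemctl enable --now chronyd
$ chronyc tracking   # check the current offset against the NTP source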
Sorry, actually you probably need kubelet-insecure-tls along with kubelet-preferred-address-types=InternalIP, as I think we only provide a cert for the hostname. The times from the log files look close enough that I am guessing time drift is not an issue. Can you provide some more info about the setup? Is it on a laptop, hosted, a VM, etc.? It is interesting that you are only getting IPv6 for nslookup www.google.com; from the iptables entries I am curious if the CNI has run out of IPs or is otherwise having issues. Do you have a lot of pods that restart, or do you otherwise remove and deploy a lot of pods? If you used the install script, k3s-killall.sh may help to reset the containers and networking on the nodes; you would then need to start the k3s server/agents again.
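Running it would look roughly like this (paths and unit names assume the standard install script layout):
$ sudo /usr/local/bin/k3s-killall.sh   # stops k3s and tears down containers and CNI networking
$ sudo systemctl start k3s             # on the server node
$ sudo systemctl start k3s-agent       # on the agent nodes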
Thanks for the fast response!
You were right about the insecure-tls part, the certs are just for the hostnames.
This is a bare metal setup with Raspberry Pis (specifically fili and kili are model 3B+ and pippin is 3B).
They are all connected to a switch which in turn is connected to my home router (IP 10.0.0.1).
I wouldn't say that I restart/deploy/remove a lot of pods, but this cluster has been running for several weeks so in that time maybe yes.
The master node is already over 100 days old :)
I did use the install script but the k3s-killall.sh script did not exist at the time.
Would it be enough to just restart the systemd units after running this script or would everything be wiped?
Here are some logs from the nodes. I'm afraid they are a bit messy and have been rotated at different times.
fili-k3s-server.txt
kili-k3s-agent.txt
pippin-k3s-agent.txt
Checking time for all nodes:
$ for node in fili kili pippin; do ssh $node date; done
Sat Jun 15 09:17:13 UTC 2019
Sat Jun 15 09:17:14 UTC 2019
Sat Jun 15 09:17:16 UTC 2019
Checking resolv.conf for all nodes:
$ for node in fili kili pippin; do ssh $node cat /etc/resolv.conf; done
# Generated by NetworkManager
nameserver 10.0.0.1
# Generated by NetworkManager
nameserver 10.0.0.1
# Generated by NetworkManager
nameserver 10.0.0.1
For reference, the systemd units I use:
# Server unit
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target
[Service]
Type=notify
EnvironmentFile=/etc/systemd/system/k3s.service.env
ExecStart=/usr/local/bin/k3s server --no-deploy=servicelb --kubelet-arg system-reserved=cpu=100m,memory=100Mi --kubelet-arg kube-reserved=cpu=200m,memory=300Mi
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
[Install]
WantedBy=multi-user.target
# Agent unit
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=exec
EnvironmentFile=/etc/systemd/system/k3s-agent.service.env
ExecStart=/usr/local/bin/k3s agent --kubelet-arg system-reserved=cpu=100m,memory=100Mi --kubelet-arg kube-reserved=cpu=200m,memory=100Mi
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Restart=always
[Install]
WantedBy=multi-user.target
I should probably also mention that SELinux is set to permissive and Firewalld is disabled.
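A quick way to verify both (standard commands, nothing k3s-specific):
$ getenforce                       # should print Permissive
$ systemctl is-enabled firewalld   # should print disabled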
I installed the killall script on all nodes and ran it on one node at a time. Unfortunately, it didn't help. :disappointed:
I've met the same problem: can't resolve pods by hostname.
Thanks for all the great info @lentzi90! Is 10.0.0.1 pointing to DNS running on the router? Is it possible to view the state of the home router to ensure that something like the NAT table hasn't filled up and that the DNS server is healthy? Rebooting the router and making sure the wires are still good (a ping test to google or something similar) might help. It might also help to perform nslookup tests against a specific server (maybe 8.8.8.8), or to start k3s pointed at a different resolv.conf. This seems like an issue where the network itself is having problems, or perhaps an upstream DNS change is causing issues. If it is an upstream DNS change, I would think that killing everything and restarting would help; maybe a reboot of the router and nodes is in order too.
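For the resolv.conf test, something along these lines should work (the path is just an example; k3s passes --resolv-conf through to the kubelet):
$ echo 'nameserver 8.8.8.8' | sudo tee /etc/k3s-resolv.conf
$ sudo k3s server --resolv-conf /etc/k3s-resolv.conf   # or add the flag to the systemd unit's ExecStart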
The switch may also be a suspect for dropping UDP packets.
@ericchiang In my situation there are no dropped UDP packets. Resolving addresses by service name is OK; only resolving by pod name fails.
dig @XXX some-service.default.svc.cluster.local OK
dig @XXX podname-n.some-service.default.svc.cluster.local FAILED with no address returned
@chennqqi it looks like you might have accidentally pinged someone else by mistake
I think this issue is unique in that DNS was working fine for a long period of time but is now having sporadic issues. I suspect that it is a general network issue, so I am not sure there is a lot we can do without being more heavy-handed and opinionated about the default resolv.conf settings for CoreDNS.
It looks like you are having a basic configuration issue with your pods @chennqqi; it might be worth looking over the instructions at https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pods and filing a new issue with lots of information if you are still having problems. For what it is worth, I was able to modify the example a little to verify that pod DNS is working (manifest sketch and results below):
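Roughly the manifest I used, adapted from that page (hostname values picked to match my node names; image and port are placeholders):
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: default-subdomain
spec:
  selector:
    name: busybox
  clusterIP: None
  ports:
  - name: foo
    port: 1234
---
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    name: busybox
spec:
  hostname: k3s-1
  subdomain: default-subdomain
  containers:
  - name: busybox
    image: busybox:1.28
    command: ["sleep", "3600"]
EOF
# plus a second busybox pod with hostname: k3s-2 scheduled on the other node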
[ 2019-06-19 17:56:27 ]
root@k3s-1:~$ kubectl exec -ti busybox -- nslookup k3s-1.default-subdomain.default.svc.cluster.local
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: k3s-1.default-subdomain.default.svc.cluster.local
Address 1: 10.42.1.3 k3s-1.default-subdomain.default.svc.cluster.local
[ 2019-06-19 17:56:34 ]
root@k3s-1:~$ kubectl exec -ti busybox -- nslookup default-subdomain.default.svc.cluster.local
Server: 10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local
Name: default-subdomain.default.svc.cluster.local
Address 1: 10.42.1.3 k3s-1.default-subdomain.default.svc.cluster.local
Address 2: 10.42.0.7 k3s-2.default-subdomain.default.svc.cluster.local
@erikwilson you're right. I checked my service.yml and added the subdomain; pods resolve OK now. Thank you very much!
@erikwilson the router at 10.0.0.1 is a very simple Netgear router for home use. I tried restarting it, but it didn't help. As far as I can see it is operating normally; all laptops, phones, and other devices connected to it work just fine.
Also, all nodes have been updated to k3s 0.6.1 now and rebooted without any effect.
The router is running a DHCP server configured to give out addresses from 10.0.0.2 to 10.0.0.254. It is using 8.8.8.8 and 8.8.4.4 for DNS.
I did some more debugging:
- Connected fili (master) and pippin (worker, running the CoreDNS pod at the time) directly to the router instead of to the switch. No effect.
- Edited /etc/resolv.conf to include 8.8.8.8 and 8.8.4.4 (in addition to the automatic 10.0.0.1) on all nodes. No effect.
I have run into #544 or similar before on Ubuntu with kubeadm, but I don't think this is the same issue. Actually, systemd-resolved is not running on any of the machines; could that be a problem?
Found a pattern: the node where CoreDNS is running is unable to nslookup www.google.com if I don't specify a server. (If I do nslookup www.google.com 8.8.8.8, they all work fine.) Any idea what this could mean?
# Note that coredns is running on node pippin
$ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-695688789-625vl 1/1 Running 0 8m5s 10.42.4.24 pippin <none> <none>
helm-install-traefik-lnx75 0/1 Completed 0 64m 10.42.2.17 kili <none> <none>
metrics-server-7cf965b7d-k4h5v 1/1 Running 0 5h16m 10.42.4.22 pippin <none> <none>
tiller-deploy-5d47d8c8f7-2m976 1/1 Running 0 5h16m 10.42.4.20 pippin <none> <none>
traefik-56688c4464-47t4h 1/1 Running 0 63m 10.42.2.18 kili <none> <none>
# Lookup from pod running on pippin fails
$ kubectl exec -it busybox-pippin -- nslookup www.google.com
;; connection timed out; no servers could be reached
command terminated with exit code 1
# Lookup on other node works
$ kubectl exec -it busybox-kili -- nslookup www.google.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: www.google.com
Address: 172.217.20.36
*** Can't find www.google.com: No answer
Now if I cordon pippin and kill the coredns pod it ends up on kili instead and I get the reverse result:
# Note that coredns is running on node kili
$ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-695688789-8jclm 1/1 Running 0 5m45s 10.42.2.20 kili <none> <none>
helm-install-traefik-lnx75 0/1 Completed 0 74m 10.42.2.17 kili <none> <none>
metrics-server-7cf965b7d-k4h5v 1/1 Running 0 5h26m 10.42.4.22 pippin <none> <none>
tiller-deploy-5d47d8c8f7-2m976 1/1 Running 0 5h26m 10.42.4.20 pippin <none> <none>
traefik-56688c4464-47t4h 1/1 Running 0 73m 10.42.2.18 kili <none> <none>
# Lookup from pod running on kili fails
$ kubectl exec -it busybox-kili -- nslookup www.google.com
;; connection timed out; no servers could be reached
command terminated with exit code 1
# Lookup on other node works
$ kubectl exec -it busybox-pippin -- nslookup www.google.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: www.google.com
Address: 216.58.207.196
*** Can't find www.google.com: No answer
Nothing interesting in the CoreDNS logs:
$ kubectl -n kube-system logs coredns-695688789-8jclm
.:53
2019-06-21T14:55:05.782Z [INFO] CoreDNS-1.3.0
2019-06-21T14:55:05.782Z [INFO] linux/arm64, go1.11.4, c8f0e94
CoreDNS-1.3.0
linux/arm64, go1.11.4, c8f0e94
2019-06-21T14:55:05.783Z [INFO] plugin/reload: Running configuration MD5 = ef347efee19aa82f09972f89f92da1cf
I'll give it another few days but if I don't find a solution after that I'll just reinstall.
I've met the same problem: can't resolve pods by hostname.
EDIT: Though I think there are two separate issues here: one about no DNS records being present for pods, and another about the larger DNS issue with the nodes that @lentzi90 is having. Maybe this should be split into two separate tickets?
EDIT2: Actually, I didn't have an issue after all. I was looking for the old .pod.cluster.local DNS entries, which apparently have been replaced by
Yes, it should be split into another ticket @varesa. This issue isn't about resolving a pod's DNS entries, it is about resolving any DNS from within a pod on a system which used to function fine.
Please look over this comment and file a new issue: https://github.com/rancher/k3s/issues/535#issuecomment-503800950
@erikwilson actually it turns out that this was also a misunderstanding on my part, in addition to being off-topic (even if the topic was already mixed here before). I cleaned up some of the irrelevant information as it's unlikely to be of use to anyone. Hope you find a resolution to the real issue here as well.
Alright, here is an update.
I reinstalled the cluster and it is now working fine. Metrics-server works without any extra args and all pods can resolve services just the way they should. I used the install script to install k3s after some struggles with other methods (see below).
Is there anything more I could do to find the cause of this, or should I just close the issue for now?
As a side note, I tried to use Ansible to set everything up this time instead of using the install script. It did not go well; basically there were problems with containerd all the time. Anyway, that doesn't belong in this issue. I will investigate some more and open a new issue if necessary.
I am also seeing this issue. My cluster is now ~32 days old and I am seeing failed DNS requests, both internal and external. The issue is very strange because it is sporadic: the DNS requests start working, but after a while they start failing again, and then the cycle repeats. I have turned on logging for coredns and hope to capture something useful.
Hi all, I'm currently facing this issue too.
ENV:

I don't know if this helps anyone else, but on my raspi 4 cluster I installed the docker.io package, and that's when DNS inside the cluster stopped working. apt-get remove docker.io solved this particular issue for me.
I ran into a very similar problem and solved it by restarting dockerd on the system.
Before the fix, my coredns pod had the exact same error logs as the OP:
2019-06-12T19:01:36.395Z [ERROR] plugin/errors: 2 3521834610273354494.4337686964088628928. HINFO: unreachable backend: read udp 10.42.4.93:60396->10.0.0.1:53: i/o timeout
Then in journalctl I saw this message, and followed this post.
May 4 10:44:44 ari dockerd: time="2020-05-04T10:44:44.186337155+08:00" level=warning msg="IPv4 forwarding is disabled. Networking will not work."
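In case it helps anyone hitting the same warning, checking and re-enabling IPv4 forwarding is roughly:
$ sysctl net.ipv4.ip_forward             # 0 means forwarding is off
$ sudo sysctl -w net.ipv4.ip_forward=1   # enable until the next reboot
$ echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-ip-forward.conf   # persist across reboots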
Similar to @zackb: after removing docker.io from my master node, rebooting, and killing all pods, everything returned to normal. Looks like it's related to Docker using the older iptables while k3s uses nftables -- mixing both is a recipe for disaster, it seems.
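A quick way to check which iptables backend is in use (the update-alternatives part applies to Debian/Ubuntu-style systems):
$ iptables --version                            # prints (nf_tables) or (legacy) next to the version
$ sudo update-alternatives --display iptables   # shows which variant the symlink points at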
I'm having the same issues, but restarting k3s (systemctl restart k3s) on the master and agents fixes it.
I have a similar problem. I even tried setting up nodelocaldns to improve the situation, but the problem still exists.
I have encountered this problem many times, and it has been bothering me for a long time.
It seems to be caused by wrong iptables rules, but I haven't found the root cause.
The direct symptom is that you cannot access other services by cluster IP, so all the pods running on the node cannot reach the kube-dns service. When I hit this problem, the following method works:
iptables -F
iptables -X
iptables -F -t nat
iptables -X -t nat
The commands above flush the iptables rules; then restart k3s to recreate them. That resolves the problem, but only temporarily -- after running for hours or days, it happens again.
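Roughly, the restart and a quick sanity check look like this (unit name depends on whether the node is a server or an agent):
$ sudo systemctl restart k3s             # or k3s-agent on worker nodes
$ sudo iptables-save | grep -c 'KUBE-'   # should be non-zero once kube-proxy has re-synced its rules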
The following is a snapshot of the iptables rules (the left is when the node is abnormal, the right is when it is normal):
