Hi,
Pods started in kind on Travis seem to have lost network connectivity:
https://travis-ci.com/lukasheinrich/examples/builds/113218001
sudo: required
language: go
go:
- "1.12"
services:
- docker
install: true
script:
- echo "Run your tests here"
before_install:
# Download and install Kind and kubectl
- GO111MODULE=on go get sigs.k8s.io/kind
- kind create cluster --config kind-config.yaml
- export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
- curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
- chmod +x ./kubectl
- curl http://google.com
- docker run --rm -it alpine sh -c 'apk add curl;curl http://google.com'
- ./kubectl run -it hello --image alpine -- sh -c 'apk add curl;curl http://google.com'
the last command gives
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
If you don't see a command prompt, try pressing enter.
ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.9/main: temporary error (try again later)
WARNING: Ignoring APKINDEX.b89edf6e.tar.gz: No such file or directory
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
ERROR: http://dl-cdn.alpinelinux.org/alpine/v3.9/community: temporary error (try again later)
WARNING: Ignoring APKINDEX.737f7e01.tar.gz: No such file or directory
ERROR: unsatisfiable constraints:
curl (missing):
required by: world[curl]
sh: curl: not found
This seems to have been introduced by a newer version of kind (or Travis), since I have used this setup successfully before. Docker itself is unaffected (see the docker run above).
Any ideas what this might be?
If this should be moved to an issue at https://github.com/kind-ci please let me know.
@lukasheinrich can you add a command to get the logs from kind and store them somewhere?
kind export logs
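In the Travis config that could look something like this (the ./kind-logs output path and the after_failure placement are just a suggestion, and the tarball would still need to be uploaded somewhere, e.g. a gist):
after_failure:
  # export the cluster logs and bundle them for inspection
  - kind export logs ./kind-logs
  - tar -czf kind-logs.tar.gz ./kind-logs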
@aojea here's a gist with the logs
https://gist.github.com/lukasheinrich/9804850552818b6cfc53906b7a7dcce9
(updated .travis.yml here: https://github.com/lukasheinrich/examples/blob/master/.travis.yml)
I notice some lines mentioning that the CNI plugin fails to initialize.
May 26 18:12:31 kind-worker containerd[46]: time="2019-05-26T18:12:31.786335741Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
May 26 18:12:48 kind-worker containerd[46]: time="2019-05-26T18:12:48.262263971Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
May 26 18:12:53 kind-worker containerd[46]: time="2019-05-26T18:12:53.433725012Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
May 26 18:12:58 kind-worker containerd[46]: time="2019-05-26T18:12:58.434854145Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Maybe that's a hint?
I'm trying to reproduce it: https://travis-ci.org/aojea/examples
The CNI plugin eventually gets ready; those errors can be a hint, but in this case they are not. If the CNI plugin were not ready, the nodes wouldn't reach Ready status.
I'd like to check whether the problem is that DNS doesn't resolve or that pods can't get traffic out of the container.
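A quick way to double-check the CNI angle from the Travis job (just the standard status commands): the nodes should be Ready and the CNI and CoreDNS pods in kube-system should be Running.
./kubectl get nodes
./kubectl get pods -n kube-system -o wide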
--- EDIT ---
It seems that the pods are not able to resolve external DNS names: https://travis-ci.org/aojea/examples/jobs/537557493
./kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
48.27s$ ./kubectl exec -ti busybox -- nslookup www.google.com
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'www.google.com'
command terminated with exit code 1
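For anyone following along, a busybox test pod like the one above can be created roughly like this (the exact manifest isn't in the thread; busybox:1.28 is the tag usually recommended because nslookup misbehaves in newer busybox images):
./kubectl run busybox --generator=run-pod/v1 --image=busybox:1.28 --command -- sleep 3600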
@aojea thanks for looking into it. Can I provide any more info?
thank you @lukasheinrich
I've made some progress on the issue: it's a problem resolving external domains. As you can see in the snippet below, the pod can reach external IPs but can't resolve external domain names.
$ ./kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
48.24s$ ./kubectl exec -ti busybox -- nslookup www.google.com || true
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'www.google.com'
command terminated with exit code 1
0.24s$ ./kubectl exec -ti busybox -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
link/ether 02:aa:21:f3:44:6b brd ff:ff:ff:ff:ff:ff
inet 10.244.1.2/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::aa:21ff:fef3:446b/64 scope link
valid_lft forever preferred_lft forever
135.31s$ ./kubectl exec -ti busybox -- traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 46 byte packets
1 10.244.1.1 (10.244.1.1) 0.004 ms 0.003 ms 0.001 ms
2 172.17.0.1 (172.17.0.1) 0.001 ms 0.006 ms 0.001 ms
3 10.10.1.5 (10.10.1.5) 1.549 ms 10.10.1.8 (10.10.1.8) 1.691 ms 1.465 ms
4 * * *
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
11 8.8.8.8 (8.8.8.8) 2.615 ms 2.088 ms 2.273 ms
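To confirm it's purely a name-resolution problem rather than blocked egress, a further check (not in the log above) is to query an external resolver directly from the pod; busybox's nslookup accepts the server as a second argument:
./kubectl exec -ti busybox -- nslookup www.google.com 8.8.8.8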
It seems the issue may be related to the DNS server used in Travis:
0.09s$ docker exec kind-control-plane sh -c 'cat /etc/resolv.conf'
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 169.254.169.254
search c.travis-ci-prod-2.internal google.internal
options timeout:5
options attempts:5
@tao12345666333 has hit a similar issue on his patch to allow kind to create its own networks.
So, as a short-term hack, should
docker exec kind-control-plane sh -c 'echo 8.8.8.8 >> /etc/resolv.conf'
work? Does kind take over the resolv.conf of its host?
Thanks for tracking this down.
We had a long debate about the resolv.conf handling that you can follow in this thread: https://github.com/kubernetes-sigs/kind/pull/484#issuecomment-489431439
I was doing some tests, and it seems that Travis doesn't allow overwriting the resolv.conf. As a workaround, we should check which of the documented setups works on Travis:
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/#coredns
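For example, the custom-nameserver route would mean editing the coredns ConfigMap in kube-system so that the forward plugin points at a public upstream instead of the node's /etc/resolv.conf (8.8.8.8 below is just an illustrative upstream; the exact Corefile layout depends on the Kubernetes version):
./kubectl -n kube-system edit configmap coredns
# in the Corefile, change:
#     forward . /etc/resolv.conf
# to something like:
#     forward . 8.8.8.8 8.8.4.4
# (in CI this would be done non-interactively, e.g. with kubectl patch or kubectl apply)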
Thanks. It's interesting, because kind (including networking) definitely used to work on Travis when I first set up the CI in February. I'm not sure at which point it broke.
I've found the problem: the ip-masq-agent adds a rule to not masquerade traffic to the link-local network, which breaks connectivity between the cluster DNS and the host DNS, since the host DNS IP address belongs to that range.
-A IP-MASQ-AGENT -d 169.254.0.0/16 -m comment --comment "ip-masq-agent: local traffic is not subject to MASQUERADE" -j RETURN
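As a quick check (not a proper fix), deleting that rule on each node should restore the path from CoreDNS to the host resolver; this assumes the chain lives in the nat table, where ip-masq-agent normally installs it (kind-control-plane shown, repeat for kind-worker):
docker exec kind-control-plane sh -c 'iptables -t nat -D IP-MASQ-AGENT -d 169.254.0.0/16 -m comment --comment "ip-masq-agent: local traffic is not subject to MASQUERADE" -j RETURN'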
for p in $(./kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do ./kubectl logs --namespace=kube-system $p; done
.:53
2019-05-27T11:19:36.353Z [INFO] CoreDNS-1.3.1
2019-05-27T11:19:36.353Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-05-27T11:19:36.353Z [INFO] plugin/reload: Running configuration MD5 = 599b9eb76b8c147408aed6a0bbe0f669
2019-05-27T11:19:42.355Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:52331->169.254.169.254:53: i/o timeout
2019-05-27T11:19:45.355Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:41512->169.254.169.254:53: i/o timeout
2019-05-27T11:19:46.354Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:49238->169.254.169.254:53: i/o timeout
2019-05-27T11:19:47.354Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:57052->169.254.169.254:53: i/o timeout
2019-05-27T11:19:50.355Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:49625->169.254.169.254:53: i/o timeout
2019-05-27T11:19:53.356Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:35345->169.254.169.254:53: i/o timeout
2019-05-27T11:19:56.357Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:59890->169.254.169.254:53: i/o timeout
2019-05-27T11:19:59.358Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:52423->169.254.169.254:53: i/o timeout
2019-05-27T11:20:01.214Z [ERROR] plugin/errors: 2 www.google.com. AAAA: read udp 10.244.0.3:37466->169.254.169.254:53: i/o timeout
2019-05-27T11:20:02.358Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:40731->169.254.169.254:53: i/o timeout
2019-05-27T11:20:05.217Z [ERROR] plugin/errors: 2 www.google.com.google.internal. AAAA: read udp 10.244.0.3:47136->169.254.169.254:53: i/o timeout
2019-05-27T11:20:05.359Z [ERROR] plugin/errors: 2 6195421070260342714.8357799154935943776. HINFO: read udp 10.244.0.3:50467->169.254.169.254:53: i/o timeout
2019-05-27T11:20:07.217Z [ERROR] plugin/errors: 2 www.google.com. AAAA: read udp 10.244.0.3:35281->169.254.169.254:53: i/o timeout
2019-05-27T11:20:11.220Z [ERROR] plugin/errors: 2 www.google.com.google.internal. AAAA: read udp 10.244.0.3:44330->169.254.169.254:53: i/o timeout
2019-05-27T11:20:13.222Z [ERROR] plugin/errors: 2 www.google.com. AAAA: read udp 10.244.0.3:58693->169.254.169.254:53: i/o timeout
2019-05-27T11:20:17.225Z [ERROR] plugin/errors: 2 www.google.com.google.internal. AAAA: read udp 10.244.0.3:38230->169.254.169.254:53: i/o timeout
2019-05-27T11:20:23.229Z [ERROR] plugin/errors: 2 www.google.com.google.internal. AAAA: read udp 10.244.0.3:55421->169.254.169.254:53: i/o timeout
2019-05-27T11:20:27.232Z [ERROR] plugin/errors: 2 www.google.com.c.eco-emissary-99515.internal. A: read udp 10.244.0.3:43721->169.254.169.254:53: i/o timeout
2019-05-27T11:20:29.233Z [ERROR] plugin/errors: 2 www.google.com.google.internal. A: read udp 10.244.0.3:50274->169.254.169.254:53: i/o timeout
2019-05-27T11:20:37.237Z [ERROR] plugin/errors: 2 www.google.com. A: read udp 10.244.0.3:42297->169.254.169.254:53: i/o timeout
2019-05-27T11:20:39.239Z [ERROR] plugin/errors: 2 www.google.com.c.eco-emissary-99515.internal. A: read udp 10.244.0.3:58562->169.254.169.254:53: i/o timeout
2019-05-27T11:20:41.241Z [ERROR] plugin/errors: 2 www.google.com.google.internal. A: read udp 10.244.0.3:38125->169.254.169.254:53: i/o timeout
2019-05-27T11:20:43.242Z [ERROR] plugin/errors: 2 www.google.com. A: read udp 10.244.0.3:47642->169.254.169.254:53: i/o timeout
2019-05-27T11:20:45.244Z [ERROR] plugin/errors: 2 www.google.com.c.eco-emissary-99515.internal. A: read udp 10.244.0.3:47169->169.254.169.254:53: i/o timeout
2019-05-27T11:20:47.245Z [ERROR] plugin/errors: 2 www.google.com.google.internal. A: read udp 10.244.0.3:51052->169.254.169.254:53: i/o timeout
.:53
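One more cross-check consistent with this diagnosis: resolution from the node itself should still work (the node uses the same 169.254.169.254 resolver and can clearly pull images), while the pods time out; getent should be available in the apt-based node image, though that's an assumption:
docker exec kind-control-plane sh -c 'getent hosts www.google.com'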
/assign @aojea
/assign
@lukasheinrich the patch was merged, but until the new node images are published you have to build the node image yourself:
- GO111MODULE=on go get sigs.k8s.io/kind
- kind build node-image --type apt
- kind create cluster --config kind-config.yaml --image kindest/node:latest
Please provide feedback
Hi @aojea
Thanks! I tried adding the image-build step to the Travis config, but it doesn't seem to be enough yet. Am I missing a step?
https://github.com/lukasheinrich/examples/blob/master/.travis.yml#L24
@lukasheinrich use
GO111MODULE=on go get -u -v sigs.k8s.io/kind@master
If that works for you, it's better to pin the version to a specific commit until there is a new release; in this case:
GO111MODULE=on go get -u -v sigs.k8s.io/kind@ab59991e3d0c45866efae345b170a47e193e9cdf
Thanks! This seems to have worked :)