We need to use Ubuntu instead of Google's Container-Optimized OS. When I change the create command as below:
kops create cluster --name=${KOPS_CLUSTER_NAME} --node-count=1 --node-size=n1-standard-2 --master-count=1 --zones us-central1-a --image "ubuntu-os-cloud/ubuntu-1804-bionic-v20190617" --state ${KOPS_STATE_STORE} --project=${PROJECT}
I noticed kube-dns keeps crashing. Even after editing the kubelet configuration and switching from kube-dns to CoreDNS, it still keeps crashing.
If I create the cluster with the default Container-Optimized OS image, there are no DNS problems; the cluster works fine after creation.
Kops version used:
Version 1.15.0 (git-9992b4055)
Thanks for the report @yxiay2k. I'll try to reproduce and will update with what I find.
@yxiay2k I haven't been able to reproduce that behavior. kube-dns comes up healthy each time I deploy using the command you provided.
Do you see anything in the logs for kube-dns? Are you seeing pods dying, or containers within them restarting?
Thanks for getting back to me!
When I first bring up the cluster, everything seems to be running without errors.
After I deploy my test images:
kubectl run redis-crypt-service --image=redis --requests=cpu=200m --expose --port=6379
kubectl run crypt-service --image=yxiay2k/cryptserver:v2.0 --requests=cpu=800m --expose --port=8076
If DNS works, cryptserver connects to Redis successfully and logs:
Connected to Redis
From the master node, you can also curl http://cryptserver_clusterip:8076 successfully.
But on this cluster, the connection fails and cryptserver keeps restarting:
NAMESPACE NAME READY STATUS RESTARTS AGE
default crypt-service-7d564cbbf5-wrgbn 0/1 Error 2 79s
default redis-crypt-service-7c468dfcf-pvjjp 1/1 Running 0 93s
And I saw kube-dns restarting too:
kube-system kube-dns-5fdb85bb5b-4jbjc 3/3 Running 2 6m16s
kube-system kube-dns-5fdb85bb5b-54gdh 3/3 Running 2 7m17s
When I describe the kube-dns pod, I see the following events. Are they helpful?
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned kube-system/kube-dns-5fdb85bb5b-4jbjc to nodes-p72f
Normal Pulled 18m kubelet, nodes-p72f Container image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.13" already present on machine
Normal Created 18m kubelet, nodes-p72f Created container kubedns
Normal Started 18m kubelet, nodes-p72f Started container kubedns
Normal Pulling 18m kubelet, nodes-p72f Pulling image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.13"
Normal Pulled 18m kubelet, nodes-p72f Successfully pulled image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.13"
Normal Pulling 18m kubelet, nodes-p72f Pulling image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.13"
Normal Pulled 18m kubelet, nodes-p72f Successfully pulled image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.13"
Normal Killing 12m kubelet, nodes-p72f Container dnsmasq failed liveness probe, will be restarted
Normal Created 12m (x2 over 18m) kubelet, nodes-p72f Created container dnsmasq
Normal Started 12m (x2 over 18m) kubelet, nodes-p72f Started container dnsmasq
Normal Pulled 12m kubelet, nodes-p72f Container image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.13" already present on machine
Normal Created 12m (x2 over 18m) kubelet, nodes-p72f Created container sidecar
Normal Started 12m (x2 over 18m) kubelet, nodes-p72f Started container sidecar
Normal Pulled 12m kubelet, nodes-p72f Container image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.13" already present on machine
Warning Unhealthy 12m kubelet, nodes-p72f Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 98s (x16 over 13m) kubelet, nodes-p72f Liveness probe failed: HTTP probe failed with statuscode: 503
BTW, this is a GCP trial account; we are still at the stage of evaluating GCP. We have used AWS much more and never had problems with kops on AWS before.
Hopefully the following logs are helpful too, @robinpercy:
$ kubectl logs kube-dns-5fdb85bb5b-4jbjc dnsmasq --namespace=kube-system
I0212 21:18:19.805682 1 main.go:74] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053 --min-port=1024] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0212 21:18:19.805893 1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053 --min-port=1024]
I0212 21:18:20.320678 1 nanny.go:119]
W0212 21:18:20.320715 1 nanny.go:120] Got EOF from stdout
I0212 21:18:20.322200 1 nanny.go:116] dnsmasq[11]: started, version 2.78 cachesize 1000
I0212 21:18:20.322241 1 nanny.go:116] dnsmasq[11]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0212 21:18:20.322251 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0212 21:18:20.322259 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0212 21:18:20.322265 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0212 21:18:20.322274 1 nanny.go:116] dnsmasq[11]: reading /etc/resolv.conf
I0212 21:18:20.322283 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0212 21:18:20.322289 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0212 21:18:20.322295 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0212 21:18:20.322350 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.53#53
I0212 21:18:20.322357 1 nanny.go:116] dnsmasq[11]: read /etc/hosts - 7 addresses
I0212 21:22:28.196837 1 nanny.go:116] dnsmasq[11]: Maximum number of concurrent DNS queries reached (max: 150)
I0212 21:22:38.215485 1 nanny.go:116] dnsmasq[11]: Maximum number of concurrent DNS queries reached (max: 150)
I0212 21:22:48.226649 1 nanny.go:116] dnsmasq[11]: Maximum number of concurrent DNS queries reached (max: 150)
.......
Thanks @yxiay2k. I just had a chance to deploy those two workloads and saw the exact same behavior you described. Very strange...
So you've had the same versions of cryptserver and redis running successfully on AWS? Was it on a similar single-node cluster with similar CPU and memory available?
@robinpercy Yes, we have been running these images for a while on Kubernetes clusters on AWS, Azure, and on-prem, deployed with kops, Kubespray, AKS, EKS, etc. We have never had any issues before.
Another thing: if I run the kops command with the default Google image, that is, remove the '--image "ubuntu-os-cloud/ubuntu-1804-bionic-v20190617"' part from my command, and deploy the same two images (cryptserver and redis) on the resulting cluster, they work just fine and DNS has no problems.
@yxiay2k it looks like we've run into this known systemd-resolved problem with Ubuntu: https://simonfredsted.com/1680. On Ubuntu 18.04, /etc/resolv.conf points at the local 127.0.0.53 stub resolver, and kubelet passes that file into the kube-dns pod, creating a forwarding loop. That matches the "using nameserver 127.0.0.53#53" and "Maximum number of concurrent DNS queries reached" lines in your dnsmasq logs.
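You can confirm the stub resolver on the node itself. A quick check (assuming a stock Ubuntu 18.04 image):

```sh
# /etc/resolv.conf is a symlink to the systemd-resolved stub config
readlink -f /etc/resolv.conf          # -> /run/systemd/resolve/stub-resolv.conf
cat /etc/resolv.conf                  # shows "nameserver 127.0.0.53"

# The real upstream nameservers are kept here instead:
cat /run/systemd/resolve/resolv.conf
```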
I've verified the workaround from that post by editing /etc/sysconfig/kubelet on the node to include --resolv-conf=/run/systemd/resolve/resolv.conf in DAEMON_ARGS, then issuing systemctl restart kubelet. After killing the kube-dns and crypt-service pods, the new crypt-service pod was able to resolve redis. A sketch of those steps is below.
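Roughly, on the affected node (a sketch; it assumes kops wrote the kubelet flags to /etc/sysconfig/kubelet as on my test node, and the pod labels assume the deployments were created with kubectl run as above):

```sh
# Prepend the flag to kubelet's DAEMON_ARGS and restart kubelet
sudo sed -i 's|^DAEMON_ARGS="|DAEMON_ARGS="--resolv-conf=/run/systemd/resolve/resolv.conf |' /etc/sysconfig/kubelet
sudo systemctl restart kubelet

# Recreate the affected pods so they pick up the corrected resolv.conf
kubectl delete pod -n kube-system -l k8s-app=kube-dns
kubectl delete pod -l run=crypt-service
```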
In theory, you should be able to specify that resolv.conf in the cluster spec, as per https://github.com/kubernetes/kops/blob/master/docs/cluster_spec.md#kubelet, but I haven't tested that yet.
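For reference, a minimal sketch of what that would look like in the cluster spec (untested; field name per the linked doc):

```yaml
# In `kops edit cluster`: point kubelet at the real resolv.conf
# instead of the systemd-resolved stub
spec:
  kubelet:
    resolvConf: /run/systemd/resolve/resolv.conf
```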
It looks like we probably need to change the default behavior of kops for Ubuntu images, but I'll have to spend some more time investigating the implications.
@robinpercy this is fixed here https://github.com/kubernetes/kops/pull/8353, but maybe it should be backported to the release branches.
@robinpercy Thank you so much for your help!
I use "kops edit cluster", add that line to kubelet configurations and then run "kops update cluster".
It works fine for me now!
Should I close this issue or leave it open?
@hakman Thanks a lot for your info! This issue could be closed now.
Glad it's working @yxiay2k! And yes, please close this one.
@hakman thanks for that info. I'm torn on whether it should be back-ported, since it would be altering default behavior in a patch release. I'm inclined to leave it where it is but document the workaround in the version branches.
@robinpercy I cherry-picked the fix to 1.16 and 1.17 (still unreleased). Will try to find time to add something in the release notes for 1.15.
/close
@hakman: Closing this issue.