We need to use Ubuntu instead of Google's Container-Optimized OS. When I change the create command as below:
kops create cluster --name=${KOPS_CLUSTER_NAME} --node-count=1 --node-size=n1-standard-2 --master-count=1 --zones us-central1-a --image "ubuntu-os-cloud/ubuntu-1804-bionic-v20190617" --state ${KOPS_STATE_STORE} --project=${PROJECT}
I noticed kube-dns keeps crashing. Even after editing the kubelet configuration and switching from kube-dns to CoreDNS, it still keeps crashing.
If I create the cluster with the default Container-Optimized OS image, there are no DNS problems; the cluster works fine after creation.
Kops version used:
Version 1.15.0 (git-9992b4055)
Thanks for the report @yxiay2k. I'll try to reproduce and will update with what I find.
@yxiay2k I haven't been able to reproduce that behavior. kube-dns comes up healthy each time I deploy using the command you provided.
Do you see anything in the logs for kube-dns? Are you seeing pods dying, or containers within them restarting?
Thanks for getting back to me!
When I first bring up the cluster, everything seems to be running without errors.
After I deploy my test images:
kubectl run redis-crypt-service --image=redis --requests=cpu=200m --expose --port=6379
kubectl run crypt-service --image=yxiay2k/cryptserver:v2.0 --requests=cpu=800m --expose --port=8076
If DNS works, cryptserver connects to Redis successfully and logs:
Connected to Redis
From the master node, you can also curl http://cryptserver_clusterip:8076 successfully.
But on this cluster, the connection fails and cryptserver keeps restarting:
NAMESPACE NAME READY STATUS RESTARTS AGE
default crypt-service-7d564cbbf5-wrgbn 0/1 Error 2 79s
default redis-crypt-service-7c468dfcf-pvjjp 1/1 Running 0 93s
And I saw kube-dns restarting too:
kube-system kube-dns-5fdb85bb5b-4jbjc 3/3 Running 2 6m16s
kube-system kube-dns-5fdb85bb5b-54gdh 3/3 Running 2 7m17s
When I describe the kube-dns pod, I see the following events. Are they helpful?
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18m default-scheduler Successfully assigned kube-system/kube-dns-5fdb85bb5b-4jbjc to nodes-p72f
Normal Pulled 18m kubelet, nodes-p72f Container image "k8s.gcr.io/k8s-dns-kube-dns-amd64:1.14.13" already present on machine
Normal Created 18m kubelet, nodes-p72f Created container kubedns
Normal Started 18m kubelet, nodes-p72f Started container kubedns
Normal Pulling 18m kubelet, nodes-p72f Pulling image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.13"
Normal Pulled 18m kubelet, nodes-p72f Successfully pulled image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.13"
Normal Pulling 18m kubelet, nodes-p72f Pulling image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.13"
Normal Pulled 18m kubelet, nodes-p72f Successfully pulled image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.13"
Normal Killing 12m kubelet, nodes-p72f Container dnsmasq failed liveness probe, will be restarted
Normal Created 12m (x2 over 18m) kubelet, nodes-p72f Created container dnsmasq
Normal Started 12m (x2 over 18m) kubelet, nodes-p72f Started container dnsmasq
Normal Pulled 12m kubelet, nodes-p72f Container image "k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.14.13" already present on machine
Normal Created 12m (x2 over 18m) kubelet, nodes-p72f Created container sidecar
Normal Started 12m (x2 over 18m) kubelet, nodes-p72f Started container sidecar
Normal Pulled 12m kubelet, nodes-p72f Container image "k8s.gcr.io/k8s-dns-sidecar-amd64:1.14.13" already present on machine
Warning Unhealthy 12m kubelet, nodes-p72f Liveness probe failed: HTTP probe failed with statuscode: 503
Warning Unhealthy 98s (x16 over 13m) kubelet, nodes-p72f Liveness probe failed: HTTP probe failed with statuscode: 503
BTW, this is a GCP trial account; we are still at the stage of evaluating GCP. We have used AWS much more and never had problems with kops on AWS before.
Hopefully the following logs are helpful too, @robinpercy:
$ kubectl logs kube-dns-5fdb85bb5b-4jbjc dnsmasq --namespace=kube-system
I0212 21:18:19.805682 1 main.go:74] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053 --min-port=1024] true} /etc/k8s/dns/dnsmasq-nanny 10000000000}
I0212 21:18:19.805893 1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --dns-forward-max=150 --no-negcache --log-facility=- --server=/cluster.local/127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/in6.arpa/127.0.0.1#10053 --min-port=1024]
I0212 21:18:20.320678 1 nanny.go:119]
W0212 21:18:20.320715 1 nanny.go:120] Got EOF from stdout
I0212 21:18:20.322200 1 nanny.go:116] dnsmasq[11]: started, version 2.78 cachesize 1000
I0212 21:18:20.322241 1 nanny.go:116] dnsmasq[11]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0212 21:18:20.322251 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0212 21:18:20.322259 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0212 21:18:20.322265 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0212 21:18:20.322274 1 nanny.go:116] dnsmasq[11]: reading /etc/resolv.conf
I0212 21:18:20.322283 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in6.arpa
I0212 21:18:20.322289 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0212 21:18:20.322295 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain cluster.local
I0212 21:18:20.322350 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.53#53
I0212 21:18:20.322357 1 nanny.go:116] dnsmasq[11]: read /etc/hosts - 7 addresses
I0212 21:22:28.196837 1 nanny.go:116] dnsmasq[11]: Maximum number of concurrent DNS queries reached (max: 150)
I0212 21:22:38.215485 1 nanny.go:116] dnsmasq[11]: Maximum number of concurrent DNS queries reached (max: 150)
I0212 21:22:48.226649 1 nanny.go:116] dnsmasq[11]: Maximum number of concurrent DNS queries reached (max: 150)
.......
Thanks @yxiay2k. I just had a chance to deploy those two workloads and saw the exact same behavior you described. Very strange...
So you've had the same versions of cryptserver and redis running successfully on AWS? Was it on a similar single-node cluster with similar CPU and memory available?
@robinpercy Yes, we have been running these images for a while on Kubernetes clusters on AWS, Azure, and on-prem, deployed with kops, Kubespray, AKS, EKS, etc. We have never had any issues before.
Another thing: if I run the kops command with the default Google image, that is, remove the '--image "ubuntu-os-cloud/ubuntu-1804-bionic-v20190617"' part from my command, and deploy the same two images (cryptserver and redis) on the resulting cluster, they work just fine and DNS has no problems.
@yxiay2k it looks like we've run into this known systemd-resolved problem with Ubuntu: https://simonfredsted.com/1680. On Ubuntu 18.04, /etc/resolv.conf points at the local 127.0.0.53 stub resolver, and kubelet passes that file into the kube-dns pod, creating a forwarding loop. That matches the "using nameserver 127.0.0.53#53" and "Maximum number of concurrent DNS queries reached" lines in your dnsmasq logs.
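You can confirm the stub resolver on the node itself. A quick check (assuming a stock Ubuntu 18.04 image):

```sh
# /etc/resolv.conf is a symlink to the systemd-resolved stub config
readlink -f /etc/resolv.conf          # -> /run/systemd/resolve/stub-resolv.conf
cat /etc/resolv.conf                  # shows "nameserver 127.0.0.53"

# The real upstream nameservers are kept here instead:
cat /run/systemd/resolve/resolv.conf
```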
I've verified the workaround from that post by editing /etc/sysconfig/kubelet on the node to include --resolv-conf=/run/systemd/resolve/resolv.conf in DAEMON_ARGS, then issuing systemctl restart kubelet. After killing the kube-dns and crypt-service pods, the new crypt-service pod was able to resolve redis. A sketch of those steps is below.
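Roughly, on the affected node (a sketch; it assumes kops wrote the kubelet flags to /etc/sysconfig/kubelet as on my test node, and the pod labels assume the deployments were created with kubectl run as above):

```sh
# Prepend the flag to kubelet's DAEMON_ARGS and restart kubelet
sudo sed -i 's|^DAEMON_ARGS="|DAEMON_ARGS="--resolv-conf=/run/systemd/resolve/resolv.conf |' /etc/sysconfig/kubelet
sudo systemctl restart kubelet

# Recreate the affected pods so they pick up the corrected resolv.conf
kubectl delete pod -n kube-system -l k8s-app=kube-dns
kubectl delete pod -l run=crypt-service
```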
In theory, you should be able to specify that resolv.conf in the cluster spec, as per https://github.com/kubernetes/kops/blob/master/docs/cluster_spec.md#kubelet, but I haven't tested that yet.
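For reference, a minimal sketch of what that would look like in the cluster spec (untested; field name per the linked doc):

```yaml
# In `kops edit cluster`: point kubelet at the real resolv.conf
# instead of the systemd-resolved stub
spec:
  kubelet:
    resolvConf: /run/systemd/resolve/resolv.conf
```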
It looks like we probably need to change the default behavior of kops for Ubuntu images, but I'll have to spend some more time investigating the implications.
@robinpercy this is fixed here https://github.com/kubernetes/kops/pull/8353, but maybe it should be backported to the release branches.
@robinpercy Thank you so much for your help!
I use "kops edit cluster", add that line to kubelet configurations and then run "kops update cluster".
It works fine for me now!
Should I close this issue or leave it open?
@hakman Thanks a lot for your info! This issue could be closed now.
Glad it's working @yxiay2k! And yes, please close this one.
@hakman thanks for that info. I'm torn on whether it should be back-ported, since it would be altering default behavior in a patch release. I'm inclined to leave it where it is but document the workaround in the version branches.
@robinpercy I cherry-picked the fix to 1.16 and 1.17 (still unreleased). Will try to find time to add something in the release notes for 1.15.
/close
@hakman: Closing this issue.