BUG REPORT
kubeadm version (use kubeadm version): 1.12.1
Environment:
kubectl version (use kubectl version): 1.12.1
Kernel (uname -a): 4.15.0-36-generic
What happened: after kubeadm init and installing Calico, coredns never recovers from its crash loop and settles in an Error state.
The same installation method works with 1.11.0, where all pods reach the Running state.
How to reproduce it: install the latest Kubernetes via kubeadm.
@bravinash What is the output of kubectl -n kube-system describe pod <coredns-pod-name>?
I will try this later today myself.
@bravinash i cannot confirm the issue:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-x2l7f 2/2 Running 0 93s
kube-system coredns-576cbf47c7-6qn92 1/1 Running 0 93s
kube-system coredns-576cbf47c7-vk2h5 1/1 Running 0 93s
kube-system etcd-luboitvbox 1/1 Running 0 41s
kube-system kube-apiserver-luboitvbox 1/1 Running 0 35s
kube-system kube-controller-manager-luboitvbox 1/1 Running 0 45s
kube-system kube-proxy-np5s4 1/1 Running 0 93s
kube-system kube-scheduler-luboitvbox 1/1 Running 0 58s
everything is 1.12.1, control-plane, kubeadm, kubelet.
please provide more details about your setup.
/priority awaiting-more-evidence
@bravinash can you show the log of the crashlooping pod: kubectl logs -n kube-system <coredns pod name>.
You can see core-dns pods this way: kubectl get pods -n kube-system |grep coredns
Reproduced it:
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
$ dpkg -l |grep kube
ii kubeadm 1.12.1-00 amd64 Kubernetes Cluster Bootstrapping Tool
ii kubectl 1.12.1-00 amd64 Kubernetes Command Line Tool
ii kubelet 1.12.1-00 amd64 Kubernetes Node Agent
ii kubernetes-cni 0.6.0-00 amd64 Kubernetes CNI
$ uname -a
Linux ed-ipv6-1 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# kubeadm init --pod-network-cidr=192.168.0.0/16
...
You can now join any number of machines by running the following on each node as root:
...
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
$ kubectl apply -f https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
clusterrole.rbac.authorization.k8s.io/calico-node created
$ kubectl apply -f https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
configmap/calico-config created
service/calico-typha created
deployment.apps/calico-typha created
daemonset.extensions/calico-node created
customresourcedefinition.apiextensions.k8s.io/felixconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgppeers.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/bgpconfigurations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/ippools.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/hostendpoints.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/clusterinformations.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworkpolicies.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/globalnetworksets.crd.projectcalico.org created
customresourcedefinition.apiextensions.k8s.io/networkpolicies.crd.projectcalico.org created
serviceaccount/calico-node created
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-node-q6k75 2/2 Running 0 2m20s
coredns-576cbf47c7-gk59b 0/1 CrashLoopBackOff 3 27m
coredns-576cbf47c7-vz5kc 0/1 CrashLoopBackOff 3 27m
etcd-ed-ipv6-1 1/1 Running 0 26m
kube-apiserver-ed-ipv6-1 1/1 Running 0 26m
kube-controller-manager-ed-ipv6-1 1/1 Running 0 26m
kube-proxy-8dw78 1/1 Running 0 27m
kube-scheduler-ed-ipv6-1 1/1 Running 0 26m
$ kubectl logs -n kube-system coredns-576cbf47c7-gk59b
.:53
2018/10/09 10:41:59 [INFO] CoreDNS-1.2.2
2018/10/09 10:41:59 [INFO] linux/amd64, go1.11, eb51e8b
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2018/10/09 10:41:59 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2018/10/09 10:42:05 [FATAL] plugin/loop: Seen "HINFO IN XXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXX." more than twice, loop detected
it looks like it's caused by systemd-resolved:
# grep nameserver /etc/resolv.conf
nameserver 127.0.0.53
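For comparison, the actual upstream servers that systemd-resolved uses are in the file it generates (assuming a stock Ubuntu 18.04 setup); that's the file I point kubelet at below:
# grep nameserver /run/systemd/resolve/resolv.conf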
@bravinash please, confirm that the issue is the same on your setup.
If it's the same, then you can try to point kubelet to the original resolv.conf this way:
$ echo -e '[Service]\nEnvironment="KUBELET_EXTRA_ARGS=--resolv-conf=/run/systemd/resolve/resolv.conf"\n' | sudo tee /etc/systemd/system/kubelet.service.d/99-local.conf
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
and remove coredns pods:
$ kubectl get pods -n kube-system -oname |grep coredns |xargs kubectl delete -n kube-system
Kubelet will start them again with new configuration.
At least this fixed the issue in my setup:
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-node-cpsqz 2/2 Running 0 5m20s
coredns-576cbf47c7-2prpv 1/1 Running 0 18s
coredns-576cbf47c7-xslbx 1/1 Running 0 18s
etcd-ed-ipv6-1 1/1 Running 0 17m
kube-apiserver-ed-ipv6-1 1/1 Running 0 17m
kube-controller-manager-ed-ipv6-1 1/1 Running 0 17m
kube-proxy-8dkqt 1/1 Running 0 17m
kube-scheduler-ed-ipv6-1 1/1 Running 0 17m
hi, @bart0sh
isn't --resolv-conf=/run/systemd/resolve/resolv.conf populated correctly on kubeadm init in /var/lib/kubelet/kubeadm-flags.env?
https://kubernetes.io/docs/setup/independent/kubelet-integration/#the-kubelet-drop-in-file-for-systemd
/remove-priority awaiting-more-evidence
@neolit123 it is:
$ cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni --resolv-conf=/run/systemd/resolve/resolv.conf
Thanks for pointing that out to me. It could be that the issue is different, or that the original resolv.conf also triggers it.
strange, it works fine on my VM - it's Ubuntu 17.10 and also has systemd-resolved.
On my system the original resolv.conf also triggered the same issue. It started to work after copying it to another file, removing one of the 3 nameserver lines from it, and changing the --resolv-conf option in /var/lib/kubelet/kubeadm-flags.env.
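A rough sketch of what that looked like (the /etc/kubelet-resolv.conf path and the 10.0.0.1 nameserver below are only examples, not the actual values from my machine):
$ sudo cp /run/systemd/resolve/resolv.conf /etc/kubelet-resolv.conf
$ sudo sed -i '/nameserver 10\.0\.0\.1/d' /etc/kubelet-resolv.conf
$ sudo sed -i 's#--resolv-conf=[^ "]*#--resolv-conf=/etc/kubelet-resolv.conf#' /var/lib/kubelet/kubeadm-flags.env
$ sudo systemctl restart kubelet
Keep in mind that kubeadm rewrites kubeadm-flags.env on the next init, so a kubelet drop-in (like the 99-local.conf above) is a more durable place for the override.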
Anyway, we need confirmation from @bravinash that the issue is the same.
It doesn't look like a kubeadm issue to me.
/var/lib/kubelet/kubeadm-flags.env is re-written each time kubeadm init runs, btw.
that was my point from earlier.
/kind bug
/sig network
/area ecosystem
This is very strange, I was consistently able to reproduce this issue until now. I redeployed my Ubuntu 18.04 with 1.12.1 (kubelet, kubeadm, kubectl).
Now it is working.
root@k8s-1121:~# cat /etc/resolv.conf
#
#
#
#
nameserver 127.0.0.53
search exu.ericsson.se
root@k8s-1121:~# cat /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS=--cgroup-driver=systemd --network-plugin=cni --resolv-conf=/run/systemd/resolve/resolv.conf
root@k8s-1121:~# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-etcd-r5wrw 1/1 Running 0 6m52s
kube-system calico-kube-controllers-f4dcbf48b-7lvvn 1/1 Running 0 7m7s
kube-system calico-node-qw9kx 2/2 Running 2 7m7s
kube-system coredns-576cbf47c7-wdxv2 1/1 Running 0 7m52s
kube-system coredns-576cbf47c7-wjf2m 1/1 Running 0 7m52s
kube-system etcd-k8s-1121 1/1 Running 0 7m16s
kube-system kube-apiserver-k8s-1121 1/1 Running 0 7m14s
kube-system kube-controller-manager-k8s-1121 1/1 Running 0 7m8s
kube-system kube-proxy-g879x 1/1 Running 0 7m52s
kube-system kube-scheduler-k8s-1121 1/1 Running 0 7m
I will try a few more things to reproduce this again.
I'd propose to close this issue for 2 reasons:
@bravinash you can reopen this issue if you see it again
I will close this issue for now as this is not reproducible anymore.
FWIW (late to the conversation, commenting on a closed issue), in CoreDNS this error is expected when a loop is detected... and the behavior is, as designed, to exit when this happens. K8s detects this as a "crash".
2018/10/09 10:41:59 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2018/10/09 10:42:05 [FATAL] plugin/loop: Seen "HINFO IN XXX.XXX." more than twice, loop detected
it's more of a CoreDNS issue than a kubeadm issue
CoreDNS cannot automatically resolve a badly configured environment. So, the issue lies with whatever/whomever is responsible for configuring things.
FWIW, I get this exact same problem trying to create a kube master with kubeadm init on Ubuntu 18.04.
If it is, indeed, systemd dns getting in the way, then that still needs some kind of work-around...
@jwatte unfortunately we weren't able to reproduce the problem. If you can show us the logs of the coredns pod, I'd be happy to go further with this issue.
@jwatte, Assuming you see the "loop detected" error in the logs, kubeadm should automatically detect the systemd-resolved issue, and set kubelet up with the right resolv.conf file. If it's not, you can try one of the manual work-arounds listed in CoreDNS loop plugin docs.
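For completeness, one of the manual work-arounds in those docs is to stop CoreDNS from forwarding to the host's stub resolver by pointing the Corefile at an explicit upstream instead (the 8.8.8.8 below is only a placeholder for a real upstream DNS server), roughly:
$ kubectl -n kube-system edit configmap coredns
# change "proxy . /etc/resolv.conf" to e.g. "proxy . 8.8.8.8"
# (newer CoreDNS versions use "forward" instead of "proxy")
$ kubectl -n kube-system delete pod -l k8s-app=kube-dns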
@jwatte
> kubeadm should automatically detect the systemd-resolved issue, and set kubelet up with the right resolv.conf file
this is already done
Please, check if it's done in your setup. btw, which kubeadm version do you use?
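For example, something like this should show both (assuming a kubeadm-provisioned node):
$ kubeadm version -o short
$ grep resolv-conf /var/lib/kubelet/kubeadm-flags.env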
I'm using the current released version of today for everything.
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:43:08Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
However, the special thing I do is edit the kubelet start-up commands to use kubenet instead of gcn, because that's our prod configuration and I'd like to stay close.
However, no matter how I messed around with coredns configs and replication, I couldn't get it to not loop. I ended up wiping and setting up with gcn and flannel, and that works as expected.
> I ended up wiping and setting up with gcn and flannel, and that works as expected.
I'm usually using Weave Net in my development setup and it works just fine. When I switched to Calico to reproduce this bug, the coredns pod started crashlooping because of loop detection. Removing one of the name servers from resolv.conf fixed the issue for me.
So, my point is: until we're able to reproduce this and understand how to fix it, I don't think we can go further with it. I'm still under the impression that it's not a kubeadm issue. In my case it was an infrastructure issue.
My take is that it is somewhere in the "documentation" and "diagnosability" section, which could be solved by kubernetes. When I get this error: Then what? What causes it? What are common problems? What can I do to gain further insight into the causes of this problem? Googling the error message and getting this thread as the main reference shows that that need is not yet solved by the kubernetes overall offering.
Separately: Why would two name servers in resolv.conf cause this problem? That seems like a totally legitimate and common configuration, which should be tolerated by Kubernetes components.
> Why would two name servers in resolv.conf cause this problem?
There were 3 name servers in my resolv.conf. Removing one particular server solved the issue. Removing any other didn't. So, I decided that the loop was somehow created by that server.
> My take is that it is somewhere in the "documentation" and "diagnosability" section, which could be solved by kubernetes. When I get this error: Then what? What causes it? What are common problems?
There is this ... a recently added section on troubleshooting in the loop plugin docs. The latest version of CoreDNS points to it in the error message.
I just noticed that our README.md -> webpage.html conversion process is buggy, so some of the formatting is botched up on the coredns.io version. The original is here... https://github.com/coredns/coredns/blob/master/plugin/loop/README.md
also as a reference (if not posted already)
the kubeadm troubleshooting page has a section related to CrashLoopBackoff of CoreDNS caused by SELinux:
https://kubernetes.io/docs/setup/independent/troubleshooting-kubeadm/#coredns-pods-have-crashloopbackoff-or-error-state
kubectl -n kube-system describe pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 114s default-scheduler Successfully assigned kube-system/coredns-5b4dd968b9-mqhvc to cap166
Normal Pulled 49s (x4 over 114s) kubelet, cap166 Container image "k8s.gcr.io/coredns:1.2.2" already present on machine
Normal Created 49s (x4 over 114s) kubelet, cap166 Created container
Normal Started 49s (x4 over 114s) kubelet, cap166 Started container
Warning BackOff 9s (x8 over 100s) kubelet, cap166 Back-off restarting failed container
I shared the solution that has worked for me here: https://stackoverflow.com/a/53414041/1005102
@utkuozdemir
ps auxww | grep kubelet
you might see a line like:
/usr/bin/kubelet ... --resolv-conf=/run/systemd/resolve/resolv.conf
alternatively:
cat /var/lib/kubelet/kubeadm-flags.env
kubeadm writes the --resolv-conf flag in that file each time it runs join and init.
The CoreDNS loop error just happened to me with Ubuntu 18.04/kubeadm when I specified the node's IP like this:
cat > /etc/default/kubelet <<EOF
KUBELET_EXTRA_ARGS=--node-ip=192.168.10.11
EOF
Moving the --node-ip flag to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf solved the loop issue:
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --node-ip=192.168.10.11
(the content of the /etc/default/kubelet file is now "KUBELET_EXTRA_ARGS=")
my coredns pod is in CrashLoopBackOff.
the log is:
E0401 16:54:47.487196 1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-fb8b8dccf-v6wkw.unknownuser.log.ERROR.20190401-165447.1: no such file or directory
I have Ubuntu 18 and Kubernetes v1.14 + flannel; all other pods are running.
@pmehdinejad, That error means that your kubernetes API is not reachable.
> @pmehdinejad, That error means that your kubernetes API is not reachable.
I think that one is from before I set up flannel; the most recent error is this one:
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-fb8b8dccf-v6wkw.unknownuser.log.ERROR.20190401-165447.1: no such file or directory
coredns version?
> coredns version?
k8s.gcr.io/coredns:1.3.1
Try 1.4.0 to see if that behaves any differently. Could be related to the glog/klog issue we fixed in 1.4.0.
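If you want to try it in place, something like this should work (a sketch assuming the default kubeadm coredns Deployment and container name, and pulling the image from Docker Hub):
$ kubectl -n kube-system set image deployment/coredns coredns=coredns/coredns:1.4.0
$ kubectl -n kube-system rollout status deployment/coredns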
> Try 1.4.0 to see if that behaves any differently. Could be related to the glog/klog issue we fixed in 1.4.0.
Seems to be fixed with 1.4
Appreciate the help
> @pmehdinejad, That error means that your kubernetes API is not reachable.
> I think that one is from before I set up flannel; the most recent error is this one:
> log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-fb8b8dccf-v6wkw.unknownuser.log.ERROR.20190401-165447.1: no such file or directory
Still getting this error:
Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
Weird, because all other pods are running, including kube-proxy, kube-apiserver, and flannel.
@pmehdinejad, the error means that CoreDNS cannot reach the Kubernetes API. It may be something wrong with flannel or kube-proxy (which makes the connections possible).
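A couple of quick checks for that (a sketch; curlimages/curl is just an example of an image that has curl, and 10.96.0.1 is the service IP from your error message):
$ kubectl get endpoints kubernetes
$ kubectl run api-check --rm -it --restart=Never --image=curlimages/curl --command -- curl -k -m 5 https://10.96.0.1:443/version
If the second command times out, the problem is in the pod network / kube-proxy layer rather than in CoreDNS itself.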
> @pmehdinejad, the error means that CoreDNS cannot reach the Kubernetes API. It may be something wrong with flannel or kube-proxy (which makes the connections possible).
Yes, I was doing telnet from the host. I tried to deploy a pod and run it from its shell, but most of the images are minimal and without CoreDNS working I cannot install anything in the container.
It's weird because both flannel and proxy are working and all I can see in their logs are these:
Flannel:
I0402 14:24:26.014844 1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I0402 14:24:26.113264 1 iptables.go:155] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0402 14:24:26.115211 1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.1.0/24 -j RETURN
I0402 14:24:26.117448 1 iptables.go:155] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
kube-proxy:
E0402 15:11:19.584081 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
E0402 15:11:20.585880 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: Unauthorized
E0402 15:11:20.586590 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Unauthorized
I'm not kube-proxy expert, but I don't think those error messages are normal for kube-proxy.
I met a similar problem. The coredns pod crashlooped and also caused the dashboard to crashloop. It works after I set the network plugin to flannel and put all of the PCs' hostnames in /etc/hosts.
I have 3 PCs and I configured all three as both master and node. One PC is behind my home gateway, the other two are behind a corporate firewall. All three PCs are connected within a VPN. With the default network setting used by kubespray, the coredns pod works when it's deployed on the PC in my home, and crashes otherwise.
I tried the following:
So, is it a problem with the default network plugin, Calico?