CoreDNS is crashing due to loop detection

Created on 12 Dec 2018 · 6 Comments · Source: coredns/coredns

Hello

We are facing a very strange issue with CoreDNS failing in Kubernetes:
the coredns pods always end up in the `CrashLoopBackOff` state.

coredns version: 1.2.2
kubernetes version: v1.12.3
docker version: 18.06.1-ce
OS: CentOS Linux release 7.5.1804 (Core)
CNI: weave 2.5.0

When we bootstrap Kubernetes with kubeadm, everything works fine: the coredns pods are up and running and kube-dns works as expected. Once we reboot the server, the coredns pods start crashing with the following message in the logs:

```
[root@qa065 ~]# kubectl logs coredns-576cbf47c7-6vxd4 -n kube-system
.:53
2018/12/12 13:33:16 [INFO] CoreDNS-1.2.2
2018/12/12 13:33:16 [INFO] linux/amd64, go1.11, eb51e8b
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2018/12/12 13:33:16 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2018/12/12 13:33:22 [FATAL] plugin/loop: Seen "HINFO IN 7087784449798295848.7359092265978106814." more than twice, loop detected
```

We determined that the loop plugin of coredns detected a loop and therefore exited, but we are not able to find where this loop is. In other words, no DNS loop is defined anywhere on the host system:
 - we are not using `systemd-resolved` at all
 - our kubelet service is using the original `/etc/resolv.conf` file
 - our `/etc/resolv.conf` file contains no loopback entries (`localhost`, `127.0.0.0/8`, `::1`)
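
The loopback check from the list above can also be scripted, which helps when several nodes need to be compared. This is only a minimal Python sketch, assuming Python 3 is available on the node; the function name `loopback_nameservers` is made up for this illustration and is not part of CoreDNS or Kubernetes.

```python
import ipaddress


def loopback_nameservers(resolv_conf_text):
    """Return the nameserver entries that point at a loopback address."""
    hits = []
    for line in resolv_conf_text.splitlines():
        # Drop comments (resolv.conf allows '#' and ';') and whitespace.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) < 2 or fields[0] != "nameserver":
            continue
        # Strip an IPv6 zone id such as "%eth0" before parsing.
        host = fields[1].split("%", 1)[0]
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            continue  # not a literal IP address, ignore it
        if addr.is_loopback:
            hits.append(fields[1])
    return hits
```

Calling `loopback_nameservers(open("/etc/resolv.conf").read())` on the node should return an empty list if the file is as clean as described above.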

Our coredns ConfigMap is as follows:
```
[root@qa065 ~]# kubectl describe cm coredns -n kube-system
Name:         coredns
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
Corefile:
----
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       upstream
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

Events:  <none>
```

When we remove `loop` from the coredns ConfigMap, the coredns pods start and run without problems, but inter-pod communication stops working (the kube-dns service loses its endpoints and cannot resolve service names to IPs). For example, we have 2 pods that need to communicate with each other (Prometheus server + Grafana), and this no longer works after `loop` is removed from the ConfigMap.

We have also tried to:

  • exclude localhost (to be safe) from the upstream DNS servers:
    proxy . /etc/resolv.conf {
         exclude 127.0.0.0/8
    }

  • use the DNS server IP instead of the /etc/resolv.conf file in the coredns ConfigMap:
    proxy . <DNS_IP_ADDR>

  • check the kubelet configuration:
    [root@qa078 network-scripts]# cat /var/lib/kubelet/config.yaml | grep resolv
    resolvConf: /etc/resolv.conf

Any suggestions/ideas would be really appreciated.

plugin/loop · question · works as intended

All 6 comments

There was a bug where loop detected a loop when its upstream wasn't working (essentially seeing its own retries). Can you upgrade to 1.2.6, which at least has that bug fixed?
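
To make the "seeing its own retries" remark concrete, here is a toy Python model of a probe counter. It is not the plugin's code: the class name `ProbeCounter` and the `limit` parameter are assumptions for illustration, and only the "more than twice" threshold and the probe name are taken from the log in the report.

```python
class ProbeCounter:
    """Toy model of a self-probe loop detector (not CoreDNS code)."""

    def __init__(self, probe_name, limit=2):
        self.probe_name = probe_name
        self.limit = limit
        self.seen = 0

    def observe(self, qname):
        """Record a query; return True once the probe was seen more than `limit` times."""
        if qname == self.probe_name:
            self.seen += 1
        return self.seen > self.limit


probe = "7087784449798295848.7359092265978106814."
counter = ProbeCounter(probe)
# The counter cannot tell why the same name shows up again: a real
# forwarding loop and repeated retries of the probe look identical to it.
results = [counter.observe(probe) for _ in range(3)]
print(results)  # [False, False, True]
```

In this simplified model, a detector that counts its own retries would trip on the third sighting even though nothing is forwarding queries back to it.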

@miekg I am working with @JaroVojtek. Could you please advise how to override the coredns version requested by kubeadm?

@miekg we have tried changing the version to 1.2.6 and all pods started successfully, but after a server reboot they are crashing again:

NAME                             READY   STATUS             RESTARTS   AGE
coredns-576cbf47c7-lbpmj         0/1     CrashLoopBackOff   17         102m
coredns-576cbf47c7-tkjhn         0/1     CrashLoopBackOff   17         102m
etcd-qacom                       1/1     Running            1          101m
kube-apiserver-qacom             1/1     Running            1          101m
kube-controller-manager-qacom    1/1     Running            1          101m
kube-proxy-hjq6b                 1/1     Running            1          102m
kube-scheduler-qacom             1/1     Running            1          101m
tiller-deploy-694dc94c65-v4lb7   1/1     Running            1          102m
weave-net-8mdrl                  2/2     Running            4          102m

We figured out an interesting fact: when we run a curl pod inside the kube-system namespace, we cannot get a response from the nameserver. This is strange, because we can easily make an HTTP request via curl to the service running on localhost.

[root@qa ~]# kubectl run curl-pod -n kube-system --image=radial/busyboxplus:curl -i --tty --rm
kubectl run --generator=deployment/apps.v1beta1 is DEPRECATED and will be removed in a future version. Use kubectl create instead.
If you don't see a command prompt, try pressing enter.
[ root@curl-pod-7d7554d866-gvrw7:/ ]$ 
[ root@curl-pod-7d7554d866-gvrw7:/ ]$ 
[ root@curl-pod-7d7554d866-gvrw7:/ ]$ 
[ root@curl-pod-7d7554d866-gvrw7:/ ]$ telnet 10.x.y.10 53
^C
[ root@curl-pod-7d7554d866-gvrw7:/ ]$ telnet 8.8.8.8 53
^C

HTTP request:

[ root@curl-pod-7d7554d866-gvrw7:/ ]$ curl 10.y.x.z:8077/pathx
{"timestamp":"2018-12-12T17:53:57.020+0000","status":401,"error":"Unauthorized","message":"Full authentication is required to access this resource","path":"/pathx"}
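
The telnet checks above can be repeated in a scripted way. This is a minimal Python sketch, assuming Python 3 is available wherever it runs (the busyboxplus image used above may not ship it); `tcp_port_open` is a hypothetical helper name. Like telnet, it only tests TCP, while most DNS traffic uses UDP on port 53, so a failure here narrows the problem down but does not settle it.

```python
import socket


def tcp_port_open(host, port=53, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

For example, `tcp_port_open("8.8.8.8")` returning False from inside a pod while returning True on the host would point at pod networking rather than at CoreDNS itself.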

Unrelated to providing a solution, but we should log who the hell is asking this, i.e. log the source IP.

I am not sure what you meant by your last comment; I am just trying to provide more information for the investigation of this issue.

Filed #2395 to detail what I meant.
