Kubeadm: coredns fails with invalid kube-api endpoint

Created on 20 Nov 2018 · 27 comments · Source: kubernetes/kubeadm

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:51:33Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:54:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:43:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

What happened?

$ kubectl -n kube-system get pods

NAME                                       READY     STATUS    RESTARTS   AGE
coredns-576cbf47c7-2vh8c                   1/1       Running   167        13h
coredns-576cbf47c7-q88fm                   1/1       Running   167        13h
kube-apiserver-kube-apiserver-1            1/1       Running   0          13h
kube-controller-manager-kube-apiserver-1   1/1       Running   2          13h
kube-flannel-ds-amd64-bmvs9                1/1       Running   0          13h
kube-proxy-dkkqs                           1/1       Running   0          13h
kube-scheduler-kube-apiserver-1            1/1       Running   2          13h

$ kubectl -n kube-system logs coredns-576cbf47c7-2vh8c

E1120 16:31:29.672203       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:31:29.672382       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:31:29.673053       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.672931       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.681605       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.682868       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
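A quick way to tell whether this is a service-routing problem rather than a CoreDNS problem is to try the same endpoint from one of the nodes. A minimal sketch, assuming the default service CIDR seen in the logs above (any HTTP response, even 401/403, means the TCP path to the apiserver works):

$ curl -k --connect-timeout 5 https://10.96.0.1:443/version    # run on a node
$ sudo iptables-save -t nat | grep 10.96.0.1                   # do kube-proxy's NAT rules for the ClusterIP exist?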

What you expected to happen?

A healthy coredns deployment after installing the pod network add-on.

How to reproduce it (as minimally and precisely as possible)?

Install Kubernetes using kubeadm following instructions on: https://kubernetes.io/docs/setup/independent/high-availability/
Download flannel pod network add-on from: https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml
Add the following environment variables to kube-flannel.yml:

...
- name: KUBERNETES_SERVICE_HOST
  value: "kube-apiserver.config-service.com"  # DNS name of the external load balancer in front of the kube-apiserver
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
...

Apply modified kube-flannel.yml to kubernetes: kubectl apply -f kube-flannel.yml
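One way to confirm the override actually landed in the running pods is to check the environment inside one of them. A sketch, assuming the pod label and container name from the coreos kube-flannel.yml linked above, and using a pod name from this cluster as an example:

$ kubectl -n kube-system get pods -l app=flannel -o wide
$ kubectl -n kube-system exec kube-flannel-ds-amd64-bmvs9 -c kube-flannel -- env | grep KUBERNETES_SERVICE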

Anything else we need to know?

The external load balancer endpoint is: kube-apiserver.config-service.com
This has been configured as a TCP pass-through for port 6443, which works well for the three master nodes.

$ kubectl get nodes

NAME               STATUS    ROLES     AGE       VERSION
kube-apiserver-1   Ready     master    13h       v1.12.2
kube-apiserver-2   Ready     master    2m        v1.12.2
kube-apiserver-3   Ready     master    14s       v1.12.2
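A quick sanity check of the pass-through, runnable from any machine that can reach the load balancer (any HTTP response, even 401/403, proves the TCP path to an apiserver is working):

$ curl -k https://kube-apiserver.config-service.com:6443/healthz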
area/HA kind/documentation priority/awaiting-more-evidence

Most helpful comment

OK, for those of you who get to this point and are frustrated: TRUST ME when I tell you that it is easier to destroy the etcd cluster and start from scratch than to try to figure out this network issue. After three days I got tired of debugging and troubleshooting and just decided to start over, which I had already done a few times with the masters and the nodes; however, I had left the etcd cluster intact.

I decided to break down and destroy EVERYTHING, including the etcd cluster, and guess what? Now I have a fully working cluster:

$ kubectl -n kube-system get pods

NAME                                       READY     STATUS    RESTARTS   AGE
coredns-576cbf47c7-5tlwd                   1/1       Running   0          37m
coredns-576cbf47c7-vsj2z                   1/1       Running   0          37m
kube-apiserver-kube-apiserver-1            1/1       Running   0          37m
kube-apiserver-kube-apiserver-2            1/1       Running   0          15m
kube-apiserver-kube-apiserver-3            1/1       Running   0          14m
kube-controller-manager-kube-apiserver-1   1/1       Running   0          37m
kube-controller-manager-kube-apiserver-2   1/1       Running   0          15m
kube-controller-manager-kube-apiserver-3   1/1       Running   0          14m
kube-flannel-ds-amd64-2dtln                1/1       Running   0          8m
kube-flannel-ds-amd64-75bgw                1/1       Running   0          10m
kube-flannel-ds-amd64-cpcjv                1/1       Running   0          35m
kube-flannel-ds-amd64-dlwww                1/1       Running   0          8m
kube-flannel-ds-amd64-dwkjb                1/1       Running   1          15m
kube-flannel-ds-amd64-msx9l                1/1       Running   0          14m
kube-flannel-ds-amd64-smhfj                1/1       Running   0          9m
kube-proxy-5rdk7                           1/1       Running   0          10m
kube-proxy-8gfd7                           1/1       Running   0          9m
kube-proxy-9kfxv                           1/1       Running   0          37m
kube-proxy-c22dl                           1/1       Running   0          8m
kube-proxy-gkvz5                           1/1       Running   0          14m
kube-proxy-pxlrp                           1/1       Running   0          15m
kube-proxy-vmp5h                           1/1       Running   0          8m
kube-scheduler-kube-apiserver-1            1/1       Running   0          37m
kube-scheduler-kube-apiserver-2            1/1       Running   0          15m
kube-scheduler-kube-apiserver-3            1/1       Running   0          14m

It would have been a lot easier to start from scratch and avoid ALL these hours of troubleshooting this network problem. I just wish there were a script/method to reset the etcd DB so we did not have to rebuild from scratch. That would be an awesome tool, something like: kubeadm reset --etcd=https://etcd-cluster.control-service.com:2379

@neolit123 What do you think about an etcd reset option for an external etcd cluster?

haha! I know why the problem occurred!

kube-system   calico-etcd-h8h46                          1/1     Running   1          17h
kube-system   calico-kube-controllers-85cf9c8b79-q78b2   1/1     Running   2          17h
kube-system   calico-node-5pvsw                          2/2     Running   2          17h
kube-system   calico-node-7xvn9                          2/2     Running   2          17h
kube-system   calico-node-85j5x                          2/2     Running   3          17h
kube-system   coredns-576cbf47c7-cw8lr                   1/1     Running   1          17h
kube-system   coredns-576cbf47c7-hvt7z                   1/1     Running   1          17h
kube-system   etcd-k8s-node131                           1/1     Running   1          17h
kube-system   kube-apiserver-k8s-node131                 1/1     Running   1          17h
kube-system   kube-controller-manager-k8s-node131        1/1     Running   1          17h
kube-system   kube-proxy-458vk                           1/1     Running   1          17h
kube-system   kube-proxy-n852v                           1/1     Running   1          17h
kube-system   kube-proxy-p5d5g                           1/1     Running   1          17h
kube-system   kube-scheduler-k8s-node131                 1/1     Running   1          17h
kube-system   traefik-ingress-controller-fkhwk           1/1     Running   0          18m
kube-system   traefik-ingress-controller-kxr6v           1/1     Running   0          18m

Now the cluster is healthy.
Reason: when deploying a cluster, the steps are:

1. kubeadm init
2. deploy the network plugin
3. kubeadm join

I mixed up steps 2 and 3! If anyone has the same problem, you can try this.

All 27 comments

Looking at issue #1264

$ kubectl get svc

NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   2d

Ah!

$ kubectl delete svc/kubernetes
service "kubernetes" deleted

$ kubectl get svc
No resources found.

$ kubectl -n kube-system get pods

NAME                                       READY     STATUS             RESTARTS   AGE
coredns-576cbf47c7-2vh8c                   1/1       Running            189        15h
coredns-576cbf47c7-q88fm                   0/1       CrashLoopBackOff   189        15h
kube-apiserver-kube-apiserver-1            1/1       Running            0          15h
...

Arrggghhhh... such a simple solution. I bet I did not even have to use the modifications I made to the flannel template but I am not testing that today.

Can this go into the documentation for an external HA configuration?

Thank you for your support.

Can this go into the documentation for an external HA configuration? Or at least as part of the "Troubleshooting kubeadm" page: https://kubernetes.io/docs/setup/independent/troubleshooting-kubeadm

Looking at issue #1264

@dannymk
hm, it's the same issue as the one we are currently in.

Sorry, yes, that was the one that fixed the initial problem. I celebrated too soon; I was kind of excited to see it go into service, only for it to fail its liveness probe. I think I saw an issue about this:

$ kubectl -n kube-system describe pod/coredns-576cbf47c7-4gdnc

...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 20 Nov 2018 14:28:58 -0500
      Finished:     Tue, 20 Nov 2018 14:31:17 -0500
    Ready:          False
    Restart Count:  8
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-kb77k (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-kb77k:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-kb77k
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                       Message
  ----     ------     ----               ----                       -------
  Normal   Scheduled  23m                default-scheduler          Successfully assigned kube-system/coredns-576cbf47c7-4gdnc to kube-apiserver-3
  Normal   Pulled     19m (x3 over 23m)  kubelet, kube-apiserver-3  Container image "k8s.gcr.io/coredns:1.2.2" already present on machine
  Normal   Created    19m (x3 over 23m)  kubelet, kube-apiserver-3  Created container
  Normal   Killing    19m (x2 over 21m)  kubelet, kube-apiserver-3  Killing container with id docker://coredns:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Started    19m (x3 over 23m)  kubelet, kube-apiserver-3  Started container
  Warning  Unhealthy  3m (x36 over 22m)  kubelet, kube-apiserver-3  Liveness probe failed: HTTP probe failed with statuscode: 503
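The probe shown above is a plain HTTP GET on port 8080, so it can be reproduced by hand from a machine on the pod network. A sketch, using the pod being described here:

$ POD_IP=$(kubectl -n kube-system get pod coredns-576cbf47c7-4gdnc -o jsonpath='{.status.podIP}')
$ curl -s "http://${POD_IP}:8080/health"    # the liveness endpoint served by CoreDNS's health plugin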

I also notice:

$ kubectl -n kube-system logs coredns-576cbf47c7-4gdnc

...
E1120 19:30:30.998763       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
2018/11/20 19:30:47 [INFO] SIGTERM: Shutting down servers then terminating
E1120 19:31:01.997492       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 19:31:01.998285       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 19:31:01.999754       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout

Which may have to do with the svc being created in the default namespace:

$ kubectl get svc

NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   28m

That service is recreated every time I delete it so...

Can https://github.com/apprenda/kismatic/issues/408 have something to do with this?

not sure, i asked if someone has seen something like that on slack and they said no.
we are currently busy with some 1.13 items.

You know, maybe it is a good idea to post my initialization file:

apiVersion: kubeadm.k8s.io/v1alpha3
kind: ClusterConfiguration
kubernetesVersion: stable
apiServerCertSANs:
- "kube-apiserver.config-service.com"
controlPlaneEndpoint: "kube-apiserver.config-service.com:6443"
etcd:
  external:
    endpoints:
    - http://etcd-cluster.config-service.com:2379
networking:
  podSubnet: "10.253.0.0/16"
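For reference, a file like that is consumed on the first control-plane node roughly as follows (the filename is an example):

$ kubeadm init --config kubeadm-config.yaml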

Hmmm... I have a nagging suspicion that if I modified the kubernetes service to point at the external load balancer instead of a ClusterIP, things would work.

@neolit123 What creates the kubernetes service in the default namespace?

NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   28m

What creates the kubernetes service in the default namespace?

a controller manager.
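Whichever component reconciles it, the object itself can be inspected directly with standard kubectl; on a kubeadm cluster the endpoints are normally the apiserver advertise addresses on port 6443:

$ kubectl get svc kubernetes -o wide
$ kubectl get endpoints kubernetes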

It seems that EVERYTHING is working except the networking between pods. I suspect that the flannel networking only works because I set these env variables:

...
- name: KUBERNETES_SERVICE_HOST
  value: "kube-apiserver.config-service.com"  # The external load balancer
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
...

in the flannel configuration.

It has been suggested that I make sure IP forwarding is enabled at the KVM level on all nodes, which it is:

$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1

However I have not checked ip forwarding on the actual hosts. Could this have something to do with it?
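Two other host-level settings worth checking on every node while chasing pod-to-service traffic, since both flannel and kube-proxy depend on them:

$ sysctl net.bridge.bridge-nf-call-iptables    # should be 1 so bridged pod traffic is seen by iptables
$ iptables -S FORWARD | head -n 3              # shows the default FORWARD policy (Docker 17.06+ sets it to DROP)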

I am hitting the same error!!
Can anyone resolve it?

Hmmm... I noticed this in all the kube-proxy pods:

W1122 11:15:14.698440       1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
W1122 11:15:14.719388       1 server.go:604] Failed to retrieve node info: Unauthorized
I1122 11:15:14.719425       1 server_others.go:148] Using iptables Proxier.
W1122 11:15:14.719577       1 proxier.go:312] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
I1122 11:15:14.720425       1 server_others.go:178] Tearing down inactive rules.
I1122 11:15:15.040385       1 server.go:447] Version: v1.12.2
I1122 11:15:15.053247       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1122 11:15:15.053647       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1122 11:15:15.053804       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1122 11:15:15.053972       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1122 11:15:15.054554       1 config.go:102] Starting endpoints config controller
I1122 11:15:15.054571       1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1122 11:15:15.054621       1 config.go:202] Starting service config controller
I1122 11:15:15.054640       1 controller_utils.go:1027] Waiting for caches to sync for service config controller
E1122 11:15:15.059301       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:15.059495       1 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-node-3.15696e04fcf91c76", GenerateName:"", 
Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), Dele
tionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]stri
ng(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"kube-node-3", UID:"kube-node-3", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message
:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"kube-node-3"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbef5c1d0c33b3e76, ext:420697202, loc:(*time.Location)(0x20dcf40)}
}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbef5c1d0c33b3e76, ext:420697202, loc:(*time.Location)(0x20dcf40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*ti
me.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Unauthorized' (will not retry!)
E1122 11:15:15.059631       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:16.063451       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:16.064363       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:17.067369       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:17.067578       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:18.070488       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:18.071010       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:19.073997       1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
# Generated by iptables-save v1.6.1 on Thu Nov 22 18:13:45 2018
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [17:1020]
:POSTROUTING ACCEPT [17:1020]
:DOCKER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-POSTROUTING - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 10.253.0.0/16 -d 10.253.0.0/16 -j RETURN
-A POSTROUTING -s 10.253.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 10.253.0.0/16 -d 10.253.0.0/24 -j RETURN
-A POSTROUTING ! -s 10.253.0.0/16 -d 10.253.0.0/16 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
COMMIT
# Completed on Thu Nov 22 18:13:45 2018
# Generated by iptables-save v1.6.1 on Thu Nov 22 18:13:45 2018
*filter
:INPUT ACCEPT [6234102:1969097221]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [7099913:2105396669]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
:KUBE-FIREWALL - [0:0]
-A INPUT -j KUBE-FIREWALL
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -s 10.253.0.0/16 -j ACCEPT
-A FORWARD -d 10.253.0.0/16 -j ACCEPT
-A OUTPUT -j KUBE-FIREWALL
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
COMMIT
# Completed on Thu Nov 22 18:13:45 2018
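Those Unauthorized errors mean the apiserver is rejecting the credentials kube-proxy presents, which is plausible here given that the old etcd data was kept while the control plane was rebuilt. A few checks, assuming kubeadm's default object names and labels:

$ kubectl -n kube-system get cm kube-proxy -o yaml | grep 'server:'    # which endpoint kube-proxy is configured to use
$ kubectl -n kube-system get sa kube-proxy                             # its ServiceAccount should still exist
$ kubectl -n kube-system delete pod -l k8s-app=kube-proxy              # recreate the pods so they pick up fresh tokens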

OK, for those of you who get to this point and are frustrated: TRUST ME when I tell you that it is easier to destroy the etcd cluster and start from scratch than to try to figure out this network issue. After three days I got tired of debugging and troubleshooting and just decided to start over, which I had already done a few times with the masters and the nodes; however, I had left the etcd cluster intact.

I decided to break down and destroy EVERYTHING, including the etcd cluster, and guess what? Now I have a fully working cluster:

$ kubectl -n kube-system get pods

NAME                                       READY     STATUS    RESTARTS   AGE
coredns-576cbf47c7-5tlwd                   1/1       Running   0          37m
coredns-576cbf47c7-vsj2z                   1/1       Running   0          37m
kube-apiserver-kube-apiserver-1            1/1       Running   0          37m
kube-apiserver-kube-apiserver-2            1/1       Running   0          15m
kube-apiserver-kube-apiserver-3            1/1       Running   0          14m
kube-controller-manager-kube-apiserver-1   1/1       Running   0          37m
kube-controller-manager-kube-apiserver-2   1/1       Running   0          15m
kube-controller-manager-kube-apiserver-3   1/1       Running   0          14m
kube-flannel-ds-amd64-2dtln                1/1       Running   0          8m
kube-flannel-ds-amd64-75bgw                1/1       Running   0          10m
kube-flannel-ds-amd64-cpcjv                1/1       Running   0          35m
kube-flannel-ds-amd64-dlwww                1/1       Running   0          8m
kube-flannel-ds-amd64-dwkjb                1/1       Running   1          15m
kube-flannel-ds-amd64-msx9l                1/1       Running   0          14m
kube-flannel-ds-amd64-smhfj                1/1       Running   0          9m
kube-proxy-5rdk7                           1/1       Running   0          10m
kube-proxy-8gfd7                           1/1       Running   0          9m
kube-proxy-9kfxv                           1/1       Running   0          37m
kube-proxy-c22dl                           1/1       Running   0          8m
kube-proxy-gkvz5                           1/1       Running   0          14m
kube-proxy-pxlrp                           1/1       Running   0          15m
kube-proxy-vmp5h                           1/1       Running   0          8m
kube-scheduler-kube-apiserver-1            1/1       Running   0          37m
kube-scheduler-kube-apiserver-2            1/1       Running   0          15m
kube-scheduler-kube-apiserver-3            1/1       Running   0          14m

It would have been a lot easier to start from scratch and avoid ALL these hours of troubleshooting this network problem. I just wish there were a script/method to reset the etcd DB so we did not have to rebuild from scratch. That would be an awesome tool, something like: kubeadm reset --etcd=https://etcd-cluster.control-service.com:2379

@neolit123 What do you think about an etcd reset option for an external etcd cluster?

sorry for your troubles. i think your database got into a corrupted state for some reason and keeping it around was probably not a good idea. also this is hard to debug...

there are already ways to reset etcd (but on local nodes):
https://groups.google.com/forum/#!topic/coreos-user/qcwLNqou4qQ

also sig-cluster-lifecycle (the maintainers of kubeadm)
are working on a tool called etcdadm that will most likely have this functionality.
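Until such a tool exists, an external etcd keyspace can be wiped directly with etcdctl. A destructive sketch, using the endpoint from the config earlier in this thread (Kubernetes stores its data under the /registry prefix by default; add --cacert/--cert/--key if your etcd uses TLS):

$ ETCDCTL_API=3 etcdctl --endpoints=http://etcd-cluster.config-service.com:2379 del /registry --prefix    # DESTRUCTIVE: deletes all cluster state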

haha! I know why the problem occurred!

kube-system   calico-etcd-h8h46                          1/1     Running   1          17h
kube-system   calico-kube-controllers-85cf9c8b79-q78b2   1/1     Running   2          17h
kube-system   calico-node-5pvsw                          2/2     Running   2          17h
kube-system   calico-node-7xvn9                          2/2     Running   2          17h
kube-system   calico-node-85j5x                          2/2     Running   3          17h
kube-system   coredns-576cbf47c7-cw8lr                   1/1     Running   1          17h
kube-system   coredns-576cbf47c7-hvt7z                   1/1     Running   1          17h
kube-system   etcd-k8s-node131                           1/1     Running   1          17h
kube-system   kube-apiserver-k8s-node131                 1/1     Running   1          17h
kube-system   kube-controller-manager-k8s-node131        1/1     Running   1          17h
kube-system   kube-proxy-458vk                           1/1     Running   1          17h
kube-system   kube-proxy-n852v                           1/1     Running   1          17h
kube-system   kube-proxy-p5d5g                           1/1     Running   1          17h
kube-system   kube-scheduler-k8s-node131                 1/1     Running   1          17h
kube-system   traefik-ingress-controller-fkhwk           1/1     Running   0          18m
kube-system   traefik-ingress-controller-kxr6v           1/1     Running   0          18m

Now the cluster is healthy.
Reason: when deploying a cluster, the steps are:

1. kubeadm init
2. deploy the network plugin
3. kubeadm join

I mixed up steps 2 and 3! If anyone has the same problem, you can try this (see the sketch below).
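A sketch of that ordering, using the flannel manifest from earlier in this thread (the pod CIDR is flannel's default, and the join parameters are placeholders printed by kubeadm init itself):

# 1. On the first control-plane node:
$ kubeadm init --pod-network-cidr=10.244.0.0/16
# 2. Still on that node, deploy the network plugin:
$ kubectl apply -f kube-flannel.yml
# 3. Only then, on each additional node, run the join command printed by kubeadm init:
$ kubeadm join <control-plane-endpoint>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>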

/milestone clear

I did the same, but it is still not solved. The odd thing is that I have two clusters with the same configuration, spun up by the same ansible playbook with kubeadm, but one works and one doesn't.

Having the same problem here ... on 1.15.2

@matthewygf same. I'm running an HA etcd setup for kubeadm. The first master comes up fine, but as I add masters, my node networking pods (calico-node) don't come up and kube-proxy starts spitting out the "can't list endpoints" errors. I feel like I am missing some configuration somewhere.

@AlexMorreale we ended up bouncing the coredns pods to another node and it worked.
One other thing we changed (not sure whether it made an impact): we found out our apiserver pod was on hostNetwork, but its dnsPolicy was not ClusterFirstWithHostNet.
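Both of those fields can be checked on the apiserver static pod with a jsonpath query (the pod name is an example from this thread; hostNetwork and dnsPolicy are standard Pod spec fields):

$ kubectl -n kube-system get pod kube-apiserver-kube-apiserver-1 -o jsonpath='{.spec.hostNetwork}{" "}{.spec.dnsPolicy}{"\n"}'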

@matthewygf thanks for pinging me back. I'll try some of that stuff. I think I have a lead otherwise, but that is very helpful. Thanks!!!!!

I have checked my env and figured it out. It is because the node that the unavailable coredns pod runs on doesn't have a route to 10.233.0.1 (the apiserver service IP in my env). It worked fine when I added a static route on that node. But I think the root cause is that the ipvs module is not working properly. Hope this helps others.

I have figured it out. It is because kube-proxy (in ipvs mode) requests data from the apiserver at 127.0.0.1:6443, but the node is not a master node, so nothing listens on port 6443 there. I fixed it by replacing 127.0.0.1:6443 with 10.233.0.1 (the apiserver service IP in my env) and rescheduling the kube-proxy pod.
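On a kubeadm-managed cluster the equivalent change is made in the kube-proxy ConfigMap's embedded kubeconfig, after which the pods need to be recreated. A sketch, assuming kubeadm's default names:

$ kubectl -n kube-system edit cm kube-proxy               # point clusters[].cluster.server in kubeconfig.conf at the API endpoint instead of 127.0.0.1
$ kubectl -n kube-system delete pod -l k8s-app=kube-proxy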

@pstrive I ran into a similar issue as you. I set up a Kubernetes cluster with kubespray
in an airgapped environment, and each machine did not have a default route.

I had these statically configured IP addresses:

NAME                 STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
craig-airgapped1-1   Ready    master   12d   v1.15.5   10.0.0.129    <none>        CentOS Linux 7 (Core)   3.10.0-957.12.2.el7.x86_64   docker://18.9.7
craig-airgapped1-2   Ready    <none>   12d   v1.15.5   10.0.0.130    <none>        CentOS Linux 7 (Core)   3.10.0-957.12.2.el7.x86_64   docker://18.9.7
craig-airgapped1-3   Ready    <none>   12d   v1.15.5   10.0.0.131    <none>        CentOS Linux 7 (Core)   3.10.0-957.12.2.el7.x86_64   docker://18.9.7
craig-airgapped1-4   Ready    <none>   12d   v1.15.5   10.0.0.132    <none>        CentOS Linux 7 (Core)   3.10.0-957.12.2.el7.x86_64   docker://18.9.7

The CoreDNS pod was unable to access 10.233.0.1, which is the kubernetes API server service in my cluster. I managed to work around this problem by logging into each of my worker nodes and running:

ip route add default via 10.0.0.129

So at least I had a way to route 10.233.0.1 back to the kubernetes master.

Maybe it is a bug in kubespray that it didn't set up a route for 10.233.0.1?

On 10.0.0.131, this is what my routing table looked like before my change:

10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.131 metric 100
10.233.69.0/24 via 10.0.0.129 dev tunl0 proto bird onlink
10.233.84.0/24 via 10.0.0.132 dev tunl0 proto bird onlink
10.233.85.0/24 via 10.0.0.130 dev tunl0 proto bird onlink
blackhole 10.233.90.0/24 proto bird
10.233.90.12 dev caliecb94626e50 scope link
10.233.90.15 dev cali455ddb145b1 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

and this is what it looked like after:

default via 10.0.0.129 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.131 metric 100
10.233.69.0/24 via 10.0.0.129 dev tunl0 proto bird onlink
10.233.84.0/24 via 10.0.0.132 dev tunl0 proto bird onlink
10.233.85.0/24 via 10.0.0.130 dev tunl0 proto bird onlink
blackhole 10.233.90.0/24 proto bird
10.233.90.12 dev caliecb94626e50 scope link
10.233.90.15 dev cali455ddb145b1 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
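With the default route in place, each node can confirm that the service address is now routable (the address is the apiserver service IP mentioned above):

$ ip route get 10.233.0.1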