BUG REPORT
kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:51:33Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Environment:
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
OpenStack
OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Kernel (e.g. uname -a):
Linux kube-apiserver-1 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Others:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
coredns-576cbf47c7-2vh8c 1/1 Running 167 13h
coredns-576cbf47c7-q88fm 1/1 Running 167 13h
kube-apiserver-kube-apiserver-1 1/1 Running 0 13h
kube-controller-manager-kube-apiserver-1 1/1 Running 2 13h
kube-flannel-ds-amd64-bmvs9 1/1 Running 0 13h
kube-proxy-dkkqs 1/1 Running 0 13h
kube-scheduler-kube-apiserver-1 1/1 Running 2 13h
$ kubectl -n kube-system logs coredns-576cbf47c7-2vh8c
E1120 16:31:29.672203 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:31:29.672382 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:31:29.673053 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.672931 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.681605 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.682868 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
Expected: healthy CoreDNS after installation of the pod network add-on.
Install Kubernetes using kubeadm following instructions on: https://kubernetes.io/docs/setup/independent/high-availability/
Download flannel pod network add-on from: https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml
Add the following environment variables to kube-flannel.yml:
...
- name: KUBERNETES_SERVICE_HOST
  value: "kube-apiserver.config-service.com"   # DNS name of the external load balancer in front of the kube-apiservers
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
...
Apply modified kube-flannel.yml to kubernetes: kubectl apply -f kube-flannel.yml
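After the apply, a quick sanity check that the overrides actually landed in the running containers (the pod name is a placeholder, the app=flannel label comes from the upstream manifest, and this assumes the flannel image ships a shell with env):
$ kubectl -n kube-system get pods -l app=flannel
$ kubectl -n kube-system exec <flannel-pod-name> -- env | grep KUBERNETES_SERVICE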
The external load balancer endpoint is: kube-apiserver.config-service.com
This has been configured as a TCP pass-through for port 6443, which works well for the three master nodes.
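For what it's worth, the pass-through can be checked from any node by hitting the apiserver's health endpoint through the load balancer (-k skips certificate verification; even a 401/403 response proves that TCP and TLS reach an apiserver):
$ curl -k https://kube-apiserver.config-service.com:6443/healthz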
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube-apiserver-1 Ready master 13h v1.12.2
kube-apiserver-2 Ready master 2m v1.12.2
kube-apiserver-3 Ready master 14s v1.12.2
Looking at issue #1264
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 2d
Ah!
$ kubectl delete svc/kubernetes
service "kubernetes" deleted
$ kubectl get svc
No resources found.
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
coredns-576cbf47c7-2vh8c 1/1 Running 189 15h
coredns-576cbf47c7-q88fm 0/1 CrashLoopBackOff 189 15h
kube-apiserver-kube-apiserver-1 1/1 Running 0 15h
...
Arrggghhhh... such a simple solution. I bet I did not even need the modifications I made to the flannel template, but I am not testing that today.
Can this go into the documentation for an external HA configuration?
Thank you for your support.
Or at least as part of the "Troubleshooting kubeadm" page: https://kubernetes.io/docs/setup/independent/troubleshooting-kubeadm
Looking at issue #1264
@dannymk
hm, it's the same issue as the issue we are currently typing in.
https://github.com/kubernetes/kubeadm/issues/193#issuecomment-434660423
is this the one?
Sorry, yes, that was the one that fixed the initial problem. I celebrated too soon: I was excited to see it go into service, only to have it fail its liveness probe. I think I saw an issue on this:
$ kubectl -n kube-system describe pod/coredns-576cbf47c7-4gdnc
...
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Tue, 20 Nov 2018 14:28:58 -0500
Finished: Tue, 20 Nov 2018 14:31:17 -0500
Ready: False
Restart Count: 8
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-kb77k (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-kb77k:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-kb77k
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 23m default-scheduler Successfully assigned kube-system/coredns-576cbf47c7-4gdnc to kube-apiserver-3
Normal Pulled 19m (x3 over 23m) kubelet, kube-apiserver-3 Container image "k8s.gcr.io/coredns:1.2.2" already present on machine
Normal Created 19m (x3 over 23m) kubelet, kube-apiserver-3 Created container
Normal Killing 19m (x2 over 21m) kubelet, kube-apiserver-3 Killing container with id docker://coredns:Container failed liveness probe.. Container will be killed and recreated.
Normal Started 19m (x3 over 23m) kubelet, kube-apiserver-3 Started container
Warning Unhealthy 3m (x36 over 22m) kubelet, kube-apiserver-3 Liveness probe failed: HTTP probe failed with statuscode: 503
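One way to narrow this down is to reproduce the kubelet's probe by hand from the node running the pod (the pod IP is a placeholder; get it from kubectl -n kube-system get pods -o wide):
$ curl -v http://<coredns-pod-ip>:8080/health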
I also notice:
$ kubectl -n kube-system logs coredns-576cbf47c7-4gdnc
...
E1120 19:30:30.998763 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
2018/11/20 19:30:47 [INFO] SIGTERM: Shutting down servers then terminating
E1120 19:31:01.997492 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 19:31:01.998285 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 19:31:01.999754 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
Which may have to do with the svc being created in the default namespace:
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 28m
That service is recreated every time I delete it so...
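The ClusterIP is only a virtual IP, though; the actual backends should be the advertise addresses of the API servers, which can be checked with:
$ kubectl get endpoints kubernetes
If that endpoints list is empty or points at the wrong addresses, the i/o timeouts to 10.96.0.1 would be expected.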
Can https://github.com/apprenda/kismatic/issues/408 have something to do with this?
not sure, i asked if anyone has seen something like that on slack and they said no.
we are currently busy with some 1.13 items.
You know, maybe it is a good idea to post my initialization file:
apiVersion: kubeadm.k8s.io/v1alpha3
kind: ClusterConfiguration
kubernetesVersion: stable
apiServerCertSANs:
- "kube-apiserver.config-service.com"
controlPlaneEndpoint: "kube-apiserver.config-service.com:6443"
etcd:
  external:
    endpoints:
    - http://etcd-cluster.config-service.com:2379
networking:
  podSubnet: "10.253.0.0/16"
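Assuming etcdctl v3 is available on a control-plane host, the external etcd endpoint from that config can be sanity-checked with:
$ ETCDCTL_API=3 etcdctl --endpoints=http://etcd-cluster.config-service.com:2379 endpoint health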
Hmmm... I have a nagging suspicion that if I modify the kubernetes service to reflect the external load balancer instead of a ClusterIP, things would work.
@neolit123 What creates the kubernetes service in the default namespace?
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 28m
What creates the kubernetes service in the default namespace?
a controller manager.
It seems that EVERYTHING is working except the networking between pods. I suspect that the flannel networking only works because I set these env variables:
...
- name: KUBERNETES_SERVICE_HOST
  value: "kube-apiserver.config-service.com"   # the external load balancer
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
...
in the flannel configuration.
It has been suggested that I make sure IP forwarding is enabled at the KVM level on all nodes, which it is:
$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1
However I have not checked ip forwarding on the actual hosts. Could this have something to do with it?
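A quick check of both of the usual culprits on every node (and on the KVM hosts, if nested); note that the bridge sysctl only exists once the br_netfilter module is loaded:
$ sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables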
I'm hitting the same error!!
Who can resolve it?
Hmmm... I noticed this in all the kube-proxy pods:
W1122 11:15:14.698440 1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
W1122 11:15:14.719388 1 server.go:604] Failed to retrieve node info: Unauthorized
I1122 11:15:14.719425 1 server_others.go:148] Using iptables Proxier.
W1122 11:15:14.719577 1 proxier.go:312] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
I1122 11:15:14.720425 1 server_others.go:178] Tearing down inactive rules.
I1122 11:15:15.040385 1 server.go:447] Version: v1.12.2
I1122 11:15:15.053247 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1122 11:15:15.053647 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1122 11:15:15.053804 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1122 11:15:15.053972 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1122 11:15:15.054554 1 config.go:102] Starting endpoints config controller
I1122 11:15:15.054571 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1122 11:15:15.054621 1 config.go:202] Starting service config controller
I1122 11:15:15.054640 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
E1122 11:15:15.059301 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:15.059495 1 event.go:203] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"kube-node-3.15696e04fcf91c76", GenerateName:"",
Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), Dele
tionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]stri
ng(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"", Name:"kube-node-3", UID:"kube-node-3", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"Starting", Message
:"Starting kube-proxy.", Source:v1.EventSource{Component:"kube-proxy", Host:"kube-node-3"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbef5c1d0c33b3e76, ext:420697202, loc:(*time.Location)(0x20dcf40)}
}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbef5c1d0c33b3e76, ext:420697202, loc:(*time.Location)(0x20dcf40)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*ti
me.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Unauthorized' (will not retry!)
E1122 11:15:15.059631 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:16.063451 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:16.064363 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:17.067369 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:17.067578 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:18.070488 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
E1122 11:15:18.071010 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: Unauthorized
E1122 11:15:19.073997 1 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Unauthorized
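Those Unauthorized errors point at the credentials/endpoint kube-proxy is using rather than at iptables; on a kubeadm cluster the kubeconfig it uses lives in the kube-proxy ConfigMap, so the target apiserver address can be inspected with (the grep is just a convenience):
$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep -B2 -A2 'server:'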
# Generated by iptables-save v1.6.1 on Thu Nov 22 18:13:45 2018
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [17:1020]
:POSTROUTING ACCEPT [17:1020]
:DOCKER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-POSTROUTING - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 10.253.0.0/16 -d 10.253.0.0/16 -j RETURN
-A POSTROUTING -s 10.253.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE
-A POSTROUTING ! -s 10.253.0.0/16 -d 10.253.0.0/24 -j RETURN
-A POSTROUTING ! -s 10.253.0.0/16 -d 10.253.0.0/16 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
COMMIT
# Completed on Thu Nov 22 18:13:45 2018
# Generated by iptables-save v1.6.1 on Thu Nov 22 18:13:45 2018
*filter
:INPUT ACCEPT [6234102:1969097221]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [7099913:2105396669]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
:KUBE-FIREWALL - [0:0]
-A INPUT -j KUBE-FIREWALL
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -s 10.253.0.0/16 -j ACCEPT
-A FORWARD -d 10.253.0.0/16 -j ACCEPT
-A OUTPUT -j KUBE-FIREWALL
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
COMMIT
# Completed on Thu Nov 22 18:13:45 2018
OK, for those of you who get to this point and are frustrated: TRUST ME when I tell you that it is easier to destroy the etcd cluster and start from scratch than to try to figure out this network issue. After three days I got tired of debugging and troubleshooting and just decided to start from scratch, which I had done a few times with the masters and the nodes; however, I had left the etcd cluster intact.
I decided to break down and destroy EVERYTHING including the etcd cluster and guess what? Now I have a fully working cluster:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
coredns-576cbf47c7-5tlwd 1/1 Running 0 37m
coredns-576cbf47c7-vsj2z 1/1 Running 0 37m
kube-apiserver-kube-apiserver-1 1/1 Running 0 37m
kube-apiserver-kube-apiserver-2 1/1 Running 0 15m
kube-apiserver-kube-apiserver-3 1/1 Running 0 14m
kube-controller-manager-kube-apiserver-1 1/1 Running 0 37m
kube-controller-manager-kube-apiserver-2 1/1 Running 0 15m
kube-controller-manager-kube-apiserver-3 1/1 Running 0 14m
kube-flannel-ds-amd64-2dtln 1/1 Running 0 8m
kube-flannel-ds-amd64-75bgw 1/1 Running 0 10m
kube-flannel-ds-amd64-cpcjv 1/1 Running 0 35m
kube-flannel-ds-amd64-dlwww 1/1 Running 0 8m
kube-flannel-ds-amd64-dwkjb 1/1 Running 1 15m
kube-flannel-ds-amd64-msx9l 1/1 Running 0 14m
kube-flannel-ds-amd64-smhfj 1/1 Running 0 9m
kube-proxy-5rdk7 1/1 Running 0 10m
kube-proxy-8gfd7 1/1 Running 0 9m
kube-proxy-9kfxv 1/1 Running 0 37m
kube-proxy-c22dl 1/1 Running 0 8m
kube-proxy-gkvz5 1/1 Running 0 14m
kube-proxy-pxlrp 1/1 Running 0 15m
kube-proxy-vmp5h 1/1 Running 0 8m
kube-scheduler-kube-apiserver-1 1/1 Running 0 37m
kube-scheduler-kube-apiserver-2 1/1 Running 0 15m
kube-scheduler-kube-apiserver-3 1/1 Running 0 14m
It would have been a lot easier to start from scratch and avoid ALL these hours of troubleshooting than to try to figure out this network problem. I just wish there were a script/method to reset the etcd DB so we did not have to rebuild from scratch. That would be an awesome tool, something like: kubeadm reset --etcd=https://etcd-cluster.control-service.com:2379
@neolit123 What do you think about an etcd reset option for an external etcd cluster?
sorry for your troubles. i think your database got in a corrupted state for some reason and keeping it around was probably not a good idea. also this is hard to debug...
there are already ways to reset etcd (but on local nodes):
https://groups.google.com/forum/#!topic/coreos-user/qcwLNqou4qQ
also sig-cluster-lifecycle (the maintainers of kubeadm) are working on a tool called etcdadm that will most likely have this functionality.
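for reference, a local reset of an etcd member boils down to something like the following on each member (a rough sketch assuming systemd-managed etcd and the default data directory; adjust the unit name and paths to your setup, and note that this destroys all cluster state):
$ systemctl stop etcd
$ rm -rf /var/lib/etcd/*
$ systemctl start etcd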
haha! I figured out why the problem occurred!
kube-system calico-etcd-h8h46 1/1 Running 1 17h
kube-system calico-kube-controllers-85cf9c8b79-q78b2 1/1 Running 2 17h
kube-system calico-node-5pvsw 2/2 Running 2 17h
kube-system calico-node-7xvn9 2/2 Running 2 17h
kube-system calico-node-85j5x 2/2 Running 3 17h
kube-system coredns-576cbf47c7-cw8lr 1/1 Running 1 17h
kube-system coredns-576cbf47c7-hvt7z 1/1 Running 1 17h
kube-system etcd-k8s-node131 1/1 Running 1 17h
kube-system kube-apiserver-k8s-node131 1/1 Running 1 17h
kube-system kube-controller-manager-k8s-node131 1/1 Running 1 17h
kube-system kube-proxy-458vk 1/1 Running 1 17h
kube-system kube-proxy-n852v 1/1 Running 1 17h
kube-system kube-proxy-p5d5g 1/1 Running 1 17h
kube-system kube-scheduler-k8s-node131 1/1 Running 1 17h
kube-system traefik-ingress-controller-fkhwk 1/1 Running 0 18m
kube-system traefik-ingress-controller-kxr6v 1/1 Running 0 18m
Now the cluster is healthy.
Reason: when deploying the cluster, I mixed up step 2 and step 3!
If someone has the same problem, you can try checking that.
/milestone clear
I did the same, but it's still not solved. The odd thing is that I have two clusters with the same configuration (same ansible-playbook with kubeadm to spin up the cluster), but one works and one doesn't.
have the same problem here ... on 1.15.2
@matthewygf same. I'm running an HA etcd setup with kubeadm. The first master comes up fine, but as I add masters, my node networking pods (calico-node) don't come up and kube-proxy starts spitting out the "can't list endpoints" errors. I feel like I am missing some configuration somewhere.
@AlexMorreale we ended up bouncing the coredns pods to another node and it worked.
one other thing we changed (not sure whether it made an impact): we found out our apiserver pod was on hostNetwork, but its dnsPolicy was not ClusterFirstWithHostNet.
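(For reference, on a kubeadm control-plane node both fields can be checked in the static pod manifest; the path is the kubeadm default, and dnsPolicy may simply be absent, which means the default ClusterFirst:)
$ grep -E 'hostNetwork|dnsPolicy' /etc/kubernetes/manifests/kube-apiserver.yaml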
@matthewygf thanks for pinging me back. I'll try some of that. I think I have a lead otherwise, but that is very helpful. Thanks!!!!!
I have checked my environment and figured it out. The node that the failing CoreDNS pod runs on doesn't have a route to 10.233.0.1 (the apiserver service IP in my environment). It worked fine once I added a static route on that node. But I think the root cause is the ipvs module not working properly. Hope this helps others.
I have figured it out. kube-proxy (ipvs) requests data from the apiserver at 127.0.0.1:6443, but since the node is not a master node, nothing is listening on port 6443 there. I fixed it by replacing 127.0.0.1:6443 with 10.233.0.1 (the apiserver service IP in my environment) and rescheduling the kube-proxy pod.
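In case it helps others, on a kubeadm-style cluster that workaround roughly translates to (double-check the ConfigMap and label names in your cluster):
$ kubectl -n kube-system edit configmap kube-proxy    # change the server: field in kubeconfig.conf
$ kubectl -n kube-system delete pod -l k8s-app=kube-proxy    # recreate the pods so they pick up the change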
@pstrive I ran into a similar issue as you. I set up a Kubernetes cluster with kubespray
in an airgapped environment, and each machine did not have a default route.
I had these statically configured IP addresses:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
craig-airgapped1-1 Ready master 12d v1.15.5 10.0.0.129 <none> CentOS Linux 7 (Core) 3.10.0-957.12.2.el7.x86_64 docker://18.9.7
craig-airgapped1-2 Ready <none> 12d v1.15.5 10.0.0.130 <none> CentOS Linux 7 (Core) 3.10.0-957.12.2.el7.x86_64 docker://18.9.7
craig-airgapped1-3 Ready <none> 12d v1.15.5 10.0.0.131 <none> CentOS Linux 7 (Core) 3.10.0-957.12.2.el7.x86_64 docker://18.9.7
craig-airgapped1-4 Ready <none> 12d v1.15.5 10.0.0.132 <none> CentOS Linux 7 (Core) 3.10.0-957.12.2.el7.x86_64 docker://18.9.7
The CoreDNS pod was unable to access 10.233.0.1 which is the kubernetes api server in my cluster. I managed to work around this problem by logging into each of my worker nodes and doing:
ip route add default via 10.0.0.129
So at least I had a way to route 10.233.0.1 back to the kubernetes master.
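To make that route survive a reboot on CentOS 7, it can also be persisted in the interface's route file (eth0 taken from the routing table below):
$ echo "default via 10.0.0.129 dev eth0" >> /etc/sysconfig/network-scripts/route-eth0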
Maybe it is a bug in kubespray that it didn't set up a route for 10.233.0.1?
On 10.0.0.131, this is what my routing table looked like before my change:
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.131 metric 100
10.233.69.0/24 via 10.0.0.129 dev tunl0 proto bird onlink
10.233.84.0/24 via 10.0.0.132 dev tunl0 proto bird onlink
10.233.85.0/24 via 10.0.0.130 dev tunl0 proto bird onlink
blackhole 10.233.90.0/24 proto bird
10.233.90.12 dev caliecb94626e50 scope link
10.233.90.15 dev cali455ddb145b1 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
and this is what it looked like after:
default via 10.0.0.129 dev eth0
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.131 metric 100
10.233.69.0/24 via 10.0.0.129 dev tunl0 proto bird onlink
10.233.84.0/24 via 10.0.0.132 dev tunl0 proto bird onlink
10.233.85.0/24 via 10.0.0.130 dev tunl0 proto bird onlink
blackhole 10.233.90.0/24 proto bird
10.233.90.12 dev caliecb94626e50 scope link
10.233.90.15 dev cali455ddb145b1 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1