Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Version of Ansible (ansible --version):
Version of Python (python --version):
Kubespray version (commit) (git rev-parse --short HEAD):
67167bd
Network plugin used:
calico
Error situation
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
testcpu0-1111972-iaas Ready <none> 29d v1.15.11
testcpu1-1111975-iaas Ready <none> 29d v1.15.11
testgpu0-1111945-iaas Ready <none> 29d v1.15.11
testgpu1-1112050-iaas NotReady <none> 29d v1.15.11
testmaster-1111978-iaas Ready master 29d v1.15.11
~/pai# kubectl get pods -n kube-system -o wide | grep nginx
nginx-proxy-testcpu0-1111972-iaas 1/1 Running 1 27d 192.168.211.37 testcpu0-1111972-iaas <none> <none>
nginx-proxy-testcpu1-1111975-iaas 1/1 Running 11 27d 192.168.211.36 testcpu1-1111975-iaas <none> <none>
nginx-proxy-testgpu0-1111945-iaas 1/1 Running 1 27d 192.168.211.31 testgpu0-1111945-iaas <none> <none>
nginx-proxy-testgpu1-1112050-iaas 0/1 Running 274 18h 192.168.211.44 testgpu1-1112050-iaas <none> <none>
ubuntu@testgpu1-1112050-iaas:~$ sudo docker ps -a | grep nginx
d6b67ea98a36 53f3fd8007f7 "nginx -g 'daemon of…" 4 minutes ago Exited (0) 4 minutes ago k8s_nginx-proxy_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_924
6341bab4af45 gcr.io/google_containers/pause-amd64:3.1 "/pause" 46 hours ago Up 46 hours k8s_POD_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_0
6c84d5ea5fe0 42f78ad0d76a "/usr/local/openrest…" 4 weeks ago Up 4 weeks k8s_log-manager-nginx_log-manager-ds-bvjbw_default_3e1b393d-b4f6-49dc-95d7-22ed4aa96f5b_1
ubuntu@testgpu1-1112050-iaas:~$ sudo docker logs d6b67ea98a36
2020/09/18 10:09:52 [notice] 1#1: using the "epoll" event method
2020/09/18 10:09:52 [notice] 1#1: nginx/1.15.12
2020/09/18 10:09:52 [notice] 1#1: built by gcc 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
2020/09/18 10:09:52 [notice] 1#1: OS: Linux 4.4.0-186-generic
2020/09/18 10:09:52 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/09/18 10:09:52 [notice] 1#1: start worker processes
2020/09/18 10:09:52 [notice] 1#1: start worker process 6
2020/09/18 10:09:52 [notice] 1#1: start worker process 7
2020/09/18 10:10:21 [notice] 1#1: signal 15 (SIGTERM) received, exiting
2020/09/18 10:10:21 [notice] 7#7: exiting
2020/09/18 10:10:21 [notice] 6#6: exiting
2020/09/18 10:10:21 [notice] 7#7: exit
2020/09/18 10:10:21 [notice] 6#6: exit
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 6
2020/09/18 10:10:21 [notice] 1#1: worker process 6 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: signal 29 (SIGIO) received
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 7
2020/09/18 10:10:21 [notice] 1#1: worker process 7 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: exit
@JosephKang Could you please check the kubelet logs on the node?
@floryut ,
The log is posted below. How can I check why the connection to port 6443 fails?
journalctl -u kubelet
Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.515257 24732 kubelet_node_status.go:375] Unable to update node status: update node status exceeds retry count
Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.708809 24732 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://localhost:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.761994 24732 controller.go:125] failed to ensure node lease exists, will retry in 7s, error: Get https://localhost:6443/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/testgpu1-1112050-iaas?timeout=10s: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.908862 24732 reflector.go:125] object-"kube-system"/"default-token-7ctgg": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Ddefault-token-7ctgg&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.108679 24732 reflector.go:125] object-"default"/"gpu-configuration": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Dgpu-configuration&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.308817 24732 reflector.go:125] object-"kube-system"/"kube-proxy-token-jlpk5": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dkube-proxy-token-jlpk5&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.508342 24732 reflector.go:125] object-"aifs"/"image-pull-secret": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/aifs/secrets?fieldSelector=metadata.name%3Dimage-pull-secret&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.708773 24732 reflector.go:125] object-"kube-system"/"host-devices": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dhost-devices&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.908656 24732 reflector.go:125] object-"default"/"pai-secret": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/default/secrets?fieldSelector=metadata.name%3Dpai-secret&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.108436 24732 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https://localhost:6443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.308980 24732 reflector.go:125] object-"aifs"/"default-token-6n2xs": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/aifs/secrets?fieldSelector=metadata.name%3Ddefault-token-6n2xs&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.508754 24732 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:454: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3Dtestgpu1-1112050-iaas&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.708734 24732 reflector.go:125] object-"kube-system"/"kube-proxy": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dkube-proxy&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.908774 24732 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dtestgpu1-1112050-iaas&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.108426 24732 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:445: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.308447 24732 reflector.go:125] object-"kube-system"/"calico-node-token-spv66": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dcalico-node-token-spv66&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.508659 24732 reflector.go:125] object-"default"/"default-token-zq9z7": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/default/secrets?fieldSelector=metadata.name%3Ddefault-token-zq9z7&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.708533 24732 reflector.go:125] object-"kube-system"/"calico-config": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dcalico-config&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.908603 24732 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://localhost:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.108210 24732 reflector.go:125] object-"kube-system"/"default-token-7ctgg": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Ddefault-token-7ctgg&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.308708 24732 reflector.go:125] object-"default"/"gpu-configuration": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Dgpu-configuration&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.508735 24732 reflector.go:125] object-"kube-system"/"kube-proxy-token-jlpk5": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dkube-proxy-token-jlpk5&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.708803 24732 reflector.go:125] object-"aifs"/"image-pull-secret": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/aifs/secrets?fieldSelector=metadata.name%3Dimage-pull-secret&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.908850 24732 reflector.go:125] object-"kube-system"/"host-devices": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dhost-devices&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
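Every error above is the same symptom: dial tcp 127.0.0.1:6443: connect: connection refused. A minimal sketch of how one might check what, if anything, is serving that port on the node (assuming iproute2's ss and curl are available; on a Kubespray worker the local 6443 listener is normally the nginx-proxy static pod, not a local apiserver):

```shell
# Check whether anything is listening on port 6443 on this node.
if ss -tln 2>/dev/null | grep -q ':6443'; then
  status="listening"
else
  status="not listening"
fi
echo "port 6443 is $status"

# Probe the endpoint itself; a refusal here matches the kubelet error.
curl -sk --max-time 2 https://127.0.0.1:6443/healthz >/dev/null 2>&1 \
  || echo "probe to 127.0.0.1:6443 failed"
```

If nothing is listening, the next step is to look at the container that should be providing the listener (docker ps -a | grep nginx-proxy on this setup).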
OK, this means that your API server is not reachable. Can you check the apiserver container on this node?
The kube-apiserver is actually running on the master, testmaster-1111978-iaas, not on the NotReady worker, testgpu1-1112050-iaas.
root@testdevbox-1111981-iaas:~# kubectl get pods --all-namespaces -o wide | grep apiserver
kube-system kube-apiserver-testmaster-1111978-iaas 1/1 Running 0 31d 192.168.211.26 testmaster-1111978-iaas <none> <none>
The kubelet log on the NotReady worker shows that it tries to connect to the local API server. May I know how to check the connection setting on the node? (Maybe a path location inside the container on testgpu1-1112050-iaas?)
ubuntu@testgpu1-1112050-iaas:~$ netstat | grep 6443
tcp 0 0 localhost:6443 localhost:36182 ESTABLISHED
tcp 0 0 localhost:36100 localhost:6443 TIME_WAIT
tcp 0 0 testgpu1-1112050-:59578 testmaster-1111978:6443 ESTABLISHED
tcp 0 0 testgpu1-1112050-:59582 testmaster-1111978:6443 ESTABLISHED
tcp 0 0 localhost:36190 localhost:6443 TIME_WAIT
tcp 0 0 localhost:36186 localhost:6443 ESTABLISHED
tcp 0 0 localhost:6443 localhost:36186 ESTABLISHED
tcp 0 0 localhost:36182 localhost:6443 ESTABLISHED
Yes, but you should have an apiserver on the NotReady node; you need to check the logs of the container directly on the node (docker ps -a | grep apiserver, or ctr/crictl depending on your runtime).
The kubelet is not ready because it is indeed trying to reach the local apiserver container; that container should be in an error state or have erroneous logs.
Oh wait, I misunderstood your problem... the NotReady node is not a master node?
Can you check the kubeconfig file on the node? (especially the server part)
You're right, it's a connectivity issue with the control plane node: the nginx used for the "localhost loadbalancing" feature is crashing on that node, hence no connectivity with the API server.
Can you check the nginx logs on that node? Maybe they'll give a hint why it crashes.
@floryut
Yes, the NotReady node is one of the four workers.
ubuntu@testgpu1-1112050-iaas:~$ ls -al /etc/kubernetes/
total 48
drwxr-xr-x 4 kube root 4096 Aug 20 17:51 .
drwxr-xr-x 126 root root 12288 Aug 20 17:51 ..
-rw------- 1 root root 1791 Aug 20 17:51 bootstrap-kubelet.conf
-rw-r--r-- 1 root root 451 Aug 20 17:51 kubeadm-client.conf
-rw------- 1 root root 1850 Aug 20 17:51 kubelet.conf
-rw------- 1 root root 1855 Aug 20 17:51 kubelet.conf.9875.2020-08-20@17:51:24~
-rw-r--r-- 1 root root 561 Aug 20 17:51 kubelet-config.yaml
-rw-r--r-- 1 root root 717 Aug 20 17:51 kubelet.env
drwxr-xr-x 2 kube root 4096 Aug 20 17:51 manifests
lrwxrwxrwx 1 root root 19 Aug 20 17:50 pki -> /etc/kubernetes/ssl
drwxr-xr-x 2 kube root 4096 Aug 20 17:51 ssl
ubuntu@testgpu1-1112050-iaas:~$ sudo cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: ...
server: https://localhost:6443
name: default-cluster
contexts:
- context:
cluster: default-cluster
namespace: default
user: default-auth
name: default-context
current-context: default-context
kind: Config
preferences: {}
users:
- name: default-auth
user:
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
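For background: server: https://localhost:6443 is expected on Kubespray workers. Each worker runs an nginx-proxy static pod that listens on 127.0.0.1:6443 and forwards TCP traffic to the real apiserver(s). A sketch of the kind of stream config Kubespray generates (using this cluster's master IP; the actual generated file and template may differ by Kubespray version):

```nginx
# Sketch of the per-worker nginx-proxy "localhost loadbalancing" config.
stream {
  upstream kube_apiserver {
    least_conn;
    server 192.168.211.26:6443;   # testmaster-1111978-iaas (this cluster's master)
  }
  server {
    listen 127.0.0.1:6443;
    proxy_pass kube_apiserver;
  }
}
```

This is why a Ready worker can point its kubelet at localhost:6443 and still reach the master: the proxy does the forwarding. If the proxy container is down, the kubelet gets connection refused even though the master itself is healthy.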
I changed the server setting from localhost to the master IP and then restarted kubelet. However, the NotReady situation is the same. Moreover, the other Ready nodes have the same localhost server setting, so I am a little confused about the setting... May I have more suggestions?
@EppO ,
Please refer to the following k8s_nginx-proxy log and let me know your comments.
ubuntu@testgpu1-1112050-iaas:~$ sudo docker ps -a | grep nginx
d6b67ea98a36 53f3fd8007f7 "nginx -g 'daemon of…" 4 minutes ago Exited (0) 4 minutes ago k8s_nginx-proxy_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_924
6341bab4af45 gcr.io/google_containers/pause-amd64:3.1 "/pause" 46 hours ago Up 46 hours k8s_POD_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_0
6c84d5ea5fe0 42f78ad0d76a "/usr/local/openrest…" 4 weeks ago Up 4 weeks k8s_log-manager-nginx_log-manager-ds-bvjbw_default_3e1b393d-b4f6-49dc-95d7-22ed4aa96f5b_1
ubuntu@testgpu1-1112050-iaas:~$ sudo docker logs d6b67ea98a36
2020/09/18 10:09:52 [notice] 1#1: using the "epoll" event method
2020/09/18 10:09:52 [notice] 1#1: nginx/1.15.12
2020/09/18 10:09:52 [notice] 1#1: built by gcc 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
2020/09/18 10:09:52 [notice] 1#1: OS: Linux 4.4.0-186-generic
2020/09/18 10:09:52 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/09/18 10:09:52 [notice] 1#1: start worker processes
2020/09/18 10:09:52 [notice] 1#1: start worker process 6
2020/09/18 10:09:52 [notice] 1#1: start worker process 7
2020/09/18 10:10:21 [notice] 1#1: signal 15 (SIGTERM) received, exiting
2020/09/18 10:10:21 [notice] 7#7: exiting
2020/09/18 10:10:21 [notice] 6#6: exiting
2020/09/18 10:10:21 [notice] 7#7: exit
2020/09/18 10:10:21 [notice] 6#6: exit
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 6
2020/09/18 10:10:21 [notice] 1#1: worker process 6 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: signal 29 (SIGIO) received
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 7
2020/09/18 10:10:21 [notice] 1#1: worker process 7 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: exit
I checked the nginx-proxy log on both the NotReady node and a Ready node, and it seems to be some authorization error.
May I know how to re-authorize the node?
* nginx-proxy-testgpu1-1112050-iaas *
~# kubectl logs nginx-proxy-testgpu1-1112050-iaas -n kube-system
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
* nginx-proxy-testgpu0-1111945-iaas *
~# kubectl logs nginx-proxy-testgpu0-1111945-iaas -n kube-system
2020/08/21 06:45:23 [notice] 1#1: using the "epoll" event method
2020/08/21 06:45:23 [notice] 1#1: nginx/1.15.12
2020/08/21 06:45:23 [notice] 1#1: built by gcc 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
2020/08/21 06:45:23 [notice] 1#1: OS: Linux 4.4.0-186-generic
2020/08/21 06:45:23 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/08/21 06:45:23 [notice] 1#1: start worker processes
2020/08/21 06:45:23 [notice] 1#1: start worker process 7
2020/08/21 06:45:23 [notice] 1#1: start worker process 8
It seems to be some inter-pod interference.
I did some tests.
I still don't know the exact root cause of the issue, so I just moved pod A to the master to work around the strange situation.
What's pod A? You shouldn't have a pod listening on host port 8080, as it's used by nginx for the localhost loadbalancing feature. Modify pod A's spec to use any other host port and it should work.
@EppO
Sorry for giving incorrect information.
Pod A is a customized service that uses 8080 as the container port and 8081 as the host port.
I checked the nginx-proxy settings in Kubespray, and loadbalancer_apiserver_healthcheck_port is set to 8081.
Therefore, I modified pod A's spec so the host port is no longer 8081 and let pod A run on the worker again. Currently the worker is still in Ready status, and I will keep monitoring it for a while.
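The collision can be reproduced outside Kubernetes: two processes cannot bind the same host address and port. A self-contained sketch (using a throwaway port, 18081, rather than the real 8081):

```shell
# Start a first listener, as nginx-proxy's health-check endpoint would.
python3 -m http.server 18081 --bind 127.0.0.1 >/dev/null 2>&1 &
first=$!
sleep 1

# A second bind on the same address:port fails, just as the health-check
# listener on loadbalancer_apiserver_healthcheck_port (8081) was blocked
# while pod A held that host port.
if python3 -c "import socket; s=socket.socket(); s.bind(('127.0.0.1', 18081))" 2>/dev/null; then
  result="second bind succeeded"
else
  result="second bind failed: address already in use"
fi
echo "$result"
kill $first 2>/dev/null
```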
Thanks @EppO and @floryut for your kind support.
glad you figured it out, I guess we can close the issue.
/close
@EppO: Closing this issue.