Kubespray: Host in NotReady status

Created on 18 Sep 2020  ·  15 comments  ·  Source: kubernetes-sigs/kubespray

Environment:

  • Cloud provider or hardware configuration:
    OpenStack VMs
  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Ubuntu 16.04
  • Version of Ansible (ansible --version):
    2.7.12
  • Version of Python (python --version):
    2.7.12

Kubespray version (commit) (git rev-parse --short HEAD):
67167bd

Network plugin used:
calico

Error situation

# kubectl get nodes
NAME                      STATUS     ROLES    AGE   VERSION
testcpu0-1111972-iaas     Ready      <none>   29d   v1.15.11
testcpu1-1111975-iaas     Ready      <none>   29d   v1.15.11
testgpu0-1111945-iaas     Ready      <none>   29d   v1.15.11
testgpu1-1112050-iaas     NotReady   <none>   29d   v1.15.11
testmaster-1111978-iaas   Ready      master   29d   v1.15.11
~/pai# kubectl get pods -n kube-system -o wide | grep nginx
nginx-proxy-testcpu0-1111972-iaas                 1/1     Running   1          27d   192.168.211.37   testcpu0-1111972-iaas     <none>           <none>
nginx-proxy-testcpu1-1111975-iaas                 1/1     Running   11         27d   192.168.211.36   testcpu1-1111975-iaas     <none>           <none>
nginx-proxy-testgpu0-1111945-iaas                 1/1     Running   1          27d   192.168.211.31   testgpu0-1111945-iaas     <none>           <none>
nginx-proxy-testgpu1-1112050-iaas                 0/1     Running   274        18h   192.168.211.44   testgpu1-1112050-iaas     <none>           <none>






ubuntu@testgpu1-1112050-iaas:~$ sudo docker ps -a | grep nginx
d6b67ea98a36        53f3fd8007f7                               "nginx -g 'daemon of…"   4 minutes ago       Exited (0) 4 minutes ago                       k8s_nginx-proxy_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_924
6341bab4af45        gcr.io/google_containers/pause-amd64:3.1   "/pause"                 46 hours ago        Up 46 hours                                    k8s_POD_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_0
6c84d5ea5fe0        42f78ad0d76a                               "/usr/local/openrest…"   4 weeks ago         Up 4 weeks                                     k8s_log-manager-nginx_log-manager-ds-bvjbw_default_3e1b393d-b4f6-49dc-95d7-22ed4aa96f5b_1
ubuntu@testgpu1-1112050-iaas:~$ sudo docker logs d6b67ea98a36
2020/09/18 10:09:52 [notice] 1#1: using the "epoll" event method
2020/09/18 10:09:52 [notice] 1#1: nginx/1.15.12
2020/09/18 10:09:52 [notice] 1#1: built by gcc 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
2020/09/18 10:09:52 [notice] 1#1: OS: Linux 4.4.0-186-generic
2020/09/18 10:09:52 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/09/18 10:09:52 [notice] 1#1: start worker processes
2020/09/18 10:09:52 [notice] 1#1: start worker process 6
2020/09/18 10:09:52 [notice] 1#1: start worker process 7
2020/09/18 10:10:21 [notice] 1#1: signal 15 (SIGTERM) received, exiting
2020/09/18 10:10:21 [notice] 7#7: exiting
2020/09/18 10:10:21 [notice] 6#6: exiting
2020/09/18 10:10:21 [notice] 7#7: exit
2020/09/18 10:10:21 [notice] 6#6: exit
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 6
2020/09/18 10:10:21 [notice] 1#1: worker process 6 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: signal 29 (SIGIO) received
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 7
2020/09/18 10:10:21 [notice] 1#1: worker process 7 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: exit
Label: kind/bug


All 15 comments

@JosephKang Could you please check the kubelet logs on the node?

@floryut ,

The log is posted below. How can I check why the connection to port 6443 is failing?

journalctl -u kubelet

Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.515257   24732 kubelet_node_status.go:375] Unable to update node status: update node status exceeds retry count
Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.708809   24732 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://localhost:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.761994   24732 controller.go:125] failed to ensure node lease exists, will retry in 7s, error: Get https://localhost:6443/apis/coordination.k8s.io/v1beta1/namespaces/kube-node-lease/leases/testgpu1-1112050-iaas?timeout=10s: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:24 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:24.908862   24732 reflector.go:125] object-"kube-system"/"default-token-7ctgg": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Ddefault-token-7ctgg&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.108679   24732 reflector.go:125] object-"default"/"gpu-configuration": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Dgpu-configuration&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.308817   24732 reflector.go:125] object-"kube-system"/"kube-proxy-token-jlpk5": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dkube-proxy-token-jlpk5&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.508342   24732 reflector.go:125] object-"aifs"/"image-pull-secret": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/aifs/secrets?fieldSelector=metadata.name%3Dimage-pull-secret&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.708773   24732 reflector.go:125] object-"kube-system"/"host-devices": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dhost-devices&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:25 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:25.908656   24732 reflector.go:125] object-"default"/"pai-secret": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/default/secrets?fieldSelector=metadata.name%3Dpai-secret&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.108436   24732 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.CSIDriver: Get https://localhost:6443/apis/storage.k8s.io/v1beta1/csidrivers?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.308980   24732 reflector.go:125] object-"aifs"/"default-token-6n2xs": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/aifs/secrets?fieldSelector=metadata.name%3Ddefault-token-6n2xs&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.508754   24732 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:454: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3Dtestgpu1-1112050-iaas&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.708734   24732 reflector.go:125] object-"kube-system"/"kube-proxy": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dkube-proxy&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:26 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:26.908774   24732 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dtestgpu1-1112050-iaas&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.108426   24732 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:445: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.308447   24732 reflector.go:125] object-"kube-system"/"calico-node-token-spv66": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dcalico-node-token-spv66&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.508659   24732 reflector.go:125] object-"default"/"default-token-zq9z7": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/default/secrets?fieldSelector=metadata.name%3Ddefault-token-zq9z7&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.708533   24732 reflector.go:125] object-"kube-system"/"calico-config": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dcalico-config&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:27 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:27.908603   24732 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: Get https://localhost:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.108210   24732 reflector.go:125] object-"kube-system"/"default-token-7ctgg": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Ddefault-token-7ctgg&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.308708   24732 reflector.go:125] object-"default"/"gpu-configuration": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Dgpu-configuration&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.508735   24732 reflector.go:125] object-"kube-system"/"kube-proxy-token-jlpk5": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dkube-proxy-token-jlpk5&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.708803   24732 reflector.go:125] object-"aifs"/"image-pull-secret": Failed to list *v1.Secret: Get https://localhost:6443/api/v1/namespaces/aifs/secrets?fieldSelector=metadata.name%3Dimage-pull-secret&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
Sep 18 18:39:28 testgpu1-1112050-iaas kubelet[24732]: E0918 18:39:28.908850   24732 reflector.go:125] object-"kube-system"/"host-devices": Failed to list *v1.ConfigMap: Get https://localhost:6443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dhost-devices&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
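
Every failing request above targets the same local endpoint. Assuming the journal output has been saved to a file (for example with `journalctl -u kubelet --no-pager > kubelet.log`; the file name is just an illustration), a quick tally confirms it is always 127.0.0.1:6443 being refused:

```shell
# Tally which endpoints the kubelet fails to dial, from a saved
# journal dump (kubelet.log is a hypothetical file name).
grep -oE 'dial tcp [0-9.]+:[0-9]+' kubelet.log | sort | uniq -c | sort -rn
```

On the log above this yields a single entry, 127.0.0.1:6443, which narrows the problem to whatever serves the API on localhost.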

OK, this means that your API server is not up. Can you check the apiserver container on this node?

The kube-apiserver is actually running on the master, testmaster-1111978-iaas, instead of on the NotReady worker, testgpu1-1112050-iaas:

root@testdevbox-1111981-iaas:~# kubectl get pods --all-namespaces -o wide | grep apiserver
kube-system   kube-apiserver-testmaster-1111978-iaas              1/1     Running   0          31d     192.168.211.26   testmaster-1111978-iaas   <none>           <none>

The kubelet log on the NotReady worker shows that it tries to connect to a local API server. How can I check the connection settings on the node? (Maybe a path inside the container on testgpu1-1112050-iaas?)

ubuntu@testgpu1-1112050-iaas:~$ netstat | grep 6443
tcp        0      0 localhost:6443          localhost:36182         ESTABLISHED
tcp        0      0 localhost:36100         localhost:6443          TIME_WAIT
tcp        0      0 testgpu1-1112050-:59578 testmaster-1111978:6443 ESTABLISHED
tcp        0      0 testgpu1-1112050-:59582 testmaster-1111978:6443 ESTABLISHED
tcp        0      0 localhost:36190         localhost:6443          TIME_WAIT
tcp        0      0 localhost:36186         localhost:6443          ESTABLISHED
tcp        0      0 localhost:6443          localhost:36186         ESTABLISHED
tcp        0      0 localhost:36182         localhost:6443          ESTABLISHED

Yes, but you should have an apiserver on the NotReady node; you need to check the logs of the container directly on the node (docker ps -a | grep apiserver, or ctr/crictl depending on your runtime).
The kubelet is not ready because it does indeed try to reach the local apiserver container; that container should be in error or have erroneous logs.

Oh wait, I misunderstood your problem... the NotReady node is not a master node?
Can you check the kubeconfig file on the node? (especially the server part)

You're right, it's a connectivity issue with the control plane node: the nginx used for the "localhost loadbalancing" feature is crashing on that node, hence no connectivity with the API server.
Can you check the nginx logs on that node? Maybe they'll give a hint as to why it crashes.

@floryut

Yes, the NotReady node is one of the four workers.

ubuntu@testgpu1-1112050-iaas:~$ ls -al /etc/kubernetes/
total 48
drwxr-xr-x   4 kube root  4096 Aug 20 17:51 .
drwxr-xr-x 126 root root 12288 Aug 20 17:51 ..
-rw-------   1 root root  1791 Aug 20 17:51 bootstrap-kubelet.conf
-rw-r--r--   1 root root   451 Aug 20 17:51 kubeadm-client.conf
-rw-------   1 root root  1850 Aug 20 17:51 kubelet.conf
-rw-------   1 root root  1855 Aug 20 17:51 kubelet.conf.9875.2020-08-20@17:51:24~
-rw-r--r--   1 root root   561 Aug 20 17:51 kubelet-config.yaml
-rw-r--r--   1 root root   717 Aug 20 17:51 kubelet.env
drwxr-xr-x   2 kube root  4096 Aug 20 17:51 manifests
lrwxrwxrwx   1 root root    19 Aug 20 17:50 pki -> /etc/kubernetes/ssl
drwxr-xr-x   2 kube root  4096 Aug 20 17:51 ssl
ubuntu@testgpu1-1112050-iaas:~$ sudo cat /etc/kubernetes/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
    server: https://localhost:6443
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem

I changed the server setting from localhost to the master IP and then restarted kubelet. However, the NotReady situation is the same. Moreover, the other Ready nodes have the same localhost server setting, so I am a little confused about it... Do you have any further suggestions?
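
For context, the localhost entry is expected: on kubespray workers there is no local apiserver; the nginx-proxy static pod listens on 127.0.0.1:6443 and forwards to the masters, which is why the Ready nodes carry the same setting. A bash-only probe (a sketch, assuming no extra tools on the node) shows whether anything is accepting connections there:

```shell
# Probe 127.0.0.1:6443 via bash's /dev/tcp pseudo-device; on a healthy
# kubespray worker the local nginx-proxy accepts the connection.
if (exec 3<>/dev/tcp/127.0.0.1/6443) 2>/dev/null; then
    echo "something is listening on 127.0.0.1:6443"
else
    echo "nothing on 127.0.0.1:6443 - the local nginx-proxy is down"
fi
```

If nothing answers, repointing kubelet at the master IP only masks the proxy failure rather than fixing it.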

@EppO ,

Please refer to the following k8s_nginx-proxy log and let me know your thoughts.

ubuntu@testgpu1-1112050-iaas:~$ sudo docker ps -a | grep nginx
d6b67ea98a36        53f3fd8007f7                               "nginx -g 'daemon of…"   4 minutes ago       Exited (0) 4 minutes ago                       k8s_nginx-proxy_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_924
6341bab4af45        gcr.io/google_containers/pause-amd64:3.1   "/pause"                 46 hours ago        Up 46 hours                                    k8s_POD_nginx-proxy-testgpu1-1112050-iaas_kube-system_9bff03b93ed8b5b1440d9d6baef588ca_0
6c84d5ea5fe0        42f78ad0d76a                               "/usr/local/openrest…"   4 weeks ago         Up 4 weeks                                     k8s_log-manager-nginx_log-manager-ds-bvjbw_default_3e1b393d-b4f6-49dc-95d7-22ed4aa96f5b_1
ubuntu@testgpu1-1112050-iaas:~$ sudo docker logs d6b67ea98a36
2020/09/18 10:09:52 [notice] 1#1: using the "epoll" event method
2020/09/18 10:09:52 [notice] 1#1: nginx/1.15.12
2020/09/18 10:09:52 [notice] 1#1: built by gcc 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
2020/09/18 10:09:52 [notice] 1#1: OS: Linux 4.4.0-186-generic
2020/09/18 10:09:52 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/09/18 10:09:52 [notice] 1#1: start worker processes
2020/09/18 10:09:52 [notice] 1#1: start worker process 6
2020/09/18 10:09:52 [notice] 1#1: start worker process 7
2020/09/18 10:10:21 [notice] 1#1: signal 15 (SIGTERM) received, exiting
2020/09/18 10:10:21 [notice] 7#7: exiting
2020/09/18 10:10:21 [notice] 6#6: exiting
2020/09/18 10:10:21 [notice] 7#7: exit
2020/09/18 10:10:21 [notice] 6#6: exit
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 6
2020/09/18 10:10:21 [notice] 1#1: worker process 6 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: signal 29 (SIGIO) received
2020/09/18 10:10:21 [notice] 1#1: signal 17 (SIGCHLD) received from 7
2020/09/18 10:10:21 [notice] 1#1: worker process 7 exited with code 0
2020/09/18 10:10:21 [notice] 1#1: exit

I checked the nginx-proxy logs on both the NotReady node and a Ready node, and it seems to be some authorization error.

How can I re-authorize the node?

* nginx-proxy-testgpu1-1112050-iaas *

~# kubectl logs nginx-proxy-testgpu1-1112050-iaas -n kube-system
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)

* nginx-proxy-testgpu0-1111945-iaas *

~# kubectl logs nginx-proxy-testgpu0-1111945-iaas -n kube-system
2020/08/21 06:45:23 [notice] 1#1: using the "epoll" event method
2020/08/21 06:45:23 [notice] 1#1: nginx/1.15.12
2020/08/21 06:45:23 [notice] 1#1: built by gcc 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
2020/08/21 06:45:23 [notice] 1#1: OS: Linux 4.4.0-186-generic
2020/08/21 06:45:23 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/08/21 06:45:23 [notice] 1#1: start worker processes
2020/08/21 06:45:23 [notice] 1#1: start worker process 7
2020/08/21 06:45:23 [notice] 1#1: start worker process 8

It seems to be some inter-pod interference.

I did some tests:

  1. Found one pod (pod A) listening on port 8080.
  2. Moved pod A to another worker.
  3. The new worker entered NotReady status.
  4. The old worker returned to Ready status.

I still don't know the exact root cause of the issue; I just moved pod A to the master to work around the strange situation.

What is pod A? You shouldn't have a pod listening on host port 8080, as it's used by nginx for the localhost load-balancing feature. Modify pod A's spec to use any other host port and it should work.
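
To find out which pods claim a given host port before shuffling workloads around, the pod specs can be inspected; a sketch using kubectl and jq (8080 is the port mentioned above; adjust the number, and extend to initContainers if you declare hostPort there too):

```shell
# List namespace/name of every pod with a container declaring
# hostPort 8080.
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select([.spec.containers[].ports[]?.hostPort] | index(8080))
      | .metadata.namespace + "/" + .metadata.name'
```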

@EppO
Sorry for giving incorrect information.

Pod A is a customized service that uses 8080 as its container port and 8081 as its host port.
I checked the nginx-proxy settings in kubespray, and loadbalancer_apiserver_healthcheck_port is set to 8081.

Therefore, I modified pod A's spec so that its host port is no longer 8081 and let pod A run on the worker again. Currently the worker is still in Ready status, and I will keep monitoring it for a while.
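
This would also explain the restart loop seen earlier: kubespray's nginx-proxy static pod is health-checked via a /healthz endpoint on loadbalancer_apiserver_healthcheck_port (8081 here), so if another pod publishes host port 8081, the probe traffic never reaches nginx and the kubelet keeps terminating an otherwise healthy container, consistent with the clean SIGTERM exits in the nginx logs above. A quick check from the worker (a sketch):

```shell
# Ask the nginx-proxy healthcheck endpoint directly; a healthy proxy
# answers 200, while 000 (no listener) or an unexpected status suggests
# the port is down or shadowed by another pod's hostPort.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8081/healthz
```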

Thanks @EppO and @floryut for your kind support.

glad you figured it out, I guess we can close the issue.
/close

@EppO: Closing this issue.

In response to this:

glad you figured it out, I guess we can close the issue.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
