After deploying a new cluster we ran into a strange problem: the kube-router pod on some nodes gets stuck in CrashLoopBackOff.
The kube-router log shows a timeout connecting to the API server:
E0521 09:25:04.217633 1733 reflector.go:205] github.com/cloudnativelabs/kube-router/vendor/k8s.io/client-go/informers/factory.go:73: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?resourceVersion=0: dial tcp: i/o timeout
An strace of kube-router reveals that it tries to resolve localhost by querying nodelocaldns at its IP address, and the query times out.
The nodelocaldns logs in turn show it trying to reach the DNS service IPs advertised by kube-router.
The problem goes away if 127.0.0.1 is specified instead of localhost in the kube-router kubeconfig, set in the inventory like this:
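For contrast, on a typical glibc host the name-service switch puts files before dns, so localhost is answered from /etc/hosts without any DNS query ever being made. A quick sanity check (assuming getent is available):

```shell
# Show the hosts lookup order; on glibc systems 'files' normally
# precedes 'dns'. Alpine-based images may not have this file at all.
cat /etc/nsswitch.conf 2>/dev/null | grep '^hosts:' || echo 'no nsswitch.conf (e.g. Alpine)'
# localhost resolves from /etc/hosts; no DNS server is involved
getent hosts localhost
```

If getent answers for localhost while the in-pod resolution times out, the problem is lookup order inside the container, not the cluster DNS itself.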
kube_apiserver_endpoint: https://127.0.0.1:6443
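A sketch of where such an override could live in a Kubespray inventory (the group_vars path is illustrative; the file is written to /tmp here only to show its contents):

```shell
# In a real inventory this would be e.g.
# inventory/mycluster/group_vars/all/all.yml
cat > /tmp/all.yml <<'EOF'
# Pin the kubeconfig endpoint to the loopback IP so no name
# resolution is needed to reach the local apiserver endpoint
kube_apiserver_endpoint: https://127.0.0.1:6443
EOF
cat /tmp/all.yml
```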
Environment:
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 4.18.0-147.8.1.el8_1.x86_64 x86_64
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
Version of Ansible (ansible --version):
Version of Python (python --version):
Kubespray version (commit) (git rev-parse --short HEAD):
2.13.0
01dbc909be34c9c8b34cb9d5e88a4f0e74affcbc
Network plugin used:
kube-router
Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
Command used to invoke ansible:
Output of ansible run:
Anything else do we need to know:
I ran into this issue as well with version 2.13.0 on Ubuntu 18.04 nodes.
I got the same issue as well
This is because kube-router uses Alpine as its base image, which does not include /etc/nsswitch.conf; as a result, localhost cannot be resolved from /etc/hosts. I submitted https://github.com/cloudnativelabs/kube-router/pull/957 to work around this issue.
My kube-router PR has been merged and released in v1.0.1, and https://github.com/kubernetes-sigs/kubespray/pull/6479 has updated kube-router to v1.0.1 in Kubespray, so this issue should now be fixed.
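The workaround amounts to shipping a minimal /etc/nsswitch.conf in the image so that /etc/hosts is consulted before DNS. An equivalent sketch (the 'hosts: files dns' line is the standard glibc-style ordering; this is not the exact PR diff):

```shell
# In the container image this file would be /etc/nsswitch.conf;
# it is written to /tmp here for illustration.
echo 'hosts: files dns' > /tmp/nsswitch.conf
# With this line present, resolvers that honor nsswitch.conf check
# /etc/hosts first, so 'localhost' never hits the cluster DNS.
cat /tmp/nsswitch.conf
```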
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten