calico-node tries to bind to a nonexistent IP address
My host IP is 10.12.100.23, so the health endpoint used by the livenessProbe should bind to 10.12.100.23:9099 or 127.0.0.1:9099, but calico-node -felix tries to bind to a nonexistent IP address.
This is the log:
2018-08-26 15:06:02.444 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
2018-08-26 15:06:02.504 [INFO][68] int_dataplane.go 734: Applying dataplane updates
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="raw"
2018-08-26 15:06:02.504 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="raw"
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="nat"
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="mangle"
2018-08-26 15:06:02.504 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="nat"
2018-08-26 15:06:02.504 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="mangle"
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="filter"
2018-08-26 15:06:02.505 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2018-08-26 15:06:02.508 [INFO][68] int_dataplane.go 748: Finished applying updates to dataplane. msecToApply=3.859451
2018-08-26 15:06:03.445 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
2018-08-26 15:06:04.447 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
bird: Mesh_10_112_35_117: State changed to start
2018-08-26 15:06:05.448 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
The IP 115.18.44.218 is not my host IP; it is the default DNS record returned for nonexistent domains in my company.
In the Kubernetes dashboard, the error message is:
Readiness probe failed: calico/node is not ready: felix is not ready: Get http://localhost:9099/readiness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Back-off restarting failed container
It seems I have fixed the issue. Everything now looks healthy in kubernetes-dashboard.
I added the env FELIX_HEALTHHOST to calico-node so that felix binds to spec.nodeName.
I changed the felix readinessProbe to httpGet and removed host: localhost, which seems to work fine.
So, what is the difference between httpGet 9099/readiness and exec calico-node -felix-ready?
225a226,230
> # FELIX_HEALTHHOST
> - name: FELIX_HEALTHHOST
> valueFrom:
> fieldRef:
> fieldPath: spec.nodeName
283c288
< host: localhost
---
> #host: 127.0.0.1
288,292c293,301
< exec:
< command:
< - /bin/calico-node
< - -bird-ready
< - -felix-ready
---
> httpGet:
> path: /readiness
> port: 9099
> # host: 127.0.0.1
> #exec:
> # command:
> # - /bin/calico-node
> # - -bird-ready
> # - -felix-ready
calico-node -felix-ready should just be doing an http get under the covers. It's mainly so we can wrap multiple liveness checks into a single command.
I haven't seen a need to set HEALTHHOST explicitly before, it's a bit odd. It should default to localhost per this: https://github.com/projectcalico/felix/blob/2a2fedd7e2831db07d4f36d2ddc928df783e19bb/config/config_params.go
What does localhost resolve to on this machine?
/ # ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from k8s-node.10-132-5-213.com (127.0.0.1): icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from k8s-node.10-132-5-213.com (127.0.0.1): icmp_seq=2 ttl=64 time=0.060 ms
64 bytes from k8s-node.10-132-5-213.com (127.0.0.1): icmp_seq=3 ttl=64 time=0.042 ms
^C
The hostname k8s-node.10-132-5-213.com is configured in /etc/hosts on the node:
127.0.0.1 k8s-node.10-132-5-213.com localhost localhost.localdomain localhost4 localhost4.localdomain4
10.132.5.213 k8s-node.10-132-5-213.com
FELIX_HEALTHHOST: the nodeName is 10.132.5.213
Running calico-node -felix by hand in the container, with the env FELIX_HEALTHHOST set, reports that 10.132.5.213 is already in use (the pod's own calico-node -felix is running normally now):
2018-09-05 02:54:07.002 [ERROR][9347] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 10.132.5.213:9099: bind: address already in use
2018-09-05 02:54:08.002 [ERROR][9347] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 10.132.5.213:9099: bind: address already in use
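The "address already in use" error is what one would expect when a second process tries to listen on an address that a running process already holds. A minimal sketch that reproduces it on the loopback interface (the function name bindTwice is made up for this example, and an ephemeral port is used instead of 9099):

```go
package main

import (
	"fmt"
	"net"
)

// bindTwice listens on an ephemeral loopback port, then tries to listen on
// the very same address while the first listener is still open.
// secondErr is the result of the second attempt; err reports setup failure.
func bindTwice() (secondErr error, err error) {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, fmt.Errorf("first listen failed: %w", err)
	}
	defer first.Close()

	second, secondErr := net.Listen("tcp", first.Addr().String())
	if secondErr == nil {
		second.Close()
	}
	return secondErr, nil
}

func main() {
	secondErr, err := bindTwice()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("second bind:", secondErr)
}
```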
Then unset the env FELIX_HEALTHHOST:
/ # export -n FELIX_HEALTHHOST
/ # env |grep FELIX
FELIX_IPV6SUPPORT=false
FELIX_IPINIPENABLED=true
FELIX_TYPHAK8SSERVICENAME=calico-typha
FELIX_LOGSEVERITYSCREEN=info
FELIX_DEFAULTENDPOINTTOHOSTACTION=ACCEPT
FELIX_HEALTHENABLED=true
FELIX_IPINIPMTU=1440
Running calico-node -felix again, it tries to bind to another nonexistent IP, 101.26.10.11. This IP is not the node IP; it is the LVS VIP of nginx-ingress, and the node 10.132.5.213 is not a member of that LVS.
2018-09-05 02:54:54.808 [ERROR][9400] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 101.26.10.11:9099: bind: cannot assign requested address
^C2018-09-05 02:54:54.933 [WARNING][9400] daemon.go 573: Felix is shutting down reason="Received OS signal interrupt"
2018-09-05 02:54:54.933 [FATAL][9400] daemon.go 636: Exiting immediately reason="Received OS signal interrupt"
Alpine Linux has no /etc/nsswitch.conf file. Could that be the cause of this issue?
package main

import (
	"fmt"
	"net"
	"os"
)

// Resolve "localhost" with Go's resolver and print every address returned.
func main() {
	ns, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Err: %s\n", err.Error())
		return
	}
	for _, n := range ns {
		fmt.Fprintf(os.Stdout, "--%s\n", n)
	}
}
Run it:
# without nsswitch.conf (the line "hosts: files dns myhostname" commented out)
[root@repo tmp]# go run dns.go
--115.9.3.123
# with nsswitch.conf
[root@repo tmp]# vim /etc/nsswitch.conf
[root@repo tmp]#
[root@repo tmp]# go run dns.go
--127.0.0.1
reference: http://www.lijiaocn.com/%E9%97%AE%E9%A2%98/2017/11/09/problem-docker-not-use-hosts-file.html
@annProg hm, that's curious. Simply the existence of the file without any content fixes it?
I used an /etc/nsswitch.conf that contains hosts: files dns myhostname:
$ cat /etc/nsswitch.conf|grep host
hosts: files dns myhostname
@caseydavenport I'm wondering why this has been closed, as I'm facing the same problem with the latest release (3.2.2). With nsswitch.conf missing, felix binds to some weird address. This can be fixed by setting FELIX_HEALTHHOST to 127.0.0.1, which circumvents the DNS query for localhost. But calico-node -bird-ready -felix-ready also tries to look up localhost, gets the wrong answer, and fails.
I think this is a bug in the image building process of calico. The process should create nsswitch.conf at some point and include it in the image. (Could it be that Alpine Linux included nsswitch.conf in the past?)
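The reason an IP literal sidesteps the problem can be sketched as follows: a host value that already parses as an IP address needs no name resolution before it is used as a listen address, whereas a name such as localhost does. The helpers needsLookup and healthAddr are hypothetical and only illustrate the distinction; they are not taken from felix.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// needsLookup reports whether host has to go through name resolution
// before it can be used as a bind address. IP literals do not.
func needsLookup(host string) bool {
	return net.ParseIP(host) == nil
}

// healthAddr builds the host:port string to listen on; IPv6 literals
// are bracketed by net.JoinHostPort.
func healthAddr(host string, port int) string {
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	for _, h := range []string{"localhost", "127.0.0.1"} {
		fmt.Printf("%s -> %s (needs lookup: %v)\n", h, healthAddr(h, 9099), needsLookup(h))
	}
}
```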
In the meantime I've added this to the manifest and everything comes up immediately without changing anything else:
command:
- /bin/sh
args:
- -c
- 'echo ''hosts: files dns myhostname'' >/etc/nsswitch.conf; exec start_runit'
Props to @annProg for figuring this out!
Blech, yes you're right.
It appears something has changed in alpine since the previous releases:
$ docker run -ti calico/node:v3.1.0 sh
Unable to find image 'calico/node:v3.1.0' locally
v3.1.0: Pulling from calico/node
ff3a5c916c92: Already exists
955c3ca38037: Pull complete
975c4d33efdf: Pull complete
0ccec6808bb7: Pull complete
ecded7c91ddd: Pull complete
Digest: sha256:0ad99537793e468d18500f9ecb198ba3efa3fc89936e0c63465edb2732d14dee
Status: Downloaded newer image for calico/node:v3.1.0
/ # cat /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4
/ #
I've got a PR here which adds back the contents seen in earlier versions of Calico: https://github.com/projectcalico/node/pull/70
Great, thanks!
Fixed the same issue with the solution by @annProg, with following minor changes.
Change 1: Set the interface (since I had multiple interfaces)
- name: IP_AUTODETECTION_METHOD
  value: "interface=eno.*"
Change 2: Replace liveness probe too with httpGet.
livenessProbe:
  httpGet:
    path: /liveness
    port: 9099
Setup:
The problem seems to come from multiple interfaces. And - maybe - cgroupfs instead of systemd!?
I am a newbie in k8s and just followed the nice tutorial "Building a Kubernetes Cluster in VirtualBox with Ubuntu", with a setup of 3 VirtualBox VMs installed from ubuntu-16.04.6-server-amd64.iso, running docker 18.09.1 underneath kubernetes 1.16.3, and configured with calico 3.10 (kubectl apply -f https://docs.projectcalico.org/v3.10/manifests/calico.yaml)
My network is configured as:
So, when I start my cluster with sudo kubeadm init --apiserver-advertise-address=192.168.249.20 --pod-network-cidr=192.168.0.0/16 --v=5 and watch kubectl get pods --all-namespaces, I see everything up and running. After doing all the requested steps
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
and trying to join my worker1 to the cluster with
sudo kubeadm join 192.168.249.20:6443 --token t5viv4.t7mq0fw44mn0vzad \
--discovery-token-ca-cert-hash sha256:<hash> --v=2
I end up with kubectl describe pod -n kube-system calico-node-2jvq5 which shows me the same problem
Warning Unhealthy 25m (x1133 over 11h) kubelet, worker1 Liveness probe failed: calico/node is not ready: Felix is not live: Get http://localhost:9099/liveness: dial tcp [::1]:9099: connect: connection refused
I see all the fixes above but, as a newbie, I am a little bit lost. To make a long story short: why is this not fixed in calico 3.10.1 in a way that it just works out of the box?
@JRGit4UE We have not determined a way to always select the correct interface on a host when starting. If you have ideas and time to implement them, we would be happy to review designs and PRs for better auto-detection of the interface that Calico uses.
If you have tried adjusting the auto-detection and Calico is selecting the correct interface/IP, yet the liveness problem still persists, please create a new bug and review or attach logs from a calico-node pod that is failing liveness checks.