calico-node tries to bind a non-existent IP address

Created on 26 Aug 2018 · 13 Comments · Source: projectcalico/calico

Expected Behavior

My host IP is 10.12.100.23, so the health endpoint used by the livenessProbe should bind 10.12.100.23:9099 or 127.0.0.1:9099. Instead, calico-node -felix tries to bind an IP address that does not exist on the host.

Current Behavior

This is the log:

2018-08-26 15:06:02.444 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
2018-08-26 15:06:02.504 [INFO][68] int_dataplane.go 734: Applying dataplane updates
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="raw"
2018-08-26 15:06:02.504 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="raw"
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="nat"
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="mangle"
2018-08-26 15:06:02.504 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="nat"
2018-08-26 15:06:02.504 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="mangle"
2018-08-26 15:06:02.504 [INFO][68] table.go 717: Invalidating dataplane cache ipVersion=0x4 reason="post update" table="filter"
2018-08-26 15:06:02.505 [INFO][68] table.go 438: Loading current iptables state and checking it is correct. ipVersion=0x4 table="filter"
2018-08-26 15:06:02.508 [INFO][68] int_dataplane.go 748: Finished applying updates to dataplane. msecToApply=3.859451
2018-08-26 15:06:03.445 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
2018-08-26 15:06:04.447 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
bird: Mesh_10_112_35_117: State changed to start
2018-08-26 15:06:05.448 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address

The IP 115.18.44.218 is not my host IP; it is the default DNS record returned for a non-existent domain in my company.
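
The "cannot assign requested address" error in the log is what a bind returns when the address is not assigned to any local interface. A minimal sketch of that check follows; containsIP is a hypothetical helper written for illustration, not part of Felix.

```go
package main

import (
	"fmt"
	"net"
)

// containsIP reports whether ip equals the address part of any CIDR string
// in cidrs (for example "10.12.100.23/24"). Entries that fail to parse are
// skipped. This is an illustrative helper, not Felix code.
func containsIP(cidrs []string, ip string) bool {
	target := net.ParseIP(ip)
	if target == nil {
		return false
	}
	for _, c := range cidrs {
		addr, _, err := net.ParseCIDR(c)
		if err != nil {
			continue
		}
		if addr.Equal(target) {
			return true
		}
	}
	return false
}

func main() {
	// Collect the addresses assigned to local interfaces.
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		fmt.Println("cannot list interface addresses:", err)
		return
	}
	var local []string
	for _, a := range addrs {
		local = append(local, a.String())
	}
	// 115.18.44.218 is the address taken from the log above.
	for _, ip := range []string{"127.0.0.1", "115.18.44.218"} {
		fmt.Printf("%s assigned locally: %v\n", ip, containsIP(local, ip))
	}
}
```

If the second address is reported as not assigned locally, a listener on it is expected to fail the same way the health endpoint does.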

In the Kubernetes dashboard, the error message is:

Readiness probe failed: calico/node is not ready: felix is not ready: Get http://localhost:9099/readiness: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Back-off restarting failed container


Your Environment

  • Calico version: 3.2.1
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.11.1
  • Operating System and version: CentOS Linux release 7.2.1511 (Core)
  • Link to your project (optional):
kind/bug

All 13 comments

It seems I have fixed the issue. Everything now looks healthy in kubernetes-dashboard.

I added the env var FELIX_HEALTHHOST to calico-node so that Felix binds to spec.nodeName.
I also changed the Felix readinessProbe to httpGet and deleted host: localhost, which seems to work fine.

So, what's the difference between httpGet 9099/readiness and exec calico-node -felix-ready?

225a226,230
>             # FELIX_HEALTHHOST
>             - name: FELIX_HEALTHHOST
>               valueFrom:
>                 fieldRef:
>                   fieldPath: spec.nodeName
283c288
<               host: localhost
---
>               #host: 127.0.0.1
288,292c293,301
<             exec:
<               command:
<               - /bin/calico-node
<               - -bird-ready
<               - -felix-ready
---
>             httpGet:
>               path: /readiness
>               port: 9099
>             #  host: 127.0.0.1
>             #exec:
>             #  command:
>             #  - /bin/calico-node
>             #  - -bird-ready
>             #  - -felix-ready

calico-node -felix-ready should just be doing an http get under the covers. It's mainly so we can wrap multiple liveness checks into a single command.
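
A rough sketch of what "an HTTP GET under the covers" could look like is shown here. It is not the calico-node implementation; probeURL and probe are hypothetical names, and the 2-second timeout is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strconv"
	"time"
)

// probeURL builds the URL for a health path. JoinHostPort adds the brackets
// that IPv6 literals such as ::1 need.
func probeURL(host string, port int, path string) string {
	return "http://" + net.JoinHostPort(host, strconv.Itoa(port)) + path
}

// probe issues a single GET and treats any 2xx status as healthy.
func probe(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// The host name is resolved by this process, so the result depends on
	// the resolver configuration of wherever the check runs.
	url := probeURL("localhost", 9099, "/readiness")
	if err := probe(url, 2*time.Second); err != nil {
		fmt.Println("not ready:", err)
		return
	}
	fmt.Println("ready")
}
```

One general Kubernetes difference is worth keeping in mind, although it is not confirmed for Calico in this thread. An httpGet probe is issued by the kubelet and, without a host field, targets the pod IP. An exec probe runs inside the container, so a localhost lookup there depends on the container's own resolver configuration.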

I haven't seen a need to set HEALTHHOST explicitly before, it's a bit odd. It should default to localhost per this: https://github.com/projectcalico/felix/blob/2a2fedd7e2831db07d4f36d2ddc928df783e19bb/config/config_params.go

What does localhost resolve to on this machine?

ping localhost

/ # ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from k8s-node.10-132-5-213.com (127.0.0.1): icmp_seq=1 ttl=64 time=0.068 ms
64 bytes from k8s-node.10-132-5-213.com (127.0.0.1): icmp_seq=2 ttl=64 time=0.060 ms
64 bytes from k8s-node.10-132-5-213.com (127.0.0.1): icmp_seq=3 ttl=64 time=0.042 ms
^C

The hostname k8s-node.10-132-5-213.com is configured in /etc/hosts on the node:

127.0.0.1       k8s-node.10-132-5-213.com  localhost localhost.localdomain localhost4 localhost4.localdomain4
10.132.5.213    k8s-node.10-132-5-213.com

env FELIX_HEALTHHOST

The nodeName is 10.132.5.213.
Running calico-node -felix in the container with the env var FELIX_HEALTHHOST set reports that 10.132.5.213:9099 is already in use (the calico-node -felix started by the pod is running normally now):

2018-09-05 02:54:07.002 [ERROR][9347] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 10.132.5.213:9099: bind: address already in use
2018-09-05 02:54:08.002 [ERROR][9347] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 10.132.5.213:9099: bind: address already in use

Then remove FELIX_HEALTHHOST from the environment:

/ # export -n FELIX_HEALTHHOST
/ # env |grep FELIX
FELIX_IPV6SUPPORT=false
FELIX_IPINIPENABLED=true
FELIX_TYPHAK8SSERVICENAME=calico-typha
FELIX_LOGSEVERITYSCREEN=info
FELIX_DEFAULTENDPOINTTOHOSTACTION=ACCEPT
FELIX_HEALTHENABLED=true
FELIX_IPINIPMTU=1440

Run calico-node -felix again: it tries to bind another non-existent IP, 101.26.10.11. That IP is not the node IP; it is the LVS VIP of nginx-ingress, and the node 10.132.5.213 is not a member of the LVS.

2018-09-05 02:54:54.808 [ERROR][9400] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 101.26.10.11:9099: bind: cannot assign requested address
^C2018-09-05 02:54:54.933 [WARNING][9400] daemon.go 573: Felix is shutting down reason="Received OS signal interrupt"
2018-09-05 02:54:54.933 [FATAL][9400] daemon.go 636: Exiting immediately reason="Received OS signal interrupt"

Alpine Linux has no /etc/nsswitch.conf file; could that be the cause of this issue?

package main

import (
	"fmt"
	"net"
	"os"
)

// Resolve "localhost" with Go's resolver and print every address returned.
func main() {
	addrs, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Err: %s\n", err)
		os.Exit(1)
	}

	for _, addr := range addrs {
		fmt.Printf("--%s\n", addr)
	}
}

Run:

# without nsswitch.conf  (#hosts:      files dns myhostname)
[root@repo tmp]# go run dns.go 
--115.9.3.123
# with nsswitch.conf
[root@repo tmp]# vim /etc/nsswitch.conf
[root@repo tmp]# 
[root@repo tmp]# go run dns.go 
--127.0.0.1

reference: http://www.lijiaocn.com/%E9%97%AE%E9%A2%98/2017/11/09/problem-docker-not-use-hosts-file.html
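
The experiment above suggests the lookup order depends on whether /etc/nsswitch.conf exists. As far as I understand, Go's built-in resolver on Linux reads that file to decide whether /etc/hosts or DNS is consulted first, and Go releases from that period fell back to a DNS-first order when the file was missing. The sketch below turns the experiment into a check that flags any non-loopback answer for localhost; nonLoopback is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// nonLoopback returns the entries of addrs that parse as IP addresses but
// are not loopback addresses (127.0.0.0/8 or ::1).
func nonLoopback(addrs []string) []string {
	var out []string
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip != nil && !ip.IsLoopback() {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	addrs, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Fprintf(os.Stderr, "lookup failed: %s\n", err)
		os.Exit(1)
	}
	if bad := nonLoopback(addrs); len(bad) > 0 {
		// This is the situation described in this issue.
		fmt.Println("localhost resolved to non-loopback addresses:", bad)
		os.Exit(1)
	}
	fmt.Println("localhost resolves to loopback only:", addrs)
}
```

Running this inside the calico-node container, with and without nsswitch.conf, should show whether the image is affected.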

@annProg hm, that's curious. Simply the existence of the file without any content fixes it?

/etc/nsswitch.conf with hosts: files dns myhostname

$ cat /etc/nsswitch.conf|grep host
hosts:      files dns myhostname

@caseydavenport I'm wondering why this has been closed, as I'm facing the same problem with the latest release (3.2.2). With nsswitch.conf missing, Felix binds to some weird address. This can be fixed by setting FELIX_HEALTHHOST to 127.0.0.1, which circumvents the DNS query for localhost. But calico-node -bird-ready -felix-ready also tries to look up localhost, gets the wrong answer and fails.
I think this is a bug in the image building process of calico. The process should create nsswitch.conf at some point and include it in the image. (Could it be that Alpine Linux included nsswitch.conf in the past?)
In the meantime I've added this to the manifest and everything comes up immediately without changing anything else:

command:
  - /bin/sh
args:
  - -c
  - 'echo ''hosts: files dns myhostname'' >/etc/nsswitch.conf; exec start_runit'

Props to @annProg for figuring this out!
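
The FELIX_HEALTHHOST=127.0.0.1 workaround mentioned above relies on a literal IP not needing any name resolution. A minimal sketch of that distinction follows; needsLookup is a hypothetical helper, and port 0 is used only so the example does not collide with a real health port.

```go
package main

import (
	"fmt"
	"net"
)

// needsLookup reports whether host is a name that must be resolved, as
// opposed to a literal IPv4 or IPv6 address.
func needsLookup(host string) bool {
	return net.ParseIP(host) == nil
}

func main() {
	for _, h := range []string{"localhost", "127.0.0.1"} {
		fmt.Printf("%-10s needs lookup: %v\n", h, needsLookup(h))
	}

	// Binding to the literal address involves no resolver at all.
	// Port 0 asks the kernel for any free port.
	ln, err := net.Listen("tcp", net.JoinHostPort("127.0.0.1", "0"))
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}
```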

Blech, yes you're right.

It appears something has changed in alpine since the previous releases:

docker run -ti calico/node:v3.1.0 sh
Unable to find image 'calico/node:v3.1.0' locally
v3.1.0: Pulling from calico/node
ff3a5c916c92: Already exists 
955c3ca38037: Pull complete 
975c4d33efdf: Pull complete 
0ccec6808bb7: Pull complete 
ecded7c91ddd: Pull complete 
Digest: sha256:0ad99537793e468d18500f9ecb198ba3efa3fc89936e0c63465edb2732d14dee
Status: Downloaded newer image for calico/node:v3.1.0
/ # cat /etc/nsswitch.conf 
hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4
/ # 

I've got a PR here which adds back the contents seen in earlier versions of Calico: https://github.com/projectcalico/node/pull/70

Great, thanks!

Fixed the same issue with the solution by @annProg, with the following minor changes.

Change 1: Set the interface (since I had multiple interfaces)

- name: IP_AUTODETECTION_METHOD
  value: "interface=eno.*"
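
Before applying a pattern like this, it can help to check which interfaces on the host it would select. The sketch below is only an approximation: matchInterfaces is a hypothetical helper, and anchoring the expression to the whole interface name is an assumption, since how Calico itself applies the expression is not shown in this thread.

```go
package main

import (
	"fmt"
	"net"
	"regexp"
)

// matchInterfaces returns the names that match pattern in full.
// Anchoring with ^(?:...)$ is an assumption made for this sketch.
func matchInterfaces(names []string, pattern string) ([]string, error) {
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return nil, err
	}
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out, nil
}

func main() {
	ifaces, err := net.Interfaces()
	if err != nil {
		fmt.Println("cannot list interfaces:", err)
		return
	}
	var names []string
	for _, i := range ifaces {
		names = append(names, i.Name)
	}
	matched, err := matchInterfaces(names, "eno.*")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println("all interfaces:", names)
	fmt.Println("matching eno.*:", matched)
}
```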

Change 2: Replace the liveness probe with httpGet too.

livenessProbe:
  httpGet:
    path: /liveness
    port: 9099

Setup:

  • Docker version 19.03.2
  • Kubernetes 1.16.1
  • Calico 3.9
  • RedHat 7.3

The problem seems to come from multiple interfaces. And - maybe - cgroupfs instead of systemd!?
I am a newbie in k8s and just followed the nice tutorial "Building a Kubernetes Cluster in VirtualBox with Ubuntu", with a setup of 3 VirtualBox VMs running ubuntu-16.04.6-server-amd64.iso with docker 18.09.1 underneath kubernetes 1.16.3, configured with calico 3.10 (kubectl apply -f https://docs.projectcalico.org/v3.10/manifests/calico.yaml).
My network is configured as:

  • adapter1="Host-Only Network" (enp0s3)
  • adapter2="NAT" (enp0s8)

So, when I start my cluster with sudo kubeadm init --apiserver-advertise-address=192.168.249.20 --pod-network-cidr=192.168.0.0/16 --v=5 and watch kubectl get pods --all-namespaces, I see everything up and running. After doing all the requested

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

and trying to join my worker1 to the cluster with

sudo kubeadm join 192.168.249.20:6443 --token t5viv4.t7mq0fw44mn0vzad \
    --discovery-token-ca-cert-hash sha256:<hash> --v=2

I end up with kubectl describe pod -n kube-system calico-node-2jvq5, which shows me the same problem:
Warning Unhealthy 25m (x1133 over 11h) kubelet, worker1 Liveness probe failed: calico/node is not ready: Felix is not live: Get http://localhost:9099/liveness: dial tcp [::1]:9099: connect: connection refused

I see all the fixes above but - as a newbie - I am a little bit lost. To make a long story short: **Why is this not fixed in calico 3.10.1 in a way that it just works out of the box?**

@JRGit4UE We have not found a way to always select the correct interface on a host at startup. If you have ideas and time to implement them, we would be happy to review designs and PRs for better auto-detection of the interface that Calico uses.
If you have tried adjusting the auto-detection and Calico is selecting the correct interface/IP, yet the liveness problem still persists, please create a new bug and review or attach logs from a calico-node pod that is failing liveness checks.
