Calico: Google Kubernetes Engine: Readiness and liveness probes fail

Created on 11 Dec 2018 · 8 Comments · Source: projectcalico/calico

I am operating a cluster on Google Kubernetes Engine:

Node version: 1.11.4-gke.8
Node image: Container-Optimized OS (cos)
Machine type: g1-small (1 vCPU, 1.7 GB memory)

I ran into the following issue, seemingly at random, with a node pool freshly created via the Google Cloud UI:

The cluster is not able to start any of the calico-node pods because of this network issue:

Readiness probe failed: Get http://localhost:9099/readiness: dial tcp [::1]:9099: connect: connection refused

Liveness probe failed: Get http://localhost:9099/liveness: dial tcp [::1]:9099: connect: connection refused
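
For readers who want to reproduce what the kubelet is doing here, the following is a minimal sketch of an HTTP health probe in Python. It is not the kubelet's implementation; the function name `probe` and the three result strings are made up for illustration. It shows the difference between an endpoint that answers with an error status and one where nothing is listening at all, which is what "connection refused" means in the messages above.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 1.0) -> str:
    """Return 'healthy', 'unhealthy', or 'unreachable' for an HTTP endpoint.

    Hypothetical helper for illustration; the names are not from Calico
    or Kubernetes.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # The kubelet counts any status from 200 to 399 as success.
            return "healthy" if 200 <= resp.status < 400 else "unhealthy"
    except urllib.error.HTTPError:
        # The server answered, but with a 4xx/5xx status.
        return "unhealthy"
    except (urllib.error.URLError, OSError):
        # Nothing accepted the connection (e.g. "connection refused")
        # or the request timed out.
        return "unreachable"


if __name__ == "__main__":
    for path in ("liveness", "readiness"):
        print(path, probe(f"http://localhost:9099/{path}"))
```

Run on the affected node (or inside the pod's network namespace), an "unreachable" result would suggest that the process serving port 9099 is not running, rather than running and reporting itself unhealthy.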

Name:               calico-node-7b7nw
Namespace:          kube-system
Priority:           2000001000
PriorityClassName:  system-node-critical
Node:               gke-proto-cluster-ha-1-pool-1-615658e1-m9wb/10.166.0.2
Start Time:         Tue, 11 Dec 2018 16:01:36 +0100
Labels:             controller-revision-hash=3442049184
                    k8s-app=calico-node
                    pod-template-generation=66
Annotations:        scheduler.alpha.kubernetes.io/critical-pod=
Status:             Running
IP:                 10.166.0.2
Controlled By:      DaemonSet/calico-node
Containers:
  calico-node:
    Container ID:   docker://8eca9d335446ea39898510d212963d099e8c7b959b858760de1b469d7ccd6727
    Image:          gcr.io/projectcalico-org/node:v3.2.4
    Image ID:       docker-pullable://gcr.io/projectcalico-org/node@sha256:f17f7afa96698563fbcd9b53b46f1ece8b2a9a043f43bdb0b5e17203ce4dfbf9
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 11 Dec 2018 16:08:29 +0100
      Finished:     Tue, 11 Dec 2018 16:08:30 +0100
    Ready:          False
    Restart Count:  6
    Requests:
      cpu:      100m
    Liveness:   http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  http-get http://localhost:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CALICO_DISABLE_FILE_LOGGING:        true
      CALICO_NETWORKING_BACKEND:          none
      DATASTORE_TYPE:                     kubernetes
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_HEALTHENABLED:                true
      FELIX_IGNORELOOSERPF:               true
      FELIX_IPTABLESMANGLEALLOWACTION:    RETURN
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSYS:               none
      FELIX_LOGSEVERITYSCREEN:            warning
      FELIX_PROMETHEUSMETRICSENABLED:     true
      FELIX_REPORTINGINTERVALSECS:        0
      FELIX_TYPHAK8SSERVICENAME:          calico-typha
      IP:                                 
      NO_DEFAULT_POOLS:                   true
      NODENAME:                            (v1:spec.nodeName)
      WAIT_FOR_DATASTORE:                 true
    Mounts:
      /etc/calico from etc-calico (ro)
      /lib/modules from lib-modules (ro)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-sa-token-4z9gj (ro)
  install-cni:
    Container ID:  docker://d91ef14f516b1d5781584f3b97c4a427d68b6e9dd73cb123ebec76b0fe7e2161
    Image:         gcr.io/projectcalico-org/cni:v3.2.4
    Image ID:      docker-pullable://gcr.io/projectcalico-org/cni@sha256:0543787fa8be26349ad6512f28a71d3068fb7fe422cd5bdeb13791da65bff841
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Running
      Started:      Tue, 11 Dec 2018 16:02:04 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:       10-calico.conflist
      CNI_NETWORK_CONFIG:  {
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "plugins": [
    {
      "type": "calico",
      "mtu": 1460,
      "log_level": "warning",
      "datastore_type": "kubernetes",
      "nodename": "__KUBERNETES_NODE_NAME__",
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s",
        "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
      },
      "kubernetes": {
        "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
        "kubeconfig": "__KUBECONFIG_FILEPATH__"
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    }
  ]
}
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-sa-token-4z9gj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  etc-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/calico
    HostPathType:  
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /home/kubernetes/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  calico-sa-token-4z9gj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-sa-token-4z9gj
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  projectcalico.org/ds-ready=true
Tolerations:     :NoSchedule
                 :NoExecute
                 :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason     Age               From                                                  Message
  ----     ------     ----              ----                                                  -------
  Normal   Pulling    7m                kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  pulling image "gcr.io/projectcalico-org/node:v3.2.4"
  Normal   Pulled     6m                kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Successfully pulled image "gcr.io/projectcalico-org/node:v3.2.4"
  Normal   Pulling    6m                kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  pulling image "gcr.io/projectcalico-org/cni:v3.2.4"
  Normal   Pulled     6m                kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Successfully pulled image "gcr.io/projectcalico-org/cni:v3.2.4"
  Normal   Created    6m                kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Created container
  Normal   Started    6m                kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Started container
  Warning  Unhealthy  6m (x2 over 6m)   kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Readiness probe failed: Get http://localhost:9099/readiness: dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  6m (x3 over 6m)   kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Liveness probe failed: Get http://localhost:9099/liveness: dial tcp [::1]:9099: connect: connection refused
  Normal   Created    5m (x3 over 6m)   kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Created container
  Normal   Started    5m (x3 over 6m)   kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Started container
  Normal   Pulled     5m (x3 over 6m)   kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Container image "gcr.io/projectcalico-org/node:v3.2.4" already present on machine
  Warning  BackOff    2m (x23 over 6m)  kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb  Back-off restarting failed container

How can this be the case for a newly created node pool that is managed by Google, and how can it be resolved?
I haven't changed anything in the kube-system namespace.

Thanks a lot for your support :)

kind/support

All 8 comments

Hm, only thing I can think of is if the container is failing to run Felix for some reason.

Are there any ERROR/FATAL logs (e.g. kubectl logs -n kube-system calico-node-7b7nw calico-node) that might indicate a problem?
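
As a rough sketch of that log check, the Python below pulls ERROR and FATAL entries out of saved `kubectl logs` output. The regular expression is an assumption based only on the log lines quoted in this thread (`[LEVEL][pid] file.go line: message`), and the names `LOG_LINE` and `severe_lines` are hypothetical.

```python
import re
import sys

# Assumed line shape, taken from the examples quoted in this thread:
#   2019-01-02 16:07:02.195 [FATAL][733] int_dataplane.go 824: message
LOG_LINE = re.compile(
    r"\[(?P<level>[A-Z]+)\]\[(?P<pid>\d+)\]\s+"
    r"(?P<source>\S+ \d+):\s*(?P<message>.*)"
)


def severe_lines(log_text, levels=("ERROR", "FATAL")):
    """Return (level, source, message) tuples for lines at the given levels."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and match.group("level") in levels:
            found.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return found


if __name__ == "__main__":
    # Example: kubectl logs -n kube-system <pod> calico-node | python this_script.py
    for level, source, message in severe_lines(sys.stdin.read()):
        print(f"{level} at {source}: {message}")
```

Lines that do not match the assumed shape are skipped silently, so an empty result does not prove the log is clean.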

@tech348712013870132 any more information on this one?

I just followed the guide here: https://docs.projectcalico.org/v3.4/getting-started/kubernetes/ and ran into this same issue. I looked at the logs with: kubectl logs -n kube-system calico-node-szjc8 | grep FATAL and it prints:

2019-01-02 16:07:02.195 [FATAL][733] int_dataplane.go 824: Kernel's RPF check is set to 'loose'.  This would allow endpoints to spoof their IP address.  Calico requires net.ipv4.conf.all.rp_filter to be set to 0 or 1. If you require loose RPF and you are not concerned about spoofing, this check can be disabled by setting the IgnoreLooseRPF configuration parameter to 'true'.

Trying to figure out how I can change this but not having much luck.

If you're trying to figure out how to set net.ipv4.conf.all.rp_filter: it is a sysctl setting, and https://www.google.com/search?q=setting+sysctl+parameters&oq=setting+sysctl should help with setting it.
If you want to set IgnoreLooseRPF (not recommended) see https://docs.projectcalico.org/v3.4/reference/felix/configuration.
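
To see which mode a node is currently in before changing anything, a small Python sketch like the one below can read the value from /proc. The mapping of 0, 1 and 2 to disabled, strict and loose follows the Linux kernel's definition of rp_filter; the function names are hypothetical, and `felix_accepts` only restates the rule from the FATAL message quoted above (0 or 1 is accepted), not Felix's actual code.

```python
from pathlib import Path

# Standard Linux location of the sysctl net.ipv4.conf.all.rp_filter.
RP_FILTER_PATH = Path("/proc/sys/net/ipv4/conf/all/rp_filter")

# Kernel meaning of each value.
MODES = {0: "disabled", 1: "strict", 2: "loose"}


def classify_rp_filter(raw: str) -> str:
    """Map the raw sysctl value (e.g. '2\\n') to a mode name."""
    return MODES.get(int(raw.strip()), "unknown")


def felix_accepts(raw: str) -> bool:
    """True if the value is 0 or 1, as the Felix error message requires."""
    return int(raw.strip()) in (0, 1)


if __name__ == "__main__":
    if RP_FILTER_PATH.exists():
        value = RP_FILTER_PATH.read_text()
        print("rp_filter mode:", classify_rp_filter(value))
        print("accepted without IgnoreLooseRPF:", felix_accepts(value))
    else:
        print("rp_filter sysctl not found (not a Linux host?)")
```

This only reads the setting; changing it still goes through sysctl as described above.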

A temporary workaround on Google Kubernetes Engine is to disable network policy (Calico) for the cluster in the cluster dashboard.

@tech348712013870132 could you check the logs as suggested above? Otherwise I can't tell what the root cause is.

One of my GKE clusters is in the same state, following an upgrade from Kubernetes 1.10 to v1.11.7-gke.6.

All nodes have status Ready,SchedulingDisabled:

NAME                                     STATUS                     ROLES     AGE       VERSION         EXTERNAL-IP      OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME
gke-staging-default-pool-914b1510-0888   Ready,SchedulingDisabled   <none>    1h        v1.11.7-gke.6   xx.xx.xx.xx      Container-Optimized OS from Google   4.14.91+         docker://17.3.2
gke-staging-default-pool-914b1510-0q69   Ready,SchedulingDisabled   <none>    1h        v1.11.7-gke.6   xx.xx.xx.xx      Container-Optimized OS from Google   4.14.91+         docker://17.3.2
gke-staging-default-pool-914b1510-hsck   Ready,SchedulingDisabled   <none>    1h        v1.11.7-gke.6   xx.xx.xx.xx      Container-Optimized OS from Google   4.14.91+         docker://17.3.2
gke-staging-default-pool-914b1510-tt25   Ready,SchedulingDisabled   <none>    1h        v1.11.7-gke.6   xx.xx.xx.xx      Container-Optimized OS from Google   4.14.91+         docker://17.3.2

In all of the calico-node containers in the kube-system namespace, I see multiple log messages of the format:

[FATAL][12127] daemon.go 445: Failed to connect to Typha error=dial tcp 10.63.253.250:5473: connect: connection refused

That seems likely, since all of the typha pods are pending:

$ kubectl -n kube-system get po | grep typha
calico-typha-5b857668fd-bqwjb                         0/1       Pending   0          32m
calico-typha-horizontal-autoscaler-5ff7f558cc-s85c6   0/1       Pending   0          32m
calico-typha-vertical-autoscaler-5d4bf57df5-8mkhv     0/1       Pending   0          32m

I'm not too familiar with the setup, but it seems like a bit of a chicken-and-egg scenario: the typha pod is pending due to node scheduling being disabled, and node scheduling is disabled due to calico-node failing to reach typha and failing healthchecks?


To fix this issue, I disabled network policy for the cluster and added a new node pool. The new nodes came up successfully, and I removed the old node pool.

@klingerf thanks for reporting.

I don't think this particular instance is a chicken-and-egg scenario - the calico-node pods failing to reach Typha shouldn't result in the nodes being unscheduleable. I suspect there is another reason that Typha is not being scheduled on those nodes.

I'd recommend checking the output of kubectl describe on one of the typha pods and seeing what events have been reported. Specifically I'd look for things like lack of resources, or taints preventing scheduling.
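
One way to go about that check, sketched in Python: pick the Pending pods out of saved `kubectl get po` output and print the matching `kubectl describe pod` commands. The helper names are hypothetical, and the parser assumes the default column order shown earlier in this thread (NAME, READY, STATUS, ...).

```python
from typing import List


def pending_pods(get_po_output: str) -> List[str]:
    """Return names of pods whose STATUS column reads 'Pending'.

    Assumes default `kubectl get po` columns: NAME READY STATUS RESTARTS AGE.
    """
    names = []
    for line in get_po_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "Pending":
            names.append(fields[0])
    return names


def describe_commands(names: List[str], namespace: str = "kube-system") -> List[str]:
    """Build one `kubectl describe pod` command per pod name."""
    return [f"kubectl describe pod -n {namespace} {name}" for name in names]


if __name__ == "__main__":
    import sys

    # Example: kubectl -n kube-system get po | python this_script.py
    for command in describe_commands(pending_pods(sys.stdin.read())):
        print(command)
```

The Events section at the end of each describe output is where scheduling failures such as insufficient resources or untolerated taints would normally be reported.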
