I am operating a cluster on Google Kubernetes Engine:
Node version: 1.11.4-gke.8
Node image: Container-Optimized OS (cos)
Machine type: g1-small (1 vCPU, 1.7 GB memory)
I randomly ran into the following issue with a node pool freshly created via the Google Cloud UI:
The cluster is not able to start any of the calico-node pods; their probes fail with what looks like a network issue:
Readiness probe failed: Get http://localhost:9099/readiness: dial tcp [::1]:9099: connect: connection refused
Liveness probe failed: Get http://localhost:9099/liveness: dial tcp [::1]:9099: connect: connection refused
Name: calico-node-7b7nw
Namespace: kube-system
Priority: 2000001000
PriorityClassName: system-node-critical
Node: gke-proto-cluster-ha-1-pool-1-615658e1-m9wb/10.166.0.2
Start Time: Tue, 11 Dec 2018 16:01:36 +0100
Labels: controller-revision-hash=3442049184
k8s-app=calico-node
pod-template-generation=66
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
Status: Running
IP: 10.166.0.2
Controlled By: DaemonSet/calico-node
Containers:
calico-node:
Container ID: docker://8eca9d335446ea39898510d212963d099e8c7b959b858760de1b469d7ccd6727
Image: gcr.io/projectcalico-org/node:v3.2.4
Image ID: docker-pullable://gcr.io/projectcalico-org/node@sha256:f17f7afa96698563fbcd9b53b46f1ece8b2a9a043f43bdb0b5e17203ce4dfbf9
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 11 Dec 2018 16:08:29 +0100
Finished: Tue, 11 Dec 2018 16:08:30 +0100
Ready: False
Restart Count: 6
Requests:
cpu: 100m
Liveness: http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: http-get http://localhost:9099/readiness delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
CALICO_DISABLE_FILE_LOGGING: true
CALICO_NETWORKING_BACKEND: none
DATASTORE_TYPE: kubernetes
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_HEALTHENABLED: true
FELIX_IGNORELOOSERPF: true
FELIX_IPTABLESMANGLEALLOWACTION: RETURN
FELIX_IPV6SUPPORT: false
FELIX_LOGSEVERITYSYS: none
FELIX_LOGSEVERITYSCREEN: warning
FELIX_PROMETHEUSMETRICSENABLED: true
FELIX_REPORTINGINTERVALSECS: 0
FELIX_TYPHAK8SSERVICENAME: calico-typha
IP:
NO_DEFAULT_POOLS: true
NODENAME: (v1:spec.nodeName)
WAIT_FOR_DATASTORE: true
Mounts:
/etc/calico from etc-calico (ro)
/lib/modules from lib-modules (ro)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-sa-token-4z9gj (ro)
install-cni:
Container ID: docker://d91ef14f516b1d5781584f3b97c4a427d68b6e9dd73cb123ebec76b0fe7e2161
Image: gcr.io/projectcalico-org/cni:v3.2.4
Image ID: docker-pullable://gcr.io/projectcalico-org/cni@sha256:0543787fa8be26349ad6512f28a71d3068fb7fe422cd5bdeb13791da65bff841
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Running
Started: Tue, 11 Dec 2018 16:02:04 +0100
Ready: True
Restart Count: 0
Environment:
CNI_CONF_NAME: 10-calico.conflist
CNI_NETWORK_CONFIG: {
"name": "k8s-pod-network",
"cniVersion": "0.3.0",
"plugins": [
{
"type": "calico",
"mtu": 1460,
"log_level": "warning",
"datastore_type": "kubernetes",
"nodename": "__KUBERNETES_NODE_NAME__",
"ipam": {
"type": "host-local",
"subnet": "usePodCidr"
},
"policy": {
"type": "k8s",
"k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
},
"kubernetes": {
"k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
"kubeconfig": "__KUBECONFIG_FILEPATH__"
}
},
{
"type": "portmap",
"capabilities": {"portMappings": true},
"snat": true
}
]
}
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-sa-token-4z9gj (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
etc-calico:
Type: HostPath (bare host directory volume)
Path: /etc/calico
HostPathType:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /home/kubernetes/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
calico-sa-token-4z9gj:
Type: Secret (a volume populated by a Secret)
SecretName: calico-sa-token-4z9gj
Optional: false
QoS Class: Burstable
Node-Selectors: projectcalico.org/ds-ready=true
Tolerations: :NoSchedule
:NoExecute
:NoSchedule
:NoExecute
CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 7m kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb pulling image "gcr.io/projectcalico-org/node:v3.2.4"
Normal Pulled 6m kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Successfully pulled image "gcr.io/projectcalico-org/node:v3.2.4"
Normal Pulling 6m kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb pulling image "gcr.io/projectcalico-org/cni:v3.2.4"
Normal Pulled 6m kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Successfully pulled image "gcr.io/projectcalico-org/cni:v3.2.4"
Normal Created 6m kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Created container
Normal Started 6m kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Started container
Warning Unhealthy 6m (x2 over 6m) kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Readiness probe failed: Get http://localhost:9099/readiness: dial tcp [::1]:9099: connect: connection refused
Warning Unhealthy 6m (x3 over 6m) kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Liveness probe failed: Get http://localhost:9099/liveness: dial tcp [::1]:9099: connect: connection refused
Normal Created 5m (x3 over 6m) kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Created container
Normal Started 5m (x3 over 6m) kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Started container
Normal Pulled 5m (x3 over 6m) kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Container image "gcr.io/projectcalico-org/node:v3.2.4" already present on machine
Warning BackOff 2m (x23 over 6m) kubelet, gke-proto-cluster-ha-1-pool-1-615658e1-m9wb Back-off restarting failed container
How can this be the case for a newly created node pool that is managed by Google, and how can it be resolved?
I haven't changed anything in the kube-system namespace.
Thanks a lot for your support :)
Hm, the only thing I can think of is that the container is failing to run Felix for some reason.
Are there any ERROR/FATAL logs (e.g. kubectl logs -n kube-system calico-node-7b7nw calico-node) that might indicate a problem?
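In case it helps with sifting through the output, here is a minimal sketch in Python that keeps only the ERROR/FATAL lines. It assumes the `[LEVEL][pid]` prefix seen in the calico-node log lines quoted in this thread; the function name `severe_lines` is made up for illustration.

```python
import re

# Matches the "[LEVEL][pid]" tag, e.g. "[FATAL][733]" (format assumed from
# the log lines quoted in this thread).
LEVEL_RE = re.compile(r"\[(?P<level>[A-Z]+)\]\[\d+\]")


def severe_lines(log_text, levels=("ERROR", "FATAL")):
    """Return the log lines whose level tag is one of `levels`."""
    hits = []
    for line in log_text.splitlines():
        match = LEVEL_RE.search(line)
        if match and match.group("level") in levels:
            hits.append(line)
    return hits
```

One way to use it would be to save the output of the kubectl logs command to a file and pass the file contents to `severe_lines`.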
@tech348712013870132 any more information on this one?
I just followed the guide here: https://docs.projectcalico.org/v3.4/getting-started/kubernetes/ and ran into this same issue. I looked at the logs with: kubectl logs -n kube-system calico-node-szjc8 | grep FATAL and it prints:
2019-01-02 16:07:02.195 [FATAL][733] int_dataplane.go 824: Kernel's RPF check is set to 'loose'. This would allow endpoints to spoof their IP address. Calico requires net.ipv4.conf.all.rp_filter to be set to 0 or 1. If you require loose RPF and you are not concerned about spoofing, this check can be disabled by setting the IgnoreLooseRPF configuration parameter to 'true'.
Trying to figure out how I can change this but not having much luck.
If you're trying to figure out how to set net.ipv4.conf.all.rp_filter: it is a sysctl setting, and https://www.google.com/search?q=setting+sysctl+parameters&oq=setting+sysctl should help with setting it.
If you want to set IgnoreLooseRPF (not recommended) see https://docs.projectcalico.org/v3.4/reference/felix/configuration.
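To see which mode a node is currently in, a small Python sketch like the following could be run on the node itself. It assumes the sysctl is exposed at the usual /proc/sys path, and the helper names are hypothetical. The meaning of the values (0 = no source validation, 1 = strict, 2 = loose) comes from the Linux kernel documentation, and the "0 or 1" requirement comes from the FATAL message above.

```python
from pathlib import Path

# Standard procfs location for net.ipv4.conf.all.rp_filter (assumed present).
RP_FILTER_PATH = Path("/proc/sys/net/ipv4/conf/all/rp_filter")

# Meaning of each value per the Linux kernel documentation.
MODES = {"0": "no source validation", "1": "strict", "2": "loose"}


def describe_rp_filter(raw):
    """Return (mode, accepted) where accepted means Felix's check passes."""
    value = raw.strip()
    mode = MODES.get(value, "unknown")
    # The FATAL message says Calico requires the value to be 0 or 1.
    accepted = value in ("0", "1")
    return mode, accepted


def read_rp_filter(path=RP_FILTER_PATH):
    """Read the raw sysctl value from procfs."""
    return path.read_text()
```

Calling `describe_rp_filter(read_rp_filter())` on an affected node should report the loose mode that the FATAL message complains about.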
A temporary workaround on Google Kubernetes Engine is to disable the Calico network policy for the cluster in the cluster dashboard.
@tech348712013870132 could you check the logs as suggested above? Otherwise I can't tell what the root cause is.
One of my GKE clusters is in the same state, following an upgrade from Kubernetes 1.10 to v1.11.7-gke.6.
All nodes have status Ready,SchedulingDisabled:
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
gke-staging-default-pool-914b1510-0888 Ready,SchedulingDisabled <none> 1h v1.11.7-gke.6 xx.xx.xx.xx Container-Optimized OS from Google 4.14.91+ docker://17.3.2
gke-staging-default-pool-914b1510-0q69 Ready,SchedulingDisabled <none> 1h v1.11.7-gke.6 xx.xx.xx.xx Container-Optimized OS from Google 4.14.91+ docker://17.3.2
gke-staging-default-pool-914b1510-hsck Ready,SchedulingDisabled <none> 1h v1.11.7-gke.6 xx.xx.xx.xx Container-Optimized OS from Google 4.14.91+ docker://17.3.2
gke-staging-default-pool-914b1510-tt25 Ready,SchedulingDisabled <none> 1h v1.11.7-gke.6 xx.xx.xx.xx Container-Optimized OS from Google 4.14.91+ docker://17.3.2
In all of the calico-node containers in the kube-system namespace, I see multiple log messages of the format:
[FATAL][12127] daemon.go 445: Failed to connect to Typha error=dial tcp 10.63.253.250:5473: connect: connection refused
That seems likely, since all of the typha pods are pending:
$ kubectl -n kube-system get po | grep typha
calico-typha-5b857668fd-bqwjb 0/1 Pending 0 32m
calico-typha-horizontal-autoscaler-5ff7f558cc-s85c6 0/1 Pending 0 32m
calico-typha-vertical-autoscaler-5d4bf57df5-8mkhv 0/1 Pending 0 32m
I'm not too familiar with the setup, but it seems like a bit of a chicken-and-egg scenario: the typha pod is pending due to node scheduling being disabled, and node scheduling is disabled due to calico-node failing to reach typha and failing healthchecks?
To fix this issue, I disabled network policy for the cluster and added a new node pool. The new nodes came up successfully, and I removed the old node pool.
@klingerf thanks for reporting.
I don't think this particular instance is a chicken-and-egg scenario - the calico-node pods failing to reach Typha shouldn't result in the nodes being unschedulable. I suspect there is another reason that Typha is not being scheduled on those nodes.
I'd recommend checking the output of kubectl describe on one of the typha pods and seeing what events have been reported. Specifically I'd look for things like lack of resources, or taints preventing scheduling.
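As a rough complement to kubectl describe, the scheduler's reason is also recorded on each pod's `PodScheduled` condition, so it can be pulled out for all pending pods at once. This is only a sketch: it assumes the JSON produced by `kubectl -n kube-system get pods -o json`, and the function names are hypothetical.

```python
import json


def unscheduled_pods(pod_list):
    """Return (name, reason, message) for pods whose PodScheduled condition is False."""
    results = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        for cond in pod.get("status", {}).get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                results.append((name, cond.get("reason", ""), cond.get("message", "")))
    return results


def unscheduled_from_json(text):
    """Parse the JSON text of a pod list and report unscheduled pods."""
    return unscheduled_pods(json.loads(text))
```

The message field is where things like insufficient resources or untolerated taints would normally show up.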