Calico: Typha is missing tolerations

Created on 20 Apr 2018 · 5 Comments · Source: projectcalico/calico

I'm installing a k8s rig using:
https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml

Using these instructions:
https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/calico#installing-with-the-kubernetes-api-datastoremore-than-50-nodes
(so I've edited the manifest to raise the number of Typha replicas above 0)

Expected Behavior

I expect calico-node and Typha to both start.

Current Behavior

calico-node doesn't start - logs say:

2018-04-20 14:13:31.455 [INFO][71] sync_client.go 129: Connecting to Typha. address="10.99.27.206:5473" connID=0x0
2018-04-20 14:13:32.425 [INFO][71] health.go 150: Overall health summary=&health.HealthReport{Live:false, Ready:false}
2018-04-20 14:13:32.454 [ERROR][71] sync_client.go 132: Failed to connect to Typha, retrying... error=dial tcp 10.99.27.206:5473: getsockopt: connection refused

typha doesn't start (status=Pending) - kubectl describe says:

Warning  FailedScheduling  2m (x33 over 12m)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

Looking at the nodes:

ubuntu@ip-10-0-1-91:~$ kubectl get no -o yaml | grep taint -A 5
    taints:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
  status:
    addresses:
    - address: 10.0.1.91

Possible Solution

Looking at the calico.yaml manifest, I see calico-node has more tolerations than typha. Given that calico-node needs typha to start, shouldn't those tolerations be the same?
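To make the scheduling failure concrete, here is a minimal Python sketch of how taints and tolerations are matched. It is a simplified model, not the scheduler's actual code. The two toleration sets are assumptions about what the manifest contains (a blanket NoSchedule toleration on calico-node, only CriticalAddonsOnly on typha), and the names `Taint`, `Toleration`, `tolerates` and `schedulable` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: Optional[str] = None     # None matches any key (only valid with Exists)
    operator: str = "Equal"       # "Equal" or "Exists"
    value: str = ""
    effect: Optional[str] = None  # None matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if tol.effect is not None and tol.effect != taint.effect:
        return False
    if tol.key is None:
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    return tol.operator == "Exists" or tol.value == taint.value


def schedulable(tolerations: List[Toleration], taints: List[Taint]) -> bool:
    """A pod fits a node only if every blocking taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


# The taint shown in the kubectl output above.
master_taints = [Taint(key="node-role.kubernetes.io/master", effect="NoSchedule")]

# ASSUMPTION: illustrative toleration sets, not copied from calico.yaml.
calico_node_tolerations = [
    Toleration(operator="Exists", effect="NoSchedule"),
    Toleration(key="CriticalAddonsOnly", operator="Exists"),
]
typha_tolerations = [Toleration(key="CriticalAddonsOnly", operator="Exists")]

print(schedulable(calico_node_tolerations, master_taints))  # True
print(schedulable(typha_tolerations, master_taints))        # False
```

Under these assumptions, calico-node lands on the tainted master while typha stays Pending, which matches the FailedScheduling event above.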

Steps to Reproduce (for bugs)

Run this on a single-node cluster set up by kubeadm:
https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/calico#installing-with-the-kubernetes-api-datastoremore-than-50-nodes

kind/bug

Source

lwr20

Most helpful comment

@lwr20 and I discussed. I've been a little concerned about typha scheduling for a while, since this came up once or twice this week, but @lwr20 confirmed that this just worked once he added a second node, so that removes my worry (more thinking below to make sure I know why this works).

In the particular scenario above, if you're running a single-node cluster, I think you should be removing the "NoSchedule" taint, or it'll be a pretty useless cluster! I _think_ typha bootstrapping goes like this:

1. kubelets mark all nodes as NotReady due to lack of CNI plugin
2. Initially, typha can't be scheduled, but calico/node can.
3. calico/node gets scheduled to each host
4. calico/node itself can't start (due to no typha) but the secondary container inside the pod installs the CNI plugin
5. kubelet removes the NotReady taint
6. typha gets scheduled
7. calico/node connects to typha
8. done

The problem in a single-node cluster is that, if your master has the NoSchedule taint, then typha can't be scheduled, so we'll get stuck at the "typha gets scheduled" step.
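The bootstrap order described above can be sketched as a small Python simulation. This is only a toy model of the sequence in this comment, not how Kubernetes or Calico implement it; the `bootstrap` function and the node names are hypothetical, and each node is reduced to a single flag saying whether it carries the master NoSchedule taint.

```python
from typing import Dict, List


def bootstrap(nodes: Dict[str, bool]) -> List[str]:
    """Walk through the bootstrap steps and return the ones reached.

    `nodes` maps a node name to True if that node carries the master
    NoSchedule taint, False otherwise.
    """
    steps = [
        "nodes NotReady: no CNI plugin",
        "calico/node scheduled on every node",
        "secondary container installs the CNI plugin",
        "nodes become Ready",
    ]
    # typha does not tolerate the master taint, so it needs an untainted node.
    untainted = [name for name, tainted in nodes.items() if not tainted]
    if not untainted:
        steps.append("STUCK: typha Pending, every node has a NoSchedule taint")
        return steps
    steps.append(f"typha scheduled on {untainted[0]}")
    steps.append("calico/node connects to typha")
    return steps


# Single tainted master: the sequence stalls before typha is scheduled.
print(bootstrap({"master": True})[-1])

# Adding a second, untainted node lets the sequence complete.
print(bootstrap({"master": True, "worker": False})[-1])
```

The second call mirrors the observation that everything worked once a second node was added.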

fasaxc on 23 Apr 2018

All 5 comments

@lwr20 do you have a cluster with more than 50 nodes? If so, are all of them tainted with some sort of NoSchedule taint?

Typically there will be at least one node in the cluster without such a taint. I believe it's desirable not to allow scheduling of Typha on every single node (same for kube-controllers). It's only calico/node that needs to run everywhere.

caseydavenport on 20 Apr 2018

I was running a single node cluster in this case.

lwr20 on 23 Apr 2018

Yep, that order of operations is mostly correct, though I'd note that the kubelet doesn't remove the NoSchedule taint; it just marks the node as Ready (which is done through node conditions, not through taints).
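The distinction between readiness and taints can be illustrated with a minimal Python sketch over a Node object represented as a plain dictionary. The field layout follows the Node API (`spec.taints`, `status.conditions`), but the sample node and the helper names `is_ready` and `noschedule_taints` are made up for illustration.

```python
from typing import List


def is_ready(node: dict) -> bool:
    """Readiness comes from the 'Ready' entry in status.conditions."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def noschedule_taints(node: dict) -> List[str]:
    """Taints live in spec.taints and are independent of readiness."""
    taints = node.get("spec", {}).get("taints", [])
    return [t["key"] for t in taints if t.get("effect") == "NoSchedule"]


# Hypothetical node: Ready, yet still carrying the master taint.
node = {
    "spec": {
        "taints": [
            {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}
        ]
    },
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}

print(is_ready(node))           # True
print(noschedule_taints(node))  # ['node-role.kubernetes.io/master']
```

A node can therefore report Ready while still refusing pods that lack a matching toleration.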

Agreed that if no nodes in the cluster are schedule-able then Typha won't start and we'll get stuck, but like you said it's pretty useless to have a cluster with no schedule-able nodes, so I'm not sure there's anything to do.

If your cluster has schedule-able nodes all should go well I think.

caseydavenport on 23 Apr 2018

To summarize, I don't think this is a bug so long as this works with at least one schedule-able node in the cluster.

If this still doesn't work when the master is made schedule-able, then we should identify why and fix that.
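As a rough illustration of what "making the master schedule-able" changes, the Python sketch below removes the master taint from a node represented as a dictionary and checks whether any NoSchedule taint remains. The helper names are hypothetical and this only models the effect on the node object. On a real kubeadm cluster, the equivalent step described in kubeadm's documentation of that era is `kubectl taint nodes --all node-role.kubernetes.io/master-`.

```python
MASTER_TAINT_KEY = "node-role.kubernetes.io/master"


def without_master_taint(node: dict) -> dict:
    """Return a copy of the node with the master taint removed."""
    spec = node.get("spec", {})
    kept = [t for t in spec.get("taints", []) if t.get("key") != MASTER_TAINT_KEY]
    new_node = dict(node)
    new_node["spec"] = {**spec, "taints": kept}
    return new_node


def has_noschedule(node: dict) -> bool:
    """True if the node still carries any NoSchedule taint."""
    taints = node.get("spec", {}).get("taints", [])
    return any(t.get("effect") == "NoSchedule" for t in taints)


# Hypothetical single-node cluster with the default master taint.
master = {"spec": {"taints": [{"key": MASTER_TAINT_KEY, "effect": "NoSchedule"}]}}
schedulable_master = without_master_taint(master)

print(has_noschedule(master))              # True
print(has_noschedule(schedulable_master))  # False
```

Once no NoSchedule taint remains, a pod without any tolerations, such as typha in this scenario, is no longer blocked by taints on that node.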

@lwr20 can you confirm if this is fixed when you make the master node schedule-able?

caseydavenport on 23 Apr 2018
