I'm installing a k8s rig using:
https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
Using these instructions:
https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/calico#installing-with-the-kubernetes-api-datastoremore-than-50-nodes
(so I've edited the manifest to raise the number of Typha replicas above 0)
I expect that to work.
calico-node doesn't start - logs say:
2018-04-20 14:13:31.455 [INFO][71] sync_client.go 129: Connecting to Typha. address="10.99.27.206:5473" connID=0x0
2018-04-20 14:13:32.425 [INFO][71] health.go 150: Overall health summary=&health.HealthReport{Live:false, Ready:false}
2018-04-20 14:13:32.454 [ERROR][71] sync_client.go 132: Failed to connect to Typha, retrying... error=dial tcp 10.99.27.206:5473: getsockopt: connection refused
typha doesn't start (status=Pending) - describe says:
Warning FailedScheduling 2m (x33 over 12m) default-scheduler 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Looking at the nodes:
ubuntu@ip-10-0-1-91:~$ kubectl get no -o yaml | grep taint -A 5
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.0.1.91
Looking at the calico.yaml manifest, I see calico-node has more tolerations than typha. Given that calico-node needs typha to start, shouldn't those tolerations be the same?
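To make the question concrete, here is a minimal sketch of how toleration matching decides whether a pod can land on a tainted node. It is a simplified model of the Kubernetes matching rules, not the scheduler's actual code, and the two toleration lists are illustrative assumptions rather than copies of what is in calico.yaml.

```python
# Simplified model of toleration/taint matching (not the real scheduler code).
# The toleration lists further down are assumptions made for illustration.


def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # A toleration with no effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # Exists with no key tolerates every taint key.
        return key is None or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value") == taint.get("value")


def schedulable(tolerations, taints):
    """A pod fits a node only if every blocking taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


# The taint reported on the single node in this issue.
master_taints = [{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}]

# Hypothetical toleration sets: one broad, one narrow.
calico_node_tolerations = [
    {"effect": "NoSchedule", "operator": "Exists"},
    {"key": "CriticalAddonsOnly", "operator": "Exists"},
]
typha_tolerations = [{"key": "CriticalAddonsOnly", "operator": "Exists"}]

print(schedulable(calico_node_tolerations, master_taints))  # True
print(schedulable(typha_tolerations, master_taints))        # False
```

With the narrower list, the master taint is left untolerated, which matches the FailedScheduling warning above.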
Run this on a single-node cluster set up by kubeadm.
https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/calico#installing-with-the-kubernetes-api-datastoremore-than-50-nodes
@lwr20 do you have a cluster with more than 50 nodes? If so, are all of them tainted with some sort of NoSchedule taint?
Typically there will be at least one node in the cluster without such a taint. I believe it's desirable not to allow scheduling of Typha on every single node (same for kube-controllers). It's only calico/node that needs to run everywhere.
I was running a single node cluster in this case.
@lwr20 and I discussed this. I've been a little concerned about Typha scheduling for a while, since it came up once or twice this week, but @lwr20 confirmed that it just worked once he added a second node, which removes my worry (more thinking below to make sure I know why this works).
In the particular scenario above, if you're running a single-node cluster, I think you should remove the "NoSchedule" taint, or it'll be a pretty useless cluster! I _think_ Typha bootstrapping goes like this:
The problem in a single-node cluster is that, if your master has NoSchedule, then Typha won't start, so we'll get stuck at that step.
Yep, that order of operations is mostly correct, though I'd note that the kubelet doesn't remove the NoSchedule taint; it just marks the node as Ready (which is done through node conditions, not taints).
Agreed that if no nodes in the cluster are schedule-able then Typha won't start and we'll get stuck, but like you said it's pretty useless to have a cluster with no schedule-able nodes, so I'm not sure there's anything to do.
If your cluster has at least one schedule-able node, all should go well, I think.
To summarize, I don't think this is a bug so long as this works with at least one schedule-able node in the cluster.
If this still doesn't work when the master is made schedule-able, then we should identify why and fix that.
@lwr20 can you confirm if this is fixed when you make the master node schedule-able?
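For reference, on a kubeadm cluster making the master schedule-able is usually done with `kubectl taint nodes --all node-role.kubernetes.io/master-`. The sketch below only models the effect of that on a node object so the before and after are easy to compare; `without_taint` is a hypothetical helper, not part of any Kubernetes client, and the node dict is trimmed to the fields shown in the output above.

```python
import copy


def without_taint(node, key, effect=None):
    """Return a copy of a node object with matching taints removed.

    Hypothetical helper for illustration. If effect is None, taints with
    the given key are removed regardless of effect.
    """
    result = copy.deepcopy(node)
    spec = result.get("spec", {})
    kept = [
        t for t in spec.get("taints", [])
        if not (t["key"] == key and effect in (None, t["effect"]))
    ]
    if kept:
        spec["taints"] = kept
    else:
        # Drop the now-empty list so the spec carries no taints field.
        spec.pop("taints", None)
    return result


# Node trimmed to the fields visible in the kubectl output in this issue.
master = {
    "spec": {
        "taints": [
            {"effect": "NoSchedule", "key": "node-role.kubernetes.io/master"}
        ]
    },
    "status": {"addresses": [{"address": "10.0.1.91"}]},
}

untainted = without_taint(master, "node-role.kubernetes.io/master")
print(untainted["spec"])  # {}
```

Once the taint is gone, a pod with no matching toleration is no longer excluded from the node, so Typha should be able to leave Pending.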