I originally posted this here: https://github.com/kubernetes/autoscaler/issues/1064
Greetings,
I'm running cluster-autoscaler as part of a Kops cluster running on AWS. The problem is simple: when I run cluster-autoscaler on a regular node, it works fine, but when it runs on a master node, the pod times out, enters a CrashLoopBackOff, and retries to no avail. The following log message is the most relevant:
F0715 16:55:30.458743 1 main.go:319] Failed to get nodes from apiserver: Get https://100.64.0.1:443/api/v1/nodes: dial tcp 100.64.0.1:443: i/o timeout
goroutine 1 [running]:
Kops version: 1.9.1
Kubernetes version: 1.9.6
Cloud provider: AWS
Commands run: Followed the commands in the Kops cluster-autoscaler documentation.
What happened after the commands executed?: The pod starts, and then goes into a CrashLoopBackOff loop.
What did you expect to happen?: The cluster-autoscaler app goes into its normal loop of checking whether the ASG is at its target node size.
Extra notes:
annotations:
  iam.amazonaws.com/role: CompanyIamRole-ClusterAutoscalerEc2
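For reference, a minimal sketch of where that annotation ends up in the Deployment's pod template (assuming kube2iam or a similar tool is what consumes it; the role name is the example above):

spec:
  template:
    metadata:
      annotations:
        # Picked up by kube2iam (or similar) to grant the pod this IAM role
        iam.amazonaws.com/role: CompanyIamRole-ClusterAutoscalerEc2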
Cluster manifest and full log output: https://gist.github.com/sc250024/81525c85f3cdfc60349b3bfdcce755af
I had the exact same issue, and I fixed it by deleting the calico-node pod on the master node, then deleting the autoscaler pod. The autoscaler started fine afterwards.
I am thinking of using the cluster-autoscaler chart instead, but it doesn't carry the toleration for running only on master nodes that the kops autoscaler resource has. So, I was wondering whether there is a specific reason to run cluster-autoscaler on master nodes instead of on regular worker nodes. (Is it to prevent the node running cluster-autoscaler from being removed by the autoscaler itself?)
@prat0318 You can use the chart and run it on the master. Put the following in your values.yaml file.
First, add a toleration for the master taint so the pod can be scheduled there:
tolerations:
- effect: "NoSchedule"
  key: "node-role.kubernetes.io/master"
Optionally, you can pin it to run only on the master nodes:
nodeSelector:
  kubernetes.io/role: "master"
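Putting the two together, the relevant part of values.yaml would look like this (a sketch; on a kops cluster the masters already carry the kubernetes.io/role=master label, so the selector should work out of the box):

# Tolerate the master taint so the pod may land there...
tolerations:
- effect: "NoSchedule"
  key: "node-role.kubernetes.io/master"
# ...and restrict it to master nodes only
nodeSelector:
  kubernetes.io/role: "master"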
@sc250024 Thanks a lot, that worked. I'm still wondering whether there is a benefit to running it on master nodes instead of regular worker nodes?
@prat0318 It depends on your setup, of course. I think the idea behind running it on the master nodes is that they are more stable than the worker nodes. Personally, my worker nodes all run on spot instances, whereas my master nodes are dedicated.
If you're using EKS, AFAIK you can't schedule pods on the master nodes since they're managed by AWS. In that case, you can use the following podAnnotations value to give the Cluster Autoscaler pods higher priority on your worker nodes:
podAnnotations:
  scheduler.alpha.kubernetes.io/critical-pod: ""
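As an aside, the critical-pod annotation is deprecated on newer Kubernetes versions in favor of pod priority; if your chart version exposes a priorityClassName value (an assumption — check your chart), the equivalent would be something like:

# Uses the built-in system-cluster-critical PriorityClass instead of the
# deprecated critical-pod annotation (assumes the chart passes this through)
priorityClassName: "system-cluster-critical"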
Thanks @sc250024 !