Autoscaler: Autoscaler is not scaling down or up

Created on 9 Jun 2018 · 2 comments · Source: kubernetes/autoscaler

Hello,

I installed the cluster autoscaler using Helm with the following command:

helm install stable/cluster-autoscaler \
  --namespace kube-system \
  --name cluster-autoscaler \
  --set image.tag=v1.1.2 \
  --set awsRegion=$REGION \
  --set rbac.create=true \
  --set autoscalingGroups\[0\].name=nodes.$NAME.$DNS_ZONE \
  --set autoscalingGroups\[0\].minSize=$MIN_NODES \
  --set autoscalingGroups\[0\].maxSize=$MAX_NODES \
  --set podAnnotations."iam\.amazonaws\.com/role"=arn:aws:iam::$ACCOUNT_NUMBER:role/masters.$DNS_ZONE \
  --set nodeSelector."node-role\.kubernetes\.io/master"="" \
  --set tolerations\[0\].effect=NoSchedule \
  --set tolerations\[0\].key="node-role.kubernetes.io/master"
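The three autoscalingGroups[0] values set the node group bounds that cluster-autoscaler itself works with. The container command quoted later in the thread shows them rendered as a single --nodes=<min>:<max>:<name> argument. A minimal sketch of that mapping, assuming the chart simply joins minSize, maxSize and name with colons (the `AutoscalingGroup` class and `nodes_flag` helper are hypothetical, not part of the chart):

```python
from dataclasses import dataclass


@dataclass
class AutoscalingGroup:
    # Hypothetical mirror of the chart values autoscalingGroups[i].name/minSize/maxSize
    name: str
    min_size: int
    max_size: int


def nodes_flag(group: AutoscalingGroup) -> str:
    """Render one group as a --nodes=<min>:<max>:<name> argument (assumed format)."""
    if not 0 <= group.min_size <= group.max_size:
        raise ValueError(
            f"invalid bounds for {group.name}: {group.min_size}..{group.max_size}"
        )
    return f"--nodes={group.min_size}:{group.max_size}:{group.name}"


if __name__ == "__main__":
    # Bounds and group name as they appear in the container command later in the thread
    print(nodes_flag(AutoscalingGroup("nodes.qa-lab.com", 5, 15)))
```

These bounds only tell cluster-autoscaler how far it may go. They do not change the min/max configured on the AWS Auto Scaling group itself.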

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-27T00:13:02Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T11:55:20Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

The cluster has 15 nodes and one master. Not many pods are installed, and I wanted to see whether the autoscaler would scale down. From the logs:

I0609 09:33:13.875968       1 static_autoscaler.go:332] Starting scale down
I0609 09:33:13.902816       1 scale_down.go:387] ip-4-5-6-7.region.compute.internal was unneeded for 6m48.344665061s
I0609 09:33:13.902839       1 scale_down.go:387] 1.2.3.region.compute.internal was unneeded for 6m58.532568459s
I0609 09:33:13.902849       1 scale_down.go:446] No candidates for scale down
I0609 09:33:14.243737       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
I0609 09:33:16.315108       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
I0609 09:33:18.345279       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
I0609 09:33:20.352730       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
I0609 09:33:22.360454       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
I0609 09:33:23.916035       1 static_autoscaler.go:108] Starting main loop
I0609 09:33:24.023676       1 static_autoscaler.go:240] Filtering out schedulables
I0609 09:33:24.024058       1 static_autoscaler.go:250] No schedulable pods
I0609 09:33:24.024076       1 static_autoscaler.go:257] No unschedulable pods
I0609 09:33:24.024085       1 static_autoscaler.go:299] Calculating unneeded nodes
I0609 09:33:24.059105       1 utils.go:399] Skipping ip-6-7-8-9.region.compute.internal - no node group config
I0609 09:33:24.059593       1 scale_down.go:175] Scale-down calculation: ignoring 12 nodes, that were unremovable in the last 5m0s
I0609 09:33:24.059612       1 scale_down.go:207] Node ip-4-5-6-7.region.compute.internal - utilization 0.162500
I0609 09:33:24.059625       1 scale_down.go:207] Node 1.2.3.region.compute.internal - utilization 0.216500
I0609 09:33:24.059638       1 scale_down.go:207] Node example.ip.compute.internal - utilization 0.725000
I0609 09:33:24.059644       1 scale_down.go:211] Node example.ip.compute.internal is not suitable for removal - utilization too big (0.725000)
I0609 09:33:24.101380       1 cluster.go:78] Fast evaluation: 1.2.3.region.compute.internal for removal
I0609 09:33:24.101519       1 cluster.go:200] Pod monitoring/prometheus-operator-5d564c684d-x8shd can be moved to ip-172-20-51-4.eu-west-1.compute.internal
I0609 09:33:24.101724       1 cluster.go:200] Pod monitoring/kube-prometheus-exporter-kube-state-5cd969f745-td4z9 can be moved to ip-172-20-51-4.eu-west-1.compute.internal
I0609 09:33:24.102028       1 cluster.go:109] Fast evaluation: node 1.2.3.region.compute.internal may be removed
I0609 09:33:24.102193       1 static_autoscaler.go:314] 1.2.3.region.compute.internal is unneeded since 2018-06-09 09:26:15.199568305 +0000 UTC duration 7m8.716444732s
I0609 09:33:24.102216       1 static_autoscaler.go:314] ip-4-5-6-7.region.compute.internal is unneeded since 2018-06-09 09:26:25.387471703 +0000 UTC duration 6m58.528541334s
I0609 09:33:24.102226       1 static_autoscaler.go:329] Scale down status: unneededOnly=false lastScaleUpTime=2018-06-06 21:12:31.471420721 +0000 UTC lastScaleDownDeleteTime=2018-06-06 21:12:31.471421196 +0000 UTC lastScaleDownFailTime=2018-06-06 21:12:31.471421535 +0000 UTC schedulablePodsPresent=false isDeleteInProgress=false
I0609 09:33:24.102243       1 static_autoscaler.go:332] Starting scale down
I0609 09:33:24.139374       1 scale_down.go:387] ip-4-5-6-7.region.compute.internal was unneeded for 6m58.528541334s
I0609 09:33:24.139397       1 scale_down.go:387] 1.2.3.region.compute.internal was unneeded for 7m8.716444732s
I0609 09:33:24.139408       1 scale_down.go:446] No candidates for scale down
I0609 09:33:24.368897       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
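In the log above, both low-utilization nodes are below the utilization threshold but have only been unneeded for about seven minutes, which appears to be why scale_down.go reports "No candidates for scale down". A simplified sketch of those two checks, assuming cluster-autoscaler's default values of 0.5 for --scale-down-utilization-threshold and 10 minutes for --scale-down-unneeded-time. The class and function names are hypothetical, and the real code also checks whether pods can be moved and whether the node group is at its minimum size:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta

# Assumed defaults of --scale-down-utilization-threshold and --scale-down-unneeded-time
UTILIZATION_THRESHOLD = 0.5
UNNEEDED_TIME = timedelta(minutes=10)


@dataclass
class NodeState:
    name: str
    utilization: float       # larger of CPU and memory requests / allocatable, as logged
    unneeded_for: timedelta  # how long the node has been continuously unneeded


def is_unneeded(node: NodeState) -> bool:
    """A node counts as unneeded only while its utilization is below the threshold."""
    return node.utilization < UTILIZATION_THRESHOLD


def scale_down_candidates(nodes: list[NodeState]) -> list[str]:
    """Unneeded nodes become candidates only after UNNEEDED_TIME has elapsed."""
    return [n.name for n in nodes if is_unneeded(n) and n.unneeded_for > UNNEEDED_TIME]


if __name__ == "__main__":
    # Utilization and unneeded durations taken from the log lines above
    nodes = [
        NodeState("ip-4-5-6-7", 0.1625, timedelta(minutes=6, seconds=58)),
        NodeState("1.2.3", 0.2165, timedelta(minutes=7, seconds=8)),
        NodeState("example.ip", 0.725, timedelta(0)),
    ]
    print(scale_down_candidates(nodes))
```

With these defaults, the two low-utilization nodes would only become candidates after being unneeded for more than ten minutes.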

Can someone give me a clue?

I have reduced the number of pods running in kube-system, because I read that by default the autoscaler will not scale down a node if a kube-system pod is running on it.
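As described in the cluster-autoscaler FAQ, this rule is narrower than "any kube-system pod". DaemonSet and mirror (static) pods are ignored, and kube-system pods covered by a PodDisruptionBudget can be moved. A simplified sketch of just that rule. The `Pod` fields are hypothetical, and the real autoscaler has further conditions (for example unreplicated pods and pods with local storage):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Pod:
    # Hypothetical, simplified view of a pod; field names are illustrative only
    name: str
    namespace: str
    owner_kind: str | None = None  # e.g. "ReplicaSet", "DaemonSet", or None
    is_mirror: bool = False        # static pod created by the kubelet
    has_pdb: bool = False          # covered by a PodDisruptionBudget that allows eviction


def blocks_scale_down(pod: Pod) -> bool:
    """Simplified kube-system rule: only unprotected, non-node-local kube-system pods block."""
    if pod.is_mirror or pod.owner_kind == "DaemonSet":
        return False  # these run on every node anyway, so they are not moved
    return pod.namespace == "kube-system" and not pod.has_pdb


def blocking_pods(pods: list[Pod]) -> list[str]:
    """Names of pods that would keep the node from being removed under this rule."""
    return [p.name for p in pods if blocks_scale_down(p)]


if __name__ == "__main__":
    pods = [
        Pod("kube-dns", "kube-system", "ReplicaSet"),
        Pod("kube-proxy", "kube-system", is_mirror=True),
        Pod("metrics", "kube-system", "ReplicaSet", has_pdb=True),
        Pod("prometheus-operator", "monitoring", "ReplicaSet"),
    ]
    print(blocking_pods(pods))
```

Under this rule, adding a PodDisruptionBudget for such a kube-system pod is an alternative to removing the pod.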

Thanks


r-divakaran-hrs

Most helpful comment

I think I found the cause. I am using kops to create my cluster, and in the instance group settings the minimum number of nodes was set to 15, which conflicted with cluster-autoscaler's attempts to remove nodes.
After I updated the nodes' minSize and maxSize in the kops config to match those in cluster-autoscaler, scale-down happened.
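The conflict is between two sets of bounds: cluster-autoscaler's own --nodes=5:15:... setting, and the minSize/maxSize of the kops instance group, which kops applies to the AWS Auto Scaling group. A small sketch that flags the mismatch, assuming the --nodes format min:max:name shown later in the thread. The helper names are hypothetical, and the group's max of 15 is an assumption, since only its min of 15 is stated:

```python
from __future__ import annotations

from typing import NamedTuple


class Bounds(NamedTuple):
    min_size: int
    max_size: int


def parse_nodes_flag(flag: str) -> tuple[str, Bounds]:
    """Split a --nodes=<min>:<max>:<name> argument into (name, Bounds)."""
    value = flag[len("--nodes="):] if flag.startswith("--nodes=") else flag
    min_s, max_s, name = value.split(":", 2)
    return name, Bounds(int(min_s), int(max_s))


def bound_mismatches(autoscaler: Bounds, group: Bounds) -> list[str]:
    """List the ways the autoscaler's bounds are looser than the group's own bounds."""
    problems = []
    if autoscaler.min_size < group.min_size:
        problems.append(
            f"autoscaler min {autoscaler.min_size} < group min {group.min_size}: "
            "scale-down below the group min will be rejected"
        )
    if autoscaler.max_size > group.max_size:
        problems.append(
            f"autoscaler max {autoscaler.max_size} > group max {group.max_size}: "
            "scale-up above the group max will be rejected"
        )
    return problems


if __name__ == "__main__":
    name, ca_bounds = parse_nodes_flag("--nodes=5:15:nodes.qa-lab.com")
    # kops instance group: minSize 15 is from the thread; maxSize 15 is an assumption
    kops_bounds = Bounds(min_size=15, max_size=15)
    for problem in bound_mismatches(ca_bounds, kops_bounds):
        print(f"{name}: {problem}")
```

Raising the autoscaler's min to 15 would also remove the mismatch, but it would disable scale-down entirely; the fix in the thread went the other way and aligned the kops bounds with the autoscaler's.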

r-divakaran-hrs on 11 Jun 2018


All 2 comments

I see this behaviour in the logs. These are my settings:

    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --nodes=5:15:nodes.qa-lab.com
      --logtostderr=true
      --stderrthreshold=info
      --v=4

But the logs show:

I0611 08:09:08.533810       1 factory.go:33] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"0a16f454-6c83-11e8-83c0-0a7944bf8e00", APIVersion:"v1", ResourceVersion:"32365261", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' Scale-down: removing empty node ip-a-b-c-d.eu-west-1.compute.internal
I0611 08:09:08.540473       1 delete.go:53] Successfully added toBeDeletedTaint on node ip-a-b-c-d.eu-west-1.compute.internal
E0611 08:09:08.617737       1 scale_down.go:641] Problem with empty node deletion: failed to delete ip-a-b-c-d.eu-west-1.compute.internal: ValidationError: Currently, desiredSize equals minSize (15). Terminating instance without replacement will violate group's min size constraint. Either set shouldDecrementDesiredCapacity flag to false or lower group's min size.
    status code: 400, request id: b75b2329-6d4e-11e8-8abc-51bdbac6ff0c
E0611 08:09:08.617781       1 static_autoscaler.go:350] Failed to scale down: <nil>
I0611 08:09:08.620770       1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1528704548 Effect:NoSchedule TimeAdded:<nil>} on node ip-a-b-c-d.eu-west-1.compute.internal
I0611 08:09:08.625658       1 delete.go:119] Successfully released toBeDeletedTaint on node ip-a-b-c-d.eu-west-1.compute.internal
I0611 08:09:08.625982       1 factory.go:33] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-a-b-c-d.eu-west-1.compute.internal", UID:"05cbc43c-3f1e-11e8-8ca5-0a003cf6c876", APIVersion:"v1", ResourceVersion:"32365311", FieldPath:""}): type: 'Warning' reason: 'ScaleDownFailed' failed to delete empty node: failed to delete ip-a-b-c-d.eu-west-1.compute.internal: ValidationError: Currently, desiredSize equals minSize (15). Terminating instance without replacement will violate group's min size constraint. Either set shouldDecrementDesiredCapacity flag to false or lower group's min size.
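The ValidationError comes from AWS rather than from cluster-autoscaler: the autoscaler asks the Auto Scaling group to terminate the instance and decrement its desired capacity, and AWS refuses because that would take desired capacity below the group's min size of 15. A simplified model of that check; the classes and function are hypothetical stand-ins for the TerminateInstanceInAutoScalingGroup call and its ShouldDecrementDesiredCapacity parameter, and the max of 15 is an assumption:

```python
from dataclasses import dataclass


class ValidationError(Exception):
    """Stand-in for the AWS ValidationError seen in the log."""


@dataclass
class AutoScalingGroup:
    # Simplified capacity settings of an EC2 Auto Scaling group
    min_size: int
    max_size: int
    desired: int


def terminate_instance(group: AutoScalingGroup, decrement_desired: bool) -> None:
    """Model of the min-size check when terminating one instance from the group."""
    if not decrement_desired:
        # Desired capacity is unchanged, so the group launches a replacement
        # and the cluster does not shrink.
        return
    if group.desired - 1 < group.min_size:
        raise ValidationError(f"Currently, desiredSize equals minSize ({group.min_size}).")
    group.desired -= 1


if __name__ == "__main__":
    # min 15 matches the log; max 15 is an assumption
    before = AutoScalingGroup(min_size=15, max_size=15, desired=15)
    try:
        terminate_instance(before, decrement_desired=True)
    except ValidationError as err:
        print("rejected:", err)

    after = AutoScalingGroup(min_size=5, max_size=15, desired=15)
    terminate_instance(after, decrement_desired=True)
    print("desired capacity now", after.desired)
```

Lowering the group's min size, as in the fix described in the most helpful comment, is what lets the decrement succeed.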

Thanks

r-divakaran-hrs on 11 Jun 2018