Autoscaler: scaling-down nodes that are running only system pods

Created on 11 May 2017 · 7 Comments · Source: kubernetes/autoscaler

I'm using:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:33:11Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

What I did:

  • I created a one-node GKE cluster, cluster autoscaling disabled
  • At the time of creation, all kube-system pods were running on the one node
  • I subsequently enabled cluster autoscaling with a minimum cluster size of 1 and a maximum of 5
  • All kube-system pods continued to run on the one original node
  • I ran a Job that taxed the system to the extent that CA added a new node
  • After the job was finished, all pods were removed, leaving only kube-system pods running

What I expected:

  • CA would scale down the cluster to its initial (minimum) size (one node) shortly after the taxing job was complete

What happened:

  • The cluster stayed at two nodes
  • When CA added a new node, the kube-system pods had been redistributed to run across both nodes
  • According to the Cluster Autoscaler FAQ, this happened because CA will not remove nodes running system pods

Suggestion:

  • If a cluster is above its configured minimum size only because kube-system pods are preventing it from shrinking, the superfluous node(s) should be automatically drained, the kube-system pods should be consolidated onto a single node (or rather, onto a number of nodes equal to the cluster's configured minimum), and the superfluous nodes should be removed from the cluster

All 7 comments

Thank you for the report. We are aware of this and we are thinking about the best approach to solve it.

cc: @MaciekPytel @fgrzadkowski

@adamrp It will be possible in the new version. However, you will have to create a PodDisruptionBudget for each of your system pods, such as dns, dashboard, and heapster (everything other than daemonset pods, kube-proxy, etc.).

Are there any best practices for PodDisruptionBudgets in kube-system? Thanks.
I think the node below should be scaled down.

Name:               gke-aaa-xxx-yyy-zzz-po-62b92cb1-0256
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/fluentd-ds-ready=true
                    beta.kubernetes.io/instance-type=n1-highcpu-4
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-nodepool=xxx-staging-pool
                    failure-domain.beta.kubernetes.io/region=asia-northeast1
                    failure-domain.beta.kubernetes.io/zone=asia-northeast1-c
                    kubernetes.io/hostname=gke-aaa-xxx-yyy-zzz-po-62b92cb1-0256
                    name=stg-node
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Tue, 31 Jul 2018 12:20:14 +0900
Taints:             <none>
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  KernelDeadlock       False   Tue, 31 Jul 2018 15:42:09 +0900   Tue, 31 Jul 2018 12:20:12 +0900   KernelHasNoDeadlock          kernel has no deadlock
  NetworkUnavailable   False   Tue, 31 Jul 2018 12:20:15 +0900   Tue, 31 Jul 2018 12:20:15 +0900   RouteCreated                 NodeController create implicit route
  OutOfDisk            False   Tue, 31 Jul 2018 15:42:09 +0900   Tue, 31 Jul 2018 12:20:14 +0900   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Tue, 31 Jul 2018 15:42:09 +0900   Tue, 31 Jul 2018 12:20:14 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 31 Jul 2018 15:42:09 +0900   Tue, 31 Jul 2018 12:20:14 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 31 Jul 2018 15:42:09 +0900   Tue, 31 Jul 2018 12:20:14 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 31 Jul 2018 15:42:09 +0900   Tue, 31 Jul 2018 12:20:34 +0900   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.22.0.52
  ExternalIP:
  Hostname:    gke-aaa-xxx-yyy-zzz-po-62b92cb1-0256
Capacity:
 cpu:                4
 ephemeral-storage:  36937420Ki
 hugepages-2Mi:      0
 memory:             3631468Ki
 pods:               110
Allocatable:
 cpu:                3920m
 ephemeral-storage:  12566689736
 hugepages-2Mi:      0
 memory:             2585964Ki
 pods:               110
System Info:
 Machine ID:                 a04db0d2e5ef3959f42ff7743ae7e79e
 System UUID:                A04DB0D2-E5EF-3959-F42F-F7743AE7E79E
 Boot ID:                    a68a519f-e2d3-47e5-9981-932cc5b52cef
 Kernel Version:             4.14.22+
 OS Image:                   Container-Optimized OS from Google
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.2
 Kubelet Version:            v1.10.4-gke.2
 Kube-Proxy Version:         v1.10.4-gke.2
PodCIDR:                     10.48.3.0/24
ProviderID:                  gce://xxxx-xxxx-xxxx/xxxx-xxxx-c/gke-aaa-xxx-yyy-zzz-po-62b92cb1-0256
Non-terminated Pods:         (5 in total)
  Namespace                  Name                                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                                              ------------  ----------  ---------------  -------------
  kube-system                gke-aaa-xxx-yyy-zzz-po-62b92cb1-0256    100m (2%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                metadata-agent-xxbk8                                              40m (1%)      0 (0%)      50Mi (1%)        0 (0%)
  kube-system                metrics-server-v0.2.1-7c88c7f9-zjsxf                              57m (1%)      152m (3%)   186Mi (7%)       436Mi (17%)
  logging                    fluentd-colopl-hq74b                                              100m (2%)     0 (0%)      800Mi (31%)      800Mi (31%)
  monitoring                 node-exporter-r842k                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       297m (7%)     152m (3%)
  memory    1036Mi (41%)  1236Mi (48%)
Events:     <none>

Any best practices PodDisruptionBudget for kube-system?

Only add them if you're sure those pods can be safely evicted :) E.g. creating a PDB for kube-dns with minAvailable > 50% should be OK in most cases. However, evicting all kube-dns pods at the same time can cause networking problems. In the case of metrics-server or heapster, there's only one replica, so a PDB that allows eviction will always carry the risk of downtime (loss of metrics) for a couple of minutes. It really depends on your setup whether that's acceptable or not.
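
As a rough illustration of the kube-dns case, a PodDisruptionBudget might look like the sketch below. It assumes the kube-dns pods carry the label `k8s-app: kube-dns` and that the cluster serves the `policy/v1` API (clusters as old as the ones in this thread would need `policy/v1beta1`); the name `kube-dns-pdb` is made up for the example. It uses an absolute `minAvailable: 1` rather than a percentage, because percentages are rounded up: with only two replicas, anything above 50% would require both pods to stay up and would block eviction entirely.

```yaml
# Sketch only: keep at least one kube-dns replica running during a node drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb          # hypothetical name
  namespace: kube-system
spec:
  minAvailable: 1             # at least one DNS pod must stay available
  selector:
    matchLabels:
      k8s-app: kube-dns       # assumed label; check with `kubectl get pods -n kube-system --show-labels`
```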

In this particular case your node is running metrics-server. Restarting it would prevent all HPAs in your cluster from autoscaling (and make them set error status, emit error events, etc.) for a few minutes. As @aleksandra-malinowska wrote, it's really your decision whether that's acceptable or not.

In general, kube-dns is the only system pod that comes to my mind that has multiple replicas. Every other system pod is a singleton, and restarting it will likely cause some kind of temporary disruption in the cluster. If it's a test cluster, it may be OK to just create a PDB for every kube-system pod. If it's production, you probably need to make a case-by-case decision based on which services are critical for your workloads.
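
For a test cluster, a PDB for a singleton such as metrics-server could be sketched as follows. The label `k8s-app: metrics-server`, the `policy/v1` API version, and the name `metrics-server-pdb` are assumptions for illustration, not taken from this thread. Setting `maxUnavailable: 1` explicitly permits evicting the only replica, which accepts the brief loss of metrics described above.

```yaml
# Sketch only: allow CA to evict a single-replica system pod (test clusters).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: metrics-server-pdb        # hypothetical name
  namespace: kube-system
spec:
  maxUnavailable: 1               # the single replica may be evicted
  selector:
    matchLabels:
      k8s-app: metrics-server     # assumed label; verify on your cluster
```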

@aleksandra-malinowska @MaciekPytel Thank you for such a detailed explanation. Maybe the best approach in my case is not to set any PDB for kube-system.

Could I ask one more question? I noticed this in the FAQ entry "What types of pods can prevent CA from removing a node?":

Pods with restrictive PodDisruptionBudget.
Kube-system pods that:
    are not run on the node by default, *
    don't have PDB or their PDB is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *
Pods with local storage. *

What does "local storage" actually mean here? Do hostPath and emptyDir volumes count as local storage?

It does. If you have a pod that uses local storage, but you want to allow CA to move it around you can set "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation on it.
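
A minimal sketch of where that annotation goes, assuming the pod is managed by a Deployment: it belongs on the pod template, so every pod the controller creates carries it. The Deployment name, labels, image, and volume below are hypothetical placeholders; only the annotation key and value come from the comment above.

```yaml
# Sketch only: a pod using emptyDir (local storage) that CA is allowed to evict.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scratch-worker                  # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scratch-worker
  template:
    metadata:
      labels:
        app: scratch-worker
      annotations:
        # Tells Cluster Autoscaler this pod may be evicted despite its local storage.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
        - name: worker
          image: busybox:1.36           # placeholder image
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: scratch
              mountPath: /scratch
      volumes:
        - name: scratch
          emptyDir: {}                  # counts as local storage; contents are lost on eviction
```

Anything written to the emptyDir volume is lost when the pod is moved, so the annotation only makes sense for data that can be regenerated.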
