Autoscaler: Scale-down causes downtime for pods being moved to another node (AWS EKS)

Created on 20 May 2019  ·  28 Comments  ·  Source: kubernetes/autoscaler

I have set up a Kubernetes cluster using EKS. Cluster Autoscaler (CA) is configured to increase or decrease the number of nodes based on resource availability for pods. During scale-down, CA terminates a node before the pods on it have been moved to another node, so the pods are only rescheduled after the node has been terminated. As a result, there is some downtime until the rescheduled pods become healthy on another node.

How can I avoid the downtime by ensuring that the pods get scheduled on another node before the node gets terminated?

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.12.3
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
            - --balance-similar-node-groups=true
          env:
            - name: AWS_REGION
              value: eu-central-1
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/kubernetes/pki/ca.crt"

All 28 comments

@niteshsakhiya can you take a look at this issue?
https://github.com/kubernetes/autoscaler/issues/1907

We have some workarounds, but they are neither elegant nor optimized.

  1. Make the node unschedulable.
  2. Add a timeout, since the upstream service controller takes up to 100s to pick up the node status.

https://github.com/Jeffwan/autoscaler/commit/0c02d5bed0d8555187a2b1b289e1044c6f9e2b5c#diff-5f36cd12baa8998cd7f2ba7c4a00cbc6

/sig aws

@Jeffwan
I checked the issue.

  1. Sorry, I am not sure how making the node unschedulable would help here.
  2. The timeout is already set to 10 minutes (the default).

The Kubernetes service controller watches for node changes. On every iteration it checks the nodes, and if a node is unschedulable, it calls the cloud provider method to remove that node from the load balancer. If the node only has a taint, which is what CA applies today, the AWS cloud provider won't remove it from the load balancer, and that is not what we expect. That's why I make the node unschedulable. There is more discussion in the thread I linked, about an elegant long-term fix.
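To make the distinction concrete, here is a rough sketch of the two node states being compared. The node name and the taint value are made up for illustration; the taint key `ToBeDeletedByClusterAutoscaler` is the one CA applies before scale-down, and `spec.unschedulable` is the field that cordoning a node sets.

```yaml
# State 1: node tainted by CA before scale-down. It is not marked unschedulable.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.eu-central-1.compute.internal   # hypothetical node name
spec:
  taints:
    - key: ToBeDeletedByClusterAutoscaler
      value: "1558341000"                            # hypothetical timestamp value
      effect: NoSchedule
---
# State 2: the same node after being cordoned (for example by `kubectl cordon`).
# This is the state the comment above says the service controller reacts to.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.eu-central-1.compute.internal   # hypothetical node name
spec:
  unschedulable: true
```

These are fragments for reading, not manifests to apply; a real Node object carries many more fields.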

This is something that interests me. Can't CA drain a node before actually terminating it?

CA drains (or at least, should drain) a node before terminating it. It uses the eviction API, and if an eviction can't be granted, it backs off from deleting that node.

Please, can someone explain how CA can drain a specific node up front, before the AWS ASG decides exactly which node to terminate on a scale-down event?

For example, when CA changes the desired number of nodes from 3 to 2 and the ASG then starts to terminate a random node, how does CA know which node to drain?

CA always requests deleting a specific node. This is a basic requirement for implementing support for a new cloud provider. My understanding is that so far it worked on AWS as well.

Relevant API call: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L242

@Jeffwan can you explain what's going on here? This doesn't sound like a Cluster Autoscaler issue, right?

@aleksandra-malinowska OK, so it looks like upon termination of a specific instance we decrement the desired number. Now it makes sense. Thanks.

> Please, can someone explain how CA can drain a specific node up front, before the AWS ASG decides exactly which node to terminate on a scale-down event?
>
> For example, when CA changes the desired number of nodes from 3 to 2 and the ASG then starts to terminate a random node, how does CA know which node to drain?

I was off for a few days and have only just seen the message.
As Alexa said, "CA always requests deleting a specific node." The ASG won't terminate a random node. Instead, CA drains a specific node and the ASG terminates that node by its ID.

The root problem of this issue is that CA uses the eviction API rather than the drain operation, but the service controller uses the unschedulable marking on a node to remove it from load balancer endpoints. Only a drain marks the node unschedulable; the eviction API does not.

@Jeffwan thanks for the explanation. Makes sense.

Hi, so is there an easy solution to prevent downtime?
I have a 1-replica deployment, and during scale-down it doesn't wait for the new replica to be ready.

Unfortunately the solution is to either always run at least 2 replicas or disable scale-down that would restart this pod (using PDB or safe-to-evict annotation). CA is not migrating the pod, it's really only killing it and relying on Kubernetes to restart it. This mechanism can't be easily extended to create-before-delete.
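The two ways of blocking scale-down mentioned above could look roughly like the following. This is a minimal sketch, not a tested configuration: the name `my-app`, its labels, and the image are placeholders, and `policy/v1` assumes Kubernetes 1.21 or newer (older clusters would use `policy/v1beta1`).

```yaml
# Option 1: annotate the pod template so CA will not evict these pods,
# which keeps the node they run on from being scaled down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: my-app
          image: example.com/my-app:1.0   # placeholder image
---
# Option 2: a PodDisruptionBudget that allows no voluntary disruption,
# so evictions of the single replica are refused.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                  # hypothetical name
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
```

Either option trades utilization for availability: the node hosting the pod stays up even when it is otherwise underused. A PDB with `maxUnavailable: 0` also blocks manual drains of that node, so only one of the two is normally needed.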

@Jeffwan I'm having problems with the ASG randomly terminating EC2 instances, which I described in issue https://github.com/kubernetes/autoscaler/issues/2787. If I understand correctly, you said that this behavior can't happen. Is that correct?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

This issue is important.

Hi all,

The workarounds mentioned above seem to reduce node utilization.

> Unfortunately the solution is to either always run at least 2 replicas or disable scale-down that would restart this pod (using PDB or safe-to-evict annotation). CA is not migrating the pod, it's really only killing it and relying on Kubernetes to restart it. This mechanism can't be easily extended to create-before-delete.

Has this problem been solved now? Will I encounter downtime when the CA triggers the scale-down action?

If not, are there other workarounds that avoid the downtime without affecting node utilization?

Hi everyone,
Is there any update regarding this issue? Running at least 2 replicas didn't solve the problem; my app is often unavailable when CA is scaling up or down!

@dhouhamaa This can happen if both of your pods happen to be scheduled on the same node. You can define a PodDisruptionBudget that allows only one of your 2 pods to be down at any given time. CA will respect the PDB and never restart both at the same time.
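A PDB along those lines might look like this. It is only a sketch: the `app: my-app` label and the object name are placeholders that must match your own Deployment, and `policy/v1` assumes Kubernetes 1.21 or newer.

```yaml
# Allow at most one pod of the app to be evicted at a time.
# With 2 replicas, at least one pod stays available during scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app             # must match the Deployment's pod labels
```

With two replicas, `maxUnavailable: 1` is equivalent to `minAvailable: 1`; the second eviction is refused until the first replacement pod is ready.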

@maciaszczykm But running extra replicas is not optimal in my case: I have more than one app running in each namespace and therefore multiple pods already, so setting 2 replicas for each is not the best approach, I guess. I was wondering whether there is another workaround that avoids downtime without needing the replicas.

There is no workaround I can think of, and I don't think there will be one anytime soon: you can't remove a node without stopping the pods running on it. So your options are really either to prevent scale-down of nodes that run single pods (using a PDB, annotation, etc.), which will hurt utilization, or to run multiple replicas of each app (along with a PDB) to allow scale-down without disruption.

CA doesn't have functionality to add a replacement pod before deleting a node, and I don't think we're going to add it in the foreseeable future (probably never). Fundamentally, CA operates on pods and nodes; it doesn't really have an abstraction for a deployment or any other collection of pods. Adding something like this would probably require huge changes, and it may be very hard to do without crippling CA performance in very large clusters.

It is possible to isolate pods from a ReplicaSet by changing labels. Pods controlled by a Deployment's ReplicaSet carry a pod-template-hash label. If this label is removed or its value changed, the pod is isolated from the ReplicaSet, and the ReplicaSet creates new pods to maintain the number of replicas. This provides an easy workaround to proactively add the replacement pods before deleting the node. The sequence of steps at scale-down can be:

  1. Cordon the node.
  2. Isolate pods on the node from their ReplicaSet by removing the pod-template-hash label.
  3. Wait for some time for the replacement pods to become ready. This needs to be configurable.
  4. Evict the pods on the node.
  5. Delete the node from the ASG.

@MaciekPytel, any thoughts on this?

Pardon any naivete on my part, but it seems this is the correct cause of the issue:

> The root problem of this issue is that CA uses the eviction API rather than the drain operation, but the service controller uses the unschedulable marking on a node to remove it from load balancer endpoints. Only a drain marks the node unschedulable; the eviction API does not.

Shouldn't the solution be to switch to the Drain API?

https://stackoverflow.com/questions/57189208/what-are-the-api-involved-during-kubectl-cordon-and-drain-command

Checking back in on this persistent issue.
