Autoscaler: Scale-down causes downtime for pods that have to move to another node (AWS EKS)

Created on 20 May 2019 · 28 comments · Source: kubernetes/autoscaler

I have set up a Kubernetes cluster on EKS. Cluster Autoscaler (CA) is configured to increase or decrease the number of nodes based on the resources available for pods. During scale-down, CA terminates a node before the pods on it have been moved to another node, so the pods are only scheduled elsewhere after the node is gone. As a result, there is some downtime until the rescheduled pods become healthy on another node.

How can I avoid the downtime by ensuring that the pods get scheduled on another node before the node gets terminated?

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.12.3
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
            - --balance-similar-node-groups=true
          env:
            - name: AWS_REGION
              value: eu-central-1
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/kubernetes/pki/ca.crt"

area/provider/aws

niteshsakhiya

Most helpful comment

Can someone please explain how CA can drain a specific node up front, before the AWS ASG decides exactly which node to terminate on a scale-down event?

For example, when CA changes the desired node count from 3 to 2 and the ASG then starts terminating a random node, how does CA know which node to drain?

I was off for a few days and have only just seen this message.
As Alexa said, "CA always requests deleting a specific node." The ASG won't terminate a node at random. Instead, CA drains a specific node and the ASG terminates that node by its ID.

The root problem in this issue is that CA uses the eviction API rather than a drain, while the service controller relies on the node being marked unschedulable to remove it from the load balancer endpoints. Only a drain marks the node unschedulable; the eviction API does not.

Jeffwan on 10 Jul 2019

👍5

All 28 comments

@niteshsakhiya can you take a look at this issue?
https://github.com/kubernetes/autoscaler/issues/1907

We have some solutions, but they are neither elegant nor optimized:

Mark the node unschedulable.
Add a timeout, since the upstream service controller takes up to 100s to pick up the node status.

https://github.com/Jeffwan/autoscaler/commit/0c02d5bed0d8555187a2b1b289e1044c6f9e2b5c#diff-5f36cd12baa8998cd7f2ba7c4a00cbc6

Jeffwan on 20 May 2019

/sig aws

Jeffwan on 20 May 2019

@Jeffwan
I checked the issue.

Sorry, I am not sure how making the node unschedulable would help here.
The timeout is already set to 10 minutes (the default).

niteshsakhiya on 27 May 2019

The Kubernetes service controller watches for node changes. On every iteration it checks the nodes, and if a node is unschedulable, it calls the cloud provider method to remove that node from the load balancer. If the node only has a taint, like the one CA applies now, the AWS cloud provider won't remove it from the load balancer, which is not what we expect. That's why I mark the node unschedulable. There is more discussion in the thread I linked, about an elegant long-term fix.
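
To illustrate the difference described above, here is a minimal sketch of the two Node states. It is not taken from a real cluster: the node name `example-node` and the taint value are hypothetical, and `ToBeDeletedByClusterAutoscaler` is, as far as I know, the taint key CA puts on a node it is about to remove.

```yaml
# Node that is only tainted by CA (hypothetical name and taint value).
# Pods are no longer scheduled here, but spec.unschedulable is not set.
apiVersion: v1
kind: Node
metadata:
  name: example-node
spec:
  taints:
    - key: ToBeDeletedByClusterAutoscaler
      value: "1558310400"   # hypothetical timestamp
      effect: NoSchedule
---
# Same node after being cordoned (for example by `kubectl cordon`
# or as the first step of `kubectl drain`).
apiVersion: v1
kind: Node
metadata:
  name: example-node
spec:
  unschedulable: true
```

Per the comment above, it is the second state that the service controller reacts to when it removes a node from the load balancer.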

Jeffwan on 29 May 2019

This is something that interests me. Can’t CA drain a node before actually terminating it?

gustavosoares on 7 Jun 2019

This is something that interests me. Can’t CA drain a node before actually terminating it?

CA drains (or at least, should drain) a node before terminating it. It uses the eviction API, and if an eviction can't be granted, it backs off from deleting that node.

aleksandra-malinowska on 7 Jun 2019

Can someone please explain how CA can drain a specific node up front, before the AWS ASG decides exactly which node to terminate on a scale-down event?

For example, when CA changes the desired node count from 3 to 2 and the ASG then starts terminating a random node, how does CA know which node to drain?

Constantin07 on 3 Jul 2019

Can someone please explain how CA can drain a specific node up front, before the AWS ASG decides exactly which node to terminate on a scale-down event?

For example, when CA changes the desired node count from 3 to 2 and the ASG then starts terminating a random node, how does CA know which node to drain?

CA always requests deleting a specific node. This is a basic requirement for implementing support for a new cloud provider. My understanding is that so far it worked on AWS as well.

aleksandra-malinowska on 3 Jul 2019

Relevant API call: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L242

@Jeffwan can you explain what's going on here? This doesn't sound like a Cluster Autoscaler issue, right?

aleksandra-malinowska on 3 Jul 2019

👍1

@aleksandra-malinowska OK, so it looks like when a specific instance is terminated, the desired count is decremented as well. Now it makes sense. Thanks.

Constantin07 on 3 Jul 2019

Can someone please explain how CA can drain a specific node up front, before the AWS ASG decides exactly which node to terminate on a scale-down event?

For example, when CA changes the desired node count from 3 to 2 and the ASG then starts terminating a random node, how does CA know which node to drain?

Jeffwan on 10 Jul 2019

👍5

@Jeffwan thanks for the explanation. Makes sense.

Constantin07 on 10 Jul 2019

Hi, so is there an easy solution to prevent the downtime?
I have a 1-replica deployment, and during scale-down CA does not wait for the new replica to be ready.

mlewiarz on 2 Nov 2019

Unfortunately, the solution is either to always run at least 2 replicas or to disable any scale-down that would restart this pod (using a PDB or the safe-to-evict annotation). CA does not migrate the pod; it only kills it and relies on Kubernetes to restart it. This mechanism can't easily be extended to create-before-delete.
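
As a rough sketch of the annotation option mentioned above, the pod template can carry the `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation. The Deployment name, labels, and image below are hypothetical placeholders, not taken from this issue.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Asks CA not to scale down a node while this pod runs on it.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: my-app
          image: example.com/my-app:1.0   # hypothetical image
```

The trade-off is the one stated in the comment: a node hosting such a pod is kept around, which can lower utilization.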

MaciekPytel on 4 Nov 2019

@Jeffwan I'm having problems with the ASG terminating EC2 instances at random; I described this in issue https://github.com/kubernetes/autoscaler/issues/2787. If I understand correctly, you said that this behavior can't happen. Is that correct?

caiohasouza on 31 Jan 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 30 Apr 2020

/remove-lifecycle stale

spanktar on 28 May 2020

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 26 Aug 2020

/remove-lifecycle stale

This issue is important.

spanky-medal on 1 Sep 2020

Hi all,

The workarounds mentioned above seem likely to decrease node utilization.

Unfortunately, the solution is either to always run at least 2 replicas or to disable any scale-down that would restart this pod (using a PDB or the safe-to-evict annotation). CA does not migrate the pod; it only kills it and relies on Kubernetes to restart it. This mechanism can't easily be extended to create-before-delete.

Has this problem been solved by now? Will I encounter downtime when CA triggers a scale-down?

If it has not been solved, are there other workarounds that avoid the downtime without affecting node utilization?

neuqlz on 16 Sep 2020

Hi everyone,
Is there any update on this issue? Running at least 2 replicas didn't solve the problem; my app is often unavailable when CA is scaling up or down!

dhouhamaa on 7 Oct 2020

@dhouhamaa This can happen if both of your pods happen to be scheduled on the same node. You can define a PodDisruptionBudget to allow only one of your 2 pods to be down at any given time - CA will respect the PDB and never restart both at the same time.
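
A minimal sketch of such a PodDisruptionBudget, assuming the two pods carry a hypothetical `app: my-app` label (the name and label are placeholders, not from this thread):

```yaml
apiVersion: policy/v1        # policy/v1beta1 on clusters older than Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb           # hypothetical name
spec:
  # At most one matching pod may be evicted at a time.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app            # must match the labels on the pods to protect
```

With 2 replicas, `minAvailable: 1` would express the same constraint.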

MaciekPytel on 7 Oct 2020

@maciaszczykm But running extra replicas is not optimal in my case: I have more than one app running in each namespace, and thus multiple pods already, so setting 2 replicas for each is not the best way, I guess. I was looking for another workaround that avoids downtime without needing the replicas.

dhouhamaa on 7 Oct 2020

There is no workaround I can think of, and I don't think there will be one anytime soon: you can't remove a node without stopping the pods running on it. So your options are either to prevent scale-down of nodes that run single pods (using a PDB, an annotation, etc.), which will hurt utilization, or to run multiple replicas of each app (along with a PDB) to allow scale-down without disruption.

CA doesn't have functionality to add a replacement pod before deleting a node, and I don't think we're going to add it in the foreseeable future (probably never). Fundamentally, CA operates on pods and nodes; it doesn't have an abstraction for a deployment or any other collection of pods. Adding something like this would probably require huge changes, and it may be very hard to do without crippling CA performance in very large clusters.

MaciekPytel on 7 Oct 2020

👍1

It is possible to isolate pods from a ReplicaSet by changing their labels. A ReplicaSet adds a pod-template-hash label to all the pods it controls. If this label is removed or its value is changed, the pod is isolated from the ReplicaSet, and the ReplicaSet creates new pods to maintain the number of replicas. This provides an easy way to proactively add replacement pods before deleting the node. The sequence of steps at scale-down could be:

Cordon the node.
Isolate the pods on the node from their ReplicaSet by removing the pod-template-hash label.
Wait for some time for the replacement pods to become ready. This needs to be configurable.
Evict the pods on the node.
Delete the node from the ASG.
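
The label mechanism behind the second step can be sketched with two fragments. This is an illustration only: the names, the hash value `5d4f8c7b9`, and the `app: my-app` label are hypothetical, and the ReplicaSet fragment omits its pod template.

```yaml
# ReplicaSet created by a Deployment (fragment): the selector
# includes the pod-template-hash label.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-5d4f8c7b9
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
      pod-template-hash: 5d4f8c7b9
---
# Pod after the pod-template-hash label has been removed (fragment):
# it no longer matches the selector above, so the ReplicaSet
# creates a replacement to get back to 2 replicas.
apiVersion: v1
kind: Pod
metadata:
  name: my-app-5d4f8c7b9-abcde
  labels:
    app: my-app
```

Because the pod keeps its `app: my-app` label, a Service that selects on that label alone would still route traffic to it until it is evicted.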

zahid0 on 7 Nov 2020

It is possible to isolate pods from a ReplicaSet by changing their labels. A ReplicaSet adds a pod-template-hash label to all the pods it controls. If this label is removed or its value is changed, the pod is isolated from the ReplicaSet, and the ReplicaSet creates new pods to maintain the number of replicas. This provides an easy way to proactively add replacement pods before deleting the node. The sequence of steps at scale-down could be:

Cordon the node.
Isolate the pods on the node from their ReplicaSet by removing the pod-template-hash label.
Wait for some time for the replacement pods to become ready. This needs to be configurable.
Evict the pods on the node.
Delete the node from the ASG.

@MaciekPytel, any thoughts on this?

zahid0 on 18 Nov 2020

Pardon any naivete on my part, but it seems this is the correct cause of the issue:

The root problem in this issue is that CA uses the eviction API rather than a drain, while the service controller relies on the node being marked unschedulable to remove it from the load balancer endpoints. Only a drain marks the node unschedulable; the eviction API does not.

Shouldn't the solution be to switch to the Drain API?

https://stackoverflow.com/questions/57189208/what-are-the-api-involved-during-kubectl-cordon-and-drain-command

spanky-medal on 30 Dec 2020

Checking back in on this persistent issue.

spanky-medal on 16 Feb 2021
