Autoscaler: Feature Request: Cluster Autoscaler - scale up Deployment if PDB doesn't meet the scale down/drain requirements

Created on 4 Oct 2018 · 17 Comments · Source: kubernetes/autoscaler

I'm looking over the cluster-autoscaler source to find where this could be added. I'm not sure I have the Go chops to submit a PR, but I wanted to share this as a feature request.

I have a very RAM-heavy Java microservice app (20 GB with no load). In prod we'll run two of every service for HA. However, in our dev/test/stage environments it would be too expensive to run more than one of each service.

I have each microservice in a deployment, with a Horizontal Pod Autoscaler to handle usage spikes / performance testing.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: service-autoscaler
  namespace: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 700

In the initial deployment to the dev environment, the cluster autoscaler caused significant occasional downtime as it rebalanced the cluster. These Java microservices sometimes take 4 minutes to become ready. To avoid this, I applied a Pod Disruption Budget to each service.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: service-budget
  namespace: app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      run: service

So far so good. However, since the Deployments have only 1 replica, evicting the only pod would always violate minAvailable: 1, so this permanently blocks the cluster autoscaler from scaling down.
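The blocking follows from simple arithmetic: with minAvailable, a voluntary eviction is only permitted while the number of healthy pods exceeds the minimum. A minimal sketch of that calculation, assuming an integer minAvailable (the function name is made up for illustration, and percentages and maxUnavailable are ignored):

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Number of pods that may be voluntarily evicted right now."""
    return max(0, current_healthy - min_available)


# One replica guarded by minAvailable: 1 leaves no room for an eviction.
print(disruptions_allowed(current_healthy=1, min_available=1))  # 0
# With two healthy replicas, one of them may be evicted.
print(disruptions_allowed(current_healthy=2, min_available=1))  # 1
```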

There is an issue on Kubernetes relating to kubectl drain in this scenario, but it was closed due to inactivity. It seems this use case for a PDB wasn't recommended because a single replica isn't HA. The argument here is that Kubernetes should keep even a single replica as available as possible, and that downtime should be avoided at all costs even without a PDB.

The feature would work as follows:

  • Check for a PDB.
  • If a PDB is blocking a drain, scale the Deployment up until it meets the PDB criteria, ensuring the new pods are scheduled on ideal nodes.
  • Wait for the new pods to become ready.
  • Drain the node.
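The steps above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: Pod, Deployment and drain_with_surge are hypothetical names, not cluster-autoscaler code, readiness is simulated rather than polled, and scaling the Deployment back down afterwards is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    node: str
    ready: bool = True


@dataclass
class Deployment:
    name: str
    min_available: int  # taken from the PDB guarding this Deployment
    pods: list = field(default_factory=list)

    def ready_count(self) -> int:
        return sum(1 for p in self.pods if p.ready)


def drain_with_surge(deploy: Deployment, node: str, target_node: str) -> bool:
    """Evict deploy's pods from `node`, adding a replica first if the PDB would block."""
    victims = [p for p in deploy.pods if p.node == node]
    for i, victim in enumerate(victims):
        # Steps 1-2: if the eviction would drop below minAvailable,
        # start an extra replica on another node first.
        if deploy.ready_count() - 1 < deploy.min_available:
            surge = Pod(name=f"{deploy.name}-surge-{i}", node=target_node, ready=False)
            deploy.pods.append(surge)
            # Step 3: a real implementation would wait for readiness with a
            # timeout; here the new pod simply becomes ready.
            surge.ready = True
        # Step 4: evict only if the budget is still respected.
        if deploy.ready_count() - 1 < deploy.min_available:
            return False
        deploy.pods.remove(victim)
    return True


svc = Deployment("service", min_available=1, pods=[Pod("service-0", "node-a")])
print(drain_with_surge(svc, "node-a", "node-b"))  # True
print([(p.name, p.node) for p in svc.pods])       # [('service-surge-0', 'node-b')]
```

At no point in this run does the number of ready pods fall below minAvailable, which is the property the request is after.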

Maybe this could result in a new command argument that forces this behavior on all Deployments with 1 replica, without the need for a pdb.

While searching for a place to implement this, I found that cluster-autoscaler/simulator/drain.go could be a starting point. From there I'm not sure how this could be implemented, but if there's any guidance on where I could begin testing an implementation, I'll gladly attempt to figure this out.

One issue I see from the outset is determining whether the pod belongs to a Deployment, and whether there is a Horizontal Pod Autoscaler to consult to determine if scaling up is acceptable.

cluster-autoscaler kind/feature

All 17 comments

If there is an HPA scaling the deployment in question, I think this would mean that cluster autoscaler and Horizontal Pod Autoscaler would have to communicate, as otherwise HPA would possibly scale down the deployment that CA just scaled up to be able to drain the node and not violate the PDB.
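The conflict can be illustrated with the replica formula from the HPA documentation, desired = ceil(currentReplicas × currentMetric / targetMetric). The sketch below is a simplification: it ignores the tolerance band, scale-down stabilization and unready pods. The 200% load figure is invented for the example; the target of 700 and the 1 to 10 replica range come from the manifest in the original post.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_utilization: float,
                         target_utilization: float,
                         min_replicas: int, max_replicas: int) -> int:
    """Simplified HPA replica calculation, clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))


# The cluster autoscaler has just surged the Deployment from 1 to 2 replicas,
# but the average utilization (200%) is far below the 700% target.
print(hpa_desired_replicas(2, 200, 700, min_replicas=1, max_replicas=10))  # 1
```

With these numbers the HPA would want to go back to one replica, which is why some coordination between the two controllers would be needed.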
@MaciekPytel @aleksandra-malinowska do you think there is another way to accommodate this use case?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Can we keep it fresh please?

Two thoughts:

  1. If any downtime of the application is unacceptable, even a couple of minutes, running only a single replica of it is very risky. I'd suggest reconsidering this setup before it fails.

  2. Supposing this behavior could be useful for improving the availability of a best-effort application, I wonder if it could be incorporated into eviction. Once an eviction object is created while the PDB is satisfied, the owner controller could start a new replica while the old one is still in its graceful termination period. That way, the problem would be solved regardless of HPA, and kubectl drain would benefit from it as well.

I think this is most useful in a non-prod environment. As @blandman noted, his use case for wanting this feature is a dev environment where it might be too costly to run multiple replicas of an app, coupled with the need for autoscaling the cluster.

Our use case is similar. We have a non-prod Kubernetes cluster in AWS, complete with the cluster autoscaler. We want the cluster autoscaler to be able to aggressively scale in nodes, but we need it to do so with as little downtime as possible.

It would be nice if the pod disruption budget had a maxSurge property similar to rolling updates. Then, on a voluntary eviction, Kubernetes could use the max surge to scale up temporarily, relocate the pod, and then continue with the scale-down of the nodes.
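To make the idea concrete, here is a sketch of how such a maxSurge value might be resolved. PDBs have no maxSurge field, so everything here is hypothetical; the resolution rule (an absolute number, or a percentage of replicas rounded up) is borrowed from how Deployments treat maxSurge in rolling updates.

```python
import math


def resolve_max_surge(max_surge, replicas: int) -> int:
    """Turn an int or a percentage string such as "25%" into a pod count."""
    if isinstance(max_surge, str) and max_surge.endswith("%"):
        return math.ceil(replicas * int(max_surge[:-1]) / 100)
    return int(max_surge)


def surge_ceiling(replicas: int, max_surge) -> int:
    """Highest temporary replica count allowed while a pod is being relocated."""
    return replicas + resolve_max_surge(max_surge, replicas)


print(surge_ceiling(1, 1))      # 2
print(surge_ceiling(1, "25%"))  # 2
print(surge_ceiling(4, "25%"))  # 5
```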

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

This sort of logic is also useful in real production scenarios. If I have 2 pods close to targetAverageUtilization and I start a drain, cutting one off means I'm now close to 50% underprovisioned and will see severe degradation until the HPA kicks in.
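The arithmetic behind that claim, as a small sketch. It assumes the load stays constant and spreads evenly over the remaining pods; the 70% figure is only an example.

```python
def utilization_after_eviction(replicas: int, utilization_pct: float,
                               evicted: int = 1) -> float:
    """Average per-pod utilization once `evicted` pods are gone."""
    remaining = replicas - evicted
    if remaining <= 0:
        raise ValueError("no replicas left to serve traffic")
    return utilization_pct * replicas / remaining


# Two pods at 70%: the survivor has to absorb 140% of its requested CPU.
print(utilization_after_eviction(2, 70))             # 140.0
# The same eviction out of ten pods is far less dramatic.
print(round(utilization_after_eviction(10, 70), 1))  # 77.8
```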

One solution is to run a high enough minReplicas, but I'd love Kubernetes to help me not spend more money at low traffic / idle.

A more comprehensive feature that could help is, in pseudo-code:

drainBlocked = pdb.maxUnavailable == 0 || pdb.minAvailable == hpa.currentReplicas
canScaleUp = hpa.maxReplicas > hpa.currentReplicas

if drainBlocked and canScaleUp
  newPod = scaleUp(deploy)
  waitForReady(newPod, SOME_TIMEOUT)
  terminate(oldPod)
else
  fail "drain blocked"

In plain English: when a PDB is blocking a drain but a scale-up would be allowed by the HPA, scale up and ensure readiness before terminating. Only block a drain when at maxReplicas.

This gives us a way to communicate the desire for no degraded capacity or no downtime for sensitive applications (maxUnavailable: 0 or minAvailable == currentReplicas). At the same time, we keep the ability to completely block drains for applications that can't tolerate voluntary disruptions or multiple coexisting instances, by running them at maxReplicas. The desire to cover the former comes from my and @blandman's scenarios. The desire to cover the latter can be inferred from the fact that the PDB docs cover it, and I've seen applications that can't tolerate multiple coexisting instances (Metabase comes to mind, not sure if that is still true).
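A runnable rendering of the pseudo-code's decision, as a sketch only. The PDB and HPA classes are stand-ins holding just the fields the pseudo-code reads, and the explicit "evict" result for a drain that is not blocked is an assumption added here, since the pseudo-code only spells out the blocked case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDB:
    min_available: Optional[int] = None
    max_unavailable: Optional[int] = None


@dataclass
class HPA:
    current_replicas: int
    max_replicas: int


def plan_eviction(pdb: PDB, hpa: HPA) -> str:
    """Decide how to handle a pod that sits on a node being drained."""
    drain_blocked = (pdb.max_unavailable == 0
                     or pdb.min_available == hpa.current_replicas)
    can_scale_up = hpa.max_replicas > hpa.current_replicas
    if not drain_blocked:
        return "evict"             # the budget already allows the eviction
    if can_scale_up:
        return "surge-then-evict"  # start a replica, wait for ready, then terminate
    return "blocked"               # at maxReplicas: the drain stays blocked


print(plan_eviction(PDB(min_available=1), HPA(current_replicas=1, max_replicas=10)))
# surge-then-evict
print(plan_eviction(PDB(max_unavailable=0), HPA(current_replicas=10, max_replicas=10)))
# blocked
```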

The logic I describe above is also useful for binpacking. Binpacking happens a lot in the ramp down of traffic, and it's more likely that it will catch workloads at low Pod count.

Terminating a Pod to reschedule it for binpacking when we have 2 pods means we lose 50% compute capacity for a period of time and will cause degradation for workloads running at anything close to or above 50% CPU utilization.

Having the ability to scale up to respect maxUnavailable or minAvailable means we can run workloads with minAvailable: 2 and still binpack effectively at idle, with no service degradation.

This would be particularly useful for us. We have a similar issue with a RAM-heavy app. Running multiple replicas in prod is fine, but it's wasteful in Dev/Test as we don't need the availability.

I wish you could specify with a PDB that a drain should use a strategy similar to RollingUpdate in a Deployment: the old pod isn't removed until a new one is created and ready.

This has been discussed in https://github.com/kubernetes/kubernetes/issues/66811, and it is tricky due to the disconnected structure of the Kubernetes components. For example, the scheduler has no idea about the PDB or availability config, so it can only be reactive and cannot anticipate a pod disappearing.
