@kubernetes/sig-autoscaling-proposals
@kubernetes/sig-autoscaling-reviews
The current best practice to autoscale pods based on the number of messages in a queue is to use an HPA with a metric of type External.
While this is good in theory, it's somewhat inconvenient in practice.
As a developer, I'd rather not expose the underlying message broker. In fact, I'd probably create an abstraction to access it.
If later on, I wish to replace RabbitMQ with Kafka, I can make the change and be confident that nothing will break because the public interface hasn't changed.
So while _External_ might have specific use cases, it's a less convenient way to scale my (micro) services since it exposes the message broker directly.
A more attractive option is to wrap the queue into a microservice.
Imagine you have a publisher, a consumer and a queue that decouples the two.
The publisher sends messages to the queue, the consumer picks them up at its own pace.
Typically, I'd expose custom metrics for the queue from the consumer, the publisher or both since they already know how to talk to the queue.
If you access /metrics on the consumer, you should expect a custom metric that contains the length of the queue (amongst other things). If you have three pods and there are four messages in the queue, you could request the /metrics endpoint on each of them and expect:
# HELP messages Number of messages in the queue
# TYPE messages gauge
messages 4
You could leverage an HPA with a metric of type Pods to autoscale the workers.
Unfortunately, the _Pods_ metric for the HPA only averages the values. And since all consumers expose the same value, the average equals the value exposed by each Pod: it's always four, regardless of the number of Pods.
In other words, adding more Pods doesn't change the average value of the metric.
In this instance, the autoscaler will keep doubling the Pods until the metric is under the threshold. The number of Pods is not scaled proportionally with the size of the queue.
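To make that concrete, here is a minimal sketch of the Pods-type HPA I'm describing (all names and the target value are made up). With three replicas each reporting messages 4 against a targetAverageValue of 2, the controller computes roughly ceil(3 * 4 / 2) = 6 replicas, then ceil(6 * 4 / 2) = 12, and so on: the replica count keeps multiplying while the queue is non-empty, because adding Pods never lowers the reported average.

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: queue-worker            # hypothetical Deployment of consumers
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metricName: messages        # every Pod reports the full queue length
      targetAverageValue: 2       # with 4 messages queued, the average stays at 4 no matter how many replicas run
```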
The other option is to use an HPA with a metric of type Object, which is excellent because the metric computes a ratio between the target and the current value, so the workers could scale up linearly with the size of the queue. Unfortunately, the Object type cannot be applied to Pods that are created by a Deployment (or ReplicaSet). And from an engineering perspective, I need _someone_ to look after my Pods when they die. So an HPA with Object is not an option.
Long story short: as an example, if I defined a ratio of 10 messages per Pod and there are 72 messages in the queue, I'd expect the HPA to settle on 7 replicas.
As far as I understand, none of the current solutions allows me to do that. I'd like to put forward a proposal to do that.
/sig autoscaling
I'm trying to achieve the same thing, with the only exception that I'm running a dedicated exporter for the queue length. I have no problem using the Object-based HPA, though? I don't understand the limitation you have in that area.
That being said, I struggle with scaling smoothly. For some reason I can't confirm that the HPA sets the replicas to current/desired. Instead, what I see is that the pod count is doubled each time I reach the threshold. Then, when the metric is below the threshold, it doesn't do anything for several minutes until it scales to min replicas. I'd expect it to scale down linearly with the queue size, but somehow it doesn't.
Object-based HPA is for a single object. While it might work well for an ingress, it doesn't work well for a pod. In fact, I don't deploy just pods. I deploy pods in deployments or replica sets. If a pod dies, it is respawned (by the RS). That doesn't play nicely with the Object HPA because it assumes the name doesn't change.
The current algorithm for scaling applications is not linear. The code gives a conclusive answer, which translated into plain English reads as:
MAX(CURRENT_REPLICAS_LENGTH * 2, 4)
So it doubles all the time. You can notice a scale up or a scale down only every 3 and 5 minutes respectively. You should tune the parameters if you wish to have a more (or less) reactive cluster:
horizontal-pod-autoscaler-downscale-delay
horizontal-pod-autoscaler-sync-period
You should set those flags on the controller manager. I tested the above in minikube with the following command:
minikube start \
--memory 8096 \
--extra-config=controller-manager.horizontal-pod-autoscaler-upscale-delay=1m \
--extra-config=controller-manager.horizontal-pod-autoscaler-downscale-delay=2m \
--extra-config=controller-manager.horizontal-pod-autoscaler-sync-period=10s
I discussed all of the above in a nice blog post if you wish to have some context for my proposal.
@danielepolencic Thanks for the link to the code, that's very helpful!
I too run pods via deployments, so I'm still not sure about your problem. You can scale a Deployment based on an object metric:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: foo-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Object
    object:
      metricName: some_metric
      targetValue: 1
      target:
        kind: Deployment
        name: foo-deployment
In my case, the metric is exposed by the Pod, not the Deployment. Unless Deployments aggregate custom metrics from Pods, I won't be able to use that.
@danielepolencic Are you using Prometheus? In that case you can use relabeling to add a deployment label to any metric. In my case it comes from an exporter in a different deployment, but even if all your pods expose the metric, this should be doable with something like this:
- record: queue_messages
  expr: label_replace(queue_messages, "deployment", "your-deployment", "", "")
You could also join the metric with kube_pod_info if you don't want to hardcode this.
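For reference, a rough sketch of such a join, assuming queue_messages is scraped with namespace and pod labels, the Pods are owned by a ReplicaSet, and kube-state-metrics is running (the rule name and the regex that strips the ReplicaSet hash are illustrative):

```yaml
groups:
- name: queue
  rules:
  - record: deployment:queue_messages:max
    expr: |
      max by (namespace, deployment) (
        queue_messages
        * on (namespace, pod) group_left (deployment)
        label_replace(
          kube_pod_info{created_by_kind="ReplicaSet"},
          "deployment", "$1", "created_by_name", "(.*)-[^-]+"
        )
      )
```

Using max rather than sum avoids double counting when every Pod reports the same queue length; if each Pod only reports its own backlog, sum would be the right aggregation. How the new series gets mapped onto the Deployment object still depends on your prometheus-adapter configuration.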
That being said, I agree the overall UX should be improved here.
Where do I write that line? In the HPA?
@danielepolencic It's a recording rule for prometheus, that creates a new timeseries which k8s-metrics-adapter should pick up as custom metric: https://prometheus.io/docs/practices/rules/
That being said, I realized that just scaling on the queue length doesn't work well for queues with high throughput: the autoscaler scales to min whenever the queue becomes empty (which it usually does in our case), the queue then fills up rapidly, and it scales back up to max.
We'll solve this by scaling on both the queue size and the ingress rate, but I agree that in general this should be easier. Something like a PID loop might make the most sense.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Hi.
I'm considering adding support for scaling RabbitMQ priority queues and am thinking about which metrics to use. The use case is online and offline job processing.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Something to consider (and to make sure I understand this correctly...): scaling will not be linear with the queue size, and the number of workers will never be proportional to the number of messages. The growth rate will be proportional to the queue size (in theory; in practice, as you've shown in the code, it's not, but never mind...).
The auto-scaler always tries to reach a ratio of 1 between the metric and the target. So if you are using the queue size as your metric, as long as the queue size is larger than the target, it will scale up, and up, and up, until it reaches the target, at which point it stays stable. Once the queue starts shrinking (obviously more pods consume messages faster, so the queue will shrink), it does the opposite (again exponentially, with the velocity proportional to the queue size, not the replica count). What you need to know is the "ideal" size of your queue, that is, the size that gives you the best performance, and set that as the target.

Unlike "Pods" metrics, with "Object" metrics the autoscaling API (at least v2beta1) does not divide by the number of current pods, so if we want proportional scaling, we need something else. Not sure what, though the actual messages handled per second reported by each pod sounds good enough...
@regmilton Thanks for this. I couldn’t have put it better myself.
so if we want proportional scaling, we need something else.
As a developer, that would be my expectation when I design a queue based architecture. The more messages in the queue, the more Pods I'd like to be scheduled.
actual messages handled per second reported by each pod sounds good enough
That sounds close enough to me. But is it applicable to "Pods" or "Objects"?
At the moment I use 3 metrics for pod autoscaling for 3 cases:
As a developer, that would be my expectation when I design a queue based architecture. The more messages in the queue, the more Pods I'd like to be scheduled.
That only applies when you're catching up on a queue. I felt the same way, but it didn't work well: as soon as the messages drop to zero, the autoscaler scales down to the minimum, causing the queue to back up quickly. @regmilton describes this very well.
@discordianfish _I think_ that's still fine. Catch-up queues with a minimum number of replicas in the HPA should do the job. If the scaling is delayed, that's fine. I'm using queues because I want to decouple producers from consumers; I don't wish to have immediate scaling as more messages are poured into the queue. Having the Pods scale immediately as more messages are produced sounds like sync-like behaviour to me.
@danielepolencic - I'm facing the same issue. Aside from the inconvenience you listed, wouldn't the metric of type External satisfy the requirement since you can describe how the metric is used (Value, AverageValue or Utilization)? Perhaps I'm missing something.
For full disclosure, I'm currently using a custom metric implemented via the Prometheus adapter, but do plan on converting to an external metric once the PR is merged. I'm most definitely expecting this to work as advertised.
/remove-lifecycle rotten
I thought External is meant for external resources such as Azure Service Bus or Google BigQuery, not Pods. Can you use External with objects inside the cluster?
The way I read this is that the _metric_ itself is external to the cluster. How that metric is ingested is somewhat irrelevant; for instance, you could have Prometheus scrape the data directly from the Azure Service Bus, or AWS SQS. But nothing prevents you from exposing that metric to Prometheus from inside the k8s cluster: the object used to expose that metric (perhaps a Service) is completely irrelevant as far as the HPA is concerned.
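To make that concrete, here's a rough sketch of what an External-metric HPA could look like on autoscaling/v2beta1 (the metric and label names are made up and depend entirely on what your metrics adapter exposes). Because it uses targetAverageValue, the controller aims for roughly 10 messages per replica, which is the proportional behaviour asked for at the top of this issue:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker              # hypothetical consumer Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metricName: queue_messages    # whatever name your adapter serves
      metricSelector:
        matchLabels:
          queue: my-queue           # hypothetical selector
      targetAverageValue: "10"      # aim for ~10 messages per replica
```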
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/reopen
/remove-lifecycle stale
/remove-lifecycle rotten
/lifecycle freeze
/cc @fejta Feedback: I don't like it.
As an FYI, we attempted to solve some of these problems by "wrapping" the external metrics behind a custom resource definition called a ScaledObject in our project KEDA. I'd love to see more of this just baked into the HPA, but for now it allows defining a target threshold (e.g. Kafka lag or RabbitMQ queue length) and driving the scale of a deployment from that, as well as potentially scaling all the way to zero when there are no events.
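For anyone landing here later, a rough sketch of what a ScaledObject can look like, here using KEDA's Prometheus scaler against the same kind of queue_messages metric discussed above (KEDA also ships native RabbitMQ and Kafka scalers whose metadata differs). Field names have changed between KEDA releases, and the server address, query and threshold are made up, so check the KEDA docs for your version:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker                 # the Deployment to scale
  minReplicaCount: 0                   # KEDA can scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # illustrative address
      metricName: queue_messages
      query: sum(queue_messages)       # the custom metric exposed by the exporter/consumers
      threshold: "10"                  # target messages per replica
```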
/remove-lifecycle frozen
This works for us, and may work for others experiencing this issue. We use a metric that captures the number of messages dequeued in the last minute and specify the threshold using targetAverageValue. Because our microservices are throttled at X messages per minute, we can detect whether they're under full load.
This is used in conjunction with a queue-size metric. If the queue size exceeds Y, we scale up the number of instances. Even if the queue size drops to 0, as long as the average number of messages dequeued by each instance matches X, it will not scale down.
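In case it helps, a sketch of roughly how that combination can be expressed as two metrics in a single HPA (all names and numbers are placeholders for whatever your adapter exposes; the HPA controller computes a desired replica count per metric and takes the largest):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
  # per-instance throughput: while each instance dequeues close to its throttle (X),
  # the workload is under full load and the HPA will not scale down
  - type: Pods
    pods:
      metricName: messages_dequeued_per_minute
      targetAverageValue: "100"        # X, the per-instance throttle
  # backlog: if the queue grows past Y, add instances
  - type: Object
    object:
      metricName: queue_messages
      targetValue: "500"               # Y
      target:
        kind: Service
        name: queue-exporter           # hypothetical object the metric is attached to
```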
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
/remove-lifecycle rotten
@itssimon: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
/remove-lifecycle rotten
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
And another issue "fixed" by @fejta-bot... 💪
@danielepolencic would you mind reopening as it remains a relevant and open issue / feature request.
/reopen
@danielepolencic: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I know this was just reopened -- but the community repo is used as the general management point for the Kubernetes contributor community. Discussions like this are not likely to show up on the SIG stakeholders' radar. =/
If you'd like to get more visibility on the topic I'd move the convo to the autoscaling mailing list / slack and aim to propose a KEP with the changes.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
As mentioned 90 days ago, this repo is no longer used for SIG design discussions. Closing this out since it is unlikely to be helpful. If you'd like to get more visibility on the topic I'd move the convo to the autoscaling mailing list / slack and aim to propose a KEP with the changes.
/close
@coderanger: Closing this issue.
In response to this:
As mentioned 90 days ago, this repo is no longer used for SIG design discussions. Closing this out since it is unlikely to be helpful. If you'd like to get more visibility on the topic I'd move the convo to the autoscaling mailing list / slack and aim to propose a KEP with the changes.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@danielepolencic this can help: https://github.com/practo/k8s-worker-pod-autoscaler
We moved away from HPA custom metrics and built this, since queue-based scaling has its own intricacies that are better handled separately and work better when tied to the controller. All possible now that we have CRDs :)
We have added the queue we are using and are looking for people to contribute their queue providers, if they find this useful.