@kubernetes/sig-autoscaling-proposals
@kubernetes/sig-autoscaling-reviews
The current best practice to autoscale pods based on the number of messages in a queue is to use an HPA with a metric of type External.
While this is good in theory, it's somewhat inconvenient in practice.
As a developer, I'd rather not expose the underlying message broker. In fact, I'd probably create an abstraction to access it.
If later on, I wish to replace RabbitMQ with Kafka, I can make the change and be confident that nothing will break because the public interface hasn't changed.
So while _External_ might have specific use cases, it's a less convenient way to scale my (micro) services since it exposes the message broker directly.
A more attractive option is to wrap the queue into a microservice.
Imagine you have a publisher, a consumer and a queue that decouples the two.
The publisher sends messages to the queue, the consumer picks them up at its own pace.
Typically, I'd expose custom metrics for the queue from the consumer, the publisher or both since they already know how to talk to the queue.
If you access /metrics on the consumer, you should expect a custom metric that contains the length of the queue (amongst other things). If you have three pods and there are four messages in the queue, you could request the /metrics endpoint on each of them and expect:
# HELP messages Number of messages in the queue
# TYPE messages gauge
messages 4
You could leverage an HPA with a metric of type Pods to autoscale the workers.
Unfortunately, the _Pods_ metric for the HPA only averages the values. And since all consumers expose the same value, the average equals the value exposed by each Pod: it's always four, regardless of the number of Pods.
In other words, adding more Pods doesn't change the average value of the metric.
In this instance, the autoscaler will keep doubling the Pods until the metric is under the threshold. The number of Pods is not scaled proportionally with the size of the queue.
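To make that concrete, here is a minimal sketch of the Pods-type HPA I'm describing (all names and the target value are made up). With three replicas each reporting messages 4 against a targetAverageValue of 2, the controller computes roughly ceil(3 * 4 / 2) = 6 replicas, then ceil(6 * 4 / 2) = 12, and so on: the replica count keeps multiplying while the queue is non-empty, because adding Pods never lowers the reported average.

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: queue-worker            # hypothetical Deployment of consumers
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metricName: messages        # every Pod reports the full queue length
      targetAverageValue: 2       # with 4 messages queued, the average stays at 4 no matter how many replicas run
```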
The other option is to use an HPA with a metric of type Object, which is excellent because the metric computes a ratio between the target and the current value, so the workers could scale up linearly with the size of the queue. Unfortunately, the Object type cannot be applied to Pods that are created by a Deployment (or ReplicaSet). And from an engineering perspective, I need _someone_ to look after my Pods when they die. So an HPA with Object is not an option.
Long story short: as an example, if I defined a ratio of 10 messages per Pod and there are 72 messages in the queue, I'd expect the HPA to settle on 7 replicas.
As far as I understand, none of the current solutions allows me to do that. I'd like to put forward a proposal to do that.
/sig autoscaling
I'm trying to achieve the same thing, with the only exception that I'm running a dedicated exporter for the queue length. I have no problem using the Object-based HPA, though? I don't understand the limitation you have in that area.
That being said, I struggle with scaling smoothly. For some reason I can't confirm that the HPA sets the replicas to current/desired. Instead, what I see is that the pod count is doubled each time I reach the threshold. Then, when the metric is below the threshold, it doesn't do anything for several minutes until it scales to min replicas. I'd expect it to scale down linearly with the queue size, but somehow it doesn't.
Object-based HPA is for a single object. While it might work well for an ingress, it doesn't work well for a pod. In fact, I don't deploy just pods. I deploy pods in deployments or replica sets. If a pod dies, it is respawned (by the RS). That doesn't play nicely with the Object HPA because it assumes the name doesn't change.
The current algorithm for scaling applications is not linear. The code gives a conclusive answer, which translated into plain English reads as:
MAX(CURRENT_REPLICAS_LENGTH * 2, 4)
So it doubles all the time. You can notice a scale up or a scale down only every 3 and 5 minutes respectively. You should tune the parameters if you wish to have a more (or less) reactive cluster:
horizontal-pod-autoscaler-downscale-delay
horizontal-pod-autoscaler-sync-period
You should set those flags on the controller manager. I tested the above in minikube with the following command:
minikube start \
--memory 8096 \
--extra-config=controller-manager.horizontal-pod-autoscaler-upscale-delay=1m \
--extra-config=controller-manager.horizontal-pod-autoscaler-downscale-delay=2m \
--extra-config=controller-manager.horizontal-pod-autoscaler-sync-period=10s
I discussed all of the above in a nice blog post if you wish to have some context for my proposal.
@danielepolencic Thanks for the link to the code, that's very helpful!
I too run pods via deployments, so I'm still not sure about your problem. You can scale a Deployment based on an object metric:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: foo-deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Object
    object:
      metricName: some_metric
      targetValue: 1
      target:
        kind: Deployment
        name: foo-deployment
In my case, the metric is exposed by the Pod, not the Deployment. Unless Deployments aggregate custom metrics from Pods, I won't be able to use that.
@danielepolencic Are you using Prometheus? In that case you can use relabeling to add a deployment label to any metric. In my case it comes from an exporter in a different deployment, but even if all your pods expose the metric, this should be doable with something like this:
- record: queue_messages
  expr: label_replace(queue_messages, "deployment", "your-deployment", "", "")
You could also join the metric with kube_pod_info if you don't want to hardcode this.
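For reference, a rough sketch of such a join, assuming queue_messages is scraped with namespace and pod labels, the Pods are owned by a ReplicaSet, and kube-state-metrics is running (the rule name and the regex that strips the ReplicaSet hash are illustrative):

```yaml
groups:
- name: queue
  rules:
  - record: deployment:queue_messages:max
    expr: |
      max by (namespace, deployment) (
        queue_messages
        * on (namespace, pod) group_left (deployment)
        label_replace(
          kube_pod_info{created_by_kind="ReplicaSet"},
          "deployment", "$1", "created_by_name", "(.*)-[^-]+"
        )
      )
```

Using max rather than sum avoids double counting when every Pod reports the same queue length; if each Pod only reports its own backlog, sum would be the right aggregation. How the new series gets mapped onto the Deployment object still depends on your prometheus-adapter configuration.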
That being said, I agree the overall UX should be improved here.
Where do I write that line? In the HPA?
@danielepolencic It's a recording rule for prometheus, that creates a new timeseries which k8s-metrics-adapter should pick up as custom metric: https://prometheus.io/docs/practices/rules/
That being said, I realized that just scaling on the queue length doesn't work well for queues with high throughput: the autoscaler scales to min whenever the queue becomes empty (which it usually does in our case), the queue then fills up rapidly, and it scales back up to max.
We'll solve this by scaling on both the queue size and the ingress rate, but I agree that in general this should be easier. Something like a PID loop might make the most sense.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Hi.
I'm considering adding support for scaling RabbitMQ priority queues and am thinking about which metrics to use. The use case is online and offline job processing.
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Something to consider (and to make sure I understand this correctly...): scaling will not be linear with the queue size, and the number of workers will never be proportional to the number of messages. The growth rate will be proportional to the queue size (in theory; in practice, as you've shown in the code, it's not, but never mind...).
The auto-scaler always tries to reach a ratio of 1 between the metric and the target. So if you are using the queue size as your metric, as long as the queue size is larger than the target, it will scale up, and up, and up, until it reaches the target, at which point it stays stable. Once the queue starts shrinking (obviously more pods consume messages faster, so the queue will shrink), it does the opposite (again exponentially, with the velocity proportional to the queue size, not the replica count). What you need to know is the "ideal" size of your queue, that is, the size that gives you the best performance, and set that as the target.

Unlike "Pods" metrics, with "Object" metrics the autoscaling API (at least v2beta1) does not divide by the number of current pods, so if we want proportional scaling, we need something else. Not sure what, though the actual messages handled per second reported by each pod sounds good enough...
@regmilton Thanks for this. I couldn’t have put it better myself.
so if we want proportional scaling, we need something else.
As a developer, that would be my expectation when I design a queue based architecture. The more messages in the queue, the more Pods I'd like to be scheduled.
actual messages handled per second reported by each pod sounds good enough
That sounds close enough to me. But is it applicable to "Pods" or "Objects"?
At the moment I use 3 metrics for pod autoscaling for 3 cases:
As a developer, that would be my expectation when I design a queue based architecture. The more messages in the queue, the more Pods I'd like to be scheduled.
That only applies when you're catching up on a queue. I felt the same way, but it didn't work well: as soon as the messages drop to zero, the autoscaler scales down to the minimum, causing the queue to back up quickly. @regmilton describes this very well.
@discordianfish _I think_ that's still fine. Catch-up queues with a minimum number of replicas in the HPA should do the job. If the scaling is delayed, that's fine. I'm using queues because I want to decouple producers from consumers; I don't wish to have immediate scaling as more messages are poured into the queue. Having the Pods scale immediately as more messages are produced sounds like sync-like behaviour to me.
@danielepolencic - I'm facing the same issue. Aside from the inconvenience you listed, wouldn't the metric of type External satisfy the requirement since you can describe how the metric is used (Value, AverageValue or Utilization)? Perhaps I'm missing something.
For full disclosure, I'm currently using a custom metric implemented via the Prometheus adapter, but do plan on converting to an external metric once the PR is merged. I'm most definitely expecting this to work as advertised.
/remove-lifecycle rotten
I thought External is meant for external resources such as Azure Service Bus or Google BigQuery, not Pods. Can you use External with objects inside the cluster?
The way I read this is that the _metric_ itself is external to the cluster. How that metric is ingested is somewhat irrelevant; for instance, you could have Prometheus scrape the data directly from the Azure Service Bus, or AWS SQS. But nothing prevents you from exposing that metric to Prometheus from inside the k8s cluster: the object used to expose that metric (perhaps a Service) is completely irrelevant as far as the HPA is concerned.
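To make that concrete, here's a rough sketch of what an External-metric HPA could look like on autoscaling/v2beta1 (the metric and label names are made up and depend entirely on what your metrics adapter exposes). Because it uses targetAverageValue, the controller aims for roughly 10 messages per replica, which is the proportional behaviour asked for at the top of this issue:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker              # hypothetical consumer Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metricName: queue_messages    # whatever name your adapter serves
      metricSelector:
        matchLabels:
          queue: my-queue           # hypothetical selector
      targetAverageValue: "10"      # aim for ~10 messages per replica
```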
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/reopen
/remove-lifecycle stale
/remove-lifecycle rotten
/lifecycle freeze
/cc @fejta Feedback: I don't like it.
As an FYI, we attempted to solve some of these problems by "wrapping" the external metrics behind a custom resource definition called a ScaledObject in our project KEDA. I'd love to see more of this just baked into the HPA, but for now it allows defining a target threshold (e.g. Kafka lag or RabbitMQ queue length) and driving the scale of a deployment from that, as well as potentially scaling all the way to zero when there are no events.
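For anyone landing here later, a rough sketch of what a ScaledObject can look like, here using KEDA's Prometheus scaler against the same kind of queue_messages metric discussed above (KEDA also ships native RabbitMQ and Kafka scalers whose metadata differs). Field names have changed between KEDA releases, and the server address, query and threshold are made up, so check the KEDA docs for your version:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker                 # the Deployment to scale
  minReplicaCount: 0                   # KEDA can scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # illustrative address
      metricName: queue_messages
      query: sum(queue_messages)       # the custom metric exposed by the exporter/consumers
      threshold: "10"                  # target messages per replica
```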
/remove-lifecycle frozen
This works for us, and may work for others experiencing this issue. We use a metric that captures the number of messages dequeued in the last minute and specify the threshold using targetAverageValue. Because our microservices are throttled at X messages per minute, we can detect whether they're under full load.
This is used in conjunction with a queue-size metric. If the queue size exceeds Y, we scale up the number of instances. Even if the queue size drops to 0, as long as the average number of messages dequeued by each instance matches X, it will not scale down.
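In case it helps, a sketch of roughly how that combination can be expressed as two metrics in a single HPA (all names and numbers are placeholders for whatever your adapter exposes; the HPA controller computes a desired replica count per metric and takes the largest):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
  # per-instance throughput: while each instance dequeues close to its throttle (X),
  # the workload is under full load and the HPA will not scale down
  - type: Pods
    pods:
      metricName: messages_dequeued_per_minute
      targetAverageValue: "100"        # X, the per-instance throttle
  # backlog: if the queue grows past Y, add instances
  - type: Object
    object:
      metricName: queue_messages
      targetValue: "500"               # Y
      target:
        kind: Service
        name: queue-exporter           # hypothetical object the metric is attached to
```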
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
/remove-lifecycle rotten
@itssimon: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
/reopen
/remove-lifecycle rotten
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
And another issue "fixed" by @fejta-bot... 💪
@danielepolencic would you mind reopening as it remains a relevant and open issue / feature request.
/reopen
@danielepolencic: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I know this was just reopened -- but the community repo is used as the general management point for the Kubernetes contributor community. Discussions like this are not likely to show up on the SIG stakeholders' radar. =/
If you'd like to get more visibility on the topic I'd move the convo to the autoscaling mailing list / slack and aim to propose a KEP with the changes.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
As mentioned 90 days ago, this repo is no longer used for SIG design discussions. Closing this out since it is unlikely to be helpful. If you'd like to get more visibility on the topic I'd move the convo to the autoscaling mailing list / slack and aim to propose a KEP with the changes.
/close
@coderanger: Closing this issue.
In response to this:
As mentioned 90 days ago, this repo is no longer used for SIG design discussions. Closing this out since it is unlikely to be helpful. If you'd like to get more visibility on the topic I'd move the convo to the autoscaling mailing list / slack and aim to propose a KEP with the changes.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@danielepolencic this can help: https://github.com/practo/k8s-worker-pod-autoscaler
We moved away from HPA custom metrics and built this, since queue-based scaling has its own intricacies that are better handled separately and work better when tied to the controller. All possible now that we have CRDs :)
We have added the queue we are using and are looking for people to contribute their queue providers, if they find this useful.