Keda: Long Running Workloads

Created on 7 May 2019  ·  19 Comments  ·  Source: kedacore/keda

I've been browsing through the source code, so apologies if this is supported and I just missed it.

Does keda scaling support long running workloads from queues?

If keda scales up a deployment to 1 pod because there is a queue with an item (assuming that's your scale config), and that pod pulls down the job to work on it and it's long running (say hours), will keda scale the deployment back down, causing k8s to kill the pod once the cooldown period is reached? From code browsing, it looks like it would?

If it's not supported now, any plans on the roadmap? Or just not a use case keda is designed for?

pending enhancement needs-discussion

All 19 comments

will keda scale the deployment back down, causing k8s to kill the pod once the cooldown period is reached? From code browsing, it looks like it would?

Yes, today that will happen based on the cooldownPeriod, which defaults to 5 minutes. However, it's a static value per ScaledObject, so if the long running workload is not consistent in how long it takes, you'd have to set it to the largest possible value.
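
To make the cooldown behavior concrete, here is a minimal sketch of the decision described above. It is a simplified model, not KEDA's actual code: `should_scale_to_zero` and `required_cooldown` are hypothetical helper names, and the only value taken from the thread is the 5 minute default.

```python
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical model of the cooldown check; not KEDA's implementation.
DEFAULT_COOLDOWN = timedelta(minutes=5)  # default mentioned in the thread


def should_scale_to_zero(last_active: datetime, now: datetime,
                         cooldown: timedelta = DEFAULT_COOLDOWN) -> bool:
    """True once the event source has been inactive for the whole cooldown."""
    return now - last_active >= cooldown


def required_cooldown(job_durations: Iterable[timedelta]) -> timedelta:
    """A static cooldown has to cover the longest job a worker might run."""
    return max(job_durations)


# The queue was last seen non-empty at 12:00, when the worker took the only item.
taken_at = datetime(2019, 5, 7, 12, 0)
six_minutes_later = taken_at + timedelta(minutes=6)

# With the default cooldown, the pod is removed while the job is still running.
print(should_scale_to_zero(taken_at, six_minutes_later))
# With a cooldown sized to the longest job, it is not.
longest = required_cooldown([timedelta(hours=1), timedelta(hours=3)])
print(should_scale_to_zero(taken_at, six_minutes_later, cooldown=longest))
```

The second call shows why the value has to be sized for the worst case: any job longer than the configured cooldown is still exposed.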

Ideally the scaledown logic should measure the cooldownPeriod from the last function execution or function in progress heartbeat, not the last check of the event source. This would require keda to have access to the metrics emitted by the container.

I think enabling keda to incorporate Prometheus queries into its scaling logic would allow for that logic, but we didn't want to have a hard dependency on Prometheus. #156 is tracking Prometheus integration.

I guess one gross hack today is to have pod workers continue to update the ScaledObject update stamp while processing a long job.

That should prevent the cooldown from triggering? Assuming you update faster than the cooldown time.

Currently there are no plans to address this, but I don't think the use case is invalid.

Like @ahmelsayed said, if you know your processing time takes X, you can set the cooldownPeriod to X.

Having the pods report ongoing work to Keda, or having Keda probe pods for ongoing work, is a path one does not take lightly.

@whilke The best solution is to remove the item from the queue after you finish processing your job, and not before.

Circling back on this one, as it's an area we've discussed some and wanted to get our thoughts down. I think there are 2 scenarios here:

  1. Preventing KEDA from scaling a deployment to zero if there is work being executed
  2. Preventing the HPA from scaling down a deployment that may have active work happening on it

In the case of 1 I think the recommendations for now would be to:

  • Don't remove the event item / checkpoint / complete until processing is completed. Ideally KEDA still sees at least 1 item still alive on the queue. This isn't always possible, but where it is, it's the best way to handle.
  • You may need to make your cooldown period as long as the potential execution time.
  • Potentially something else like whatever is the result of #2 below

For the case of 2, that can happen when the rate of events decreases and the Kubernetes HPA decides to scale down the deployment. If you have 5 replicas and replica A is 20 minutes into a 40 minute execution, it's possible that the HPA will scale down replica A and terminate it. In this case we've discussed a few options. I'm not sure what the best one would be, but I plan to bring this up with the autoscaling SIG and see if there are any thoughts around what could be done to "postpone" or delay scaling down an instance if it signals somehow that it is in the middle of some work. @yaron2 also brought up that maybe the right Kubernetes paradigm for these types of long running jobs may be a Kubernetes Job - created this issue to help track that potential approach as well.

Related: https://github.com/kubernetes/kubernetes/issues/45509 - discussion around scaling down via removing specific pods & queue driven long running workloads

@jeffhollan, Jobs as they are implemented today would have a lot of problems handling queue based long running workloads.

They work well for processing through a queue of items, and can scale the jobs based on your own parallel logic, but once the queue goes cold and all the jobs finish you'd need some middleware watching the queue to spin up a new job when new items come back in.

Sounds ripe for race conditions, or over spinning jobs.

@yaron2 That doesn't solve the problem, and in fact wouldn't that result in spinning up to max pods? If you have a scale metric set for 1 queue item, and you start processing the only item but keep it in the queue, keda & hpa are going to see you still have items in your queue and will keep scaling up to try and handle it.

I also don't know of any queue implementations that will keep the item visible in the queue while you're processing it, otherwise it can get picked up by another listener. That goes for at most once, and at least once implementations.

and in fact wouldn't that result in spinning up to max pods?

No, that's not the case. The HPA will scale to meet the threshold; it (for good or for bad) has no knowledge of whether or not you actually finished processing the work. It will bring up the number of pods that meets the threshold metrics.
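
A small sketch of why one item left in the queue does not drive the deployment to max pods. It uses the replica calculation documented for the HPA, which for an average-value target reduces to the total metric divided by the per-replica target, rounded up. The function name and the default bounds are hypothetical, and the sketch ignores the HPA's tolerance and stabilization behavior.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """ceil(total metric / per-replica target), clamped to the allowed range."""
    raw = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# One item still in the queue with a target of 1 item per pod: stay at 1 pod.
print(desired_replicas(queue_length=1, target_per_replica=1))
# 25 items with a target of 5 per pod: 5 pods, not the maximum of 10.
print(desired_replicas(queue_length=25, target_per_replica=5))
```

The replica count tracks the queue length relative to the target, so it only reaches the maximum when the backlog actually calls for it.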

I also don't know of any queue implementations that will keep the item visible in the queue while you're processing it

Basically every queue system I know has a Peek-Lock functionality or a means of achieving it (manual acknowledgements, i.e. noAck set to false, for RabbitMQ).

A consumer can lock and get the message, and the message gets deleted from the queue when the job is done.
If the consumer crashes, there's usually a timeout that releases the lock after a certain period of time.
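
The peek-lock behavior described above can be illustrated with a toy in-memory queue. This is only a sketch: `PeekLockQueue` and its methods are hypothetical names, not any broker's API, and whether locked messages count toward the length a scaler reads varies by broker (the sketch assumes they do).

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Message:
    id: int
    body: str
    locked_until: float = 0.0  # no other consumer may receive it before this time


class PeekLockQueue:
    """Toy queue: receive() locks a message, complete() deletes it."""

    def __init__(self, lock_seconds: float) -> None:
        self.lock_seconds = lock_seconds
        self._messages: Dict[int, Message] = {}
        self._ids = itertools.count(1)

    def send(self, body: str) -> int:
        msg = Message(next(self._ids), body)
        self._messages[msg.id] = msg
        return msg.id

    def receive(self, now: float) -> Optional[Message]:
        """Lock and return the first message whose lock has expired, if any."""
        for msg in self._messages.values():
            if msg.locked_until <= now:
                msg.locked_until = now + self.lock_seconds
                return msg
        return None

    def complete(self, msg_id: int) -> None:
        """Delete the message once the job is done."""
        self._messages.pop(msg_id, None)

    def length(self) -> int:
        """Messages not yet completed, including locked ones (an assumption)."""
        return len(self._messages)


queue = PeekLockQueue(lock_seconds=30)
job_id = queue.send("long job")
first = queue.receive(now=0)      # consumer A locks the message
second = queue.receive(now=10)    # consumer B gets nothing while it is locked
print(first.id == job_id, second is None, queue.length())
queue.complete(job_id)            # removed only after processing finishes
```

If consumer A crashed instead of calling `complete`, a `receive` at or after `now=30` would hand the same message to another consumer, which is the timeout release mentioned above.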

Queues (normally) support a single consumer per messageID per queue, meaning multiple consumers can't get the same message on a given queue.

@lee0c is working on a proposal for the "job dispatch" pattern. Any questions, comments, upvotes, etc. should be focused on #199 to see if that's one viable option to solve this that KEDA could provide. I provided some background and questions on that issue as well. Thanks @lee0c

I have a slightly alternative proposal for long running jobs that would also work for functions and other "listener" style applications rather than job style application.

Similar to this proposal, we would have a scaleType: field whose options are simple and hpa, with hpa being the current behavior. In simple scale mode, there is no HPA; Keda handles the scaling. This means that you'll lose HPA scaling metrics like CPU and memory, but you should still be able to get Keda metric scaling.

Not having HPA allows us to implement a simple "I'm busy" protocol with the application.

Possible options:

  1. Keda-injected endpoint the application reaches out to.
  2. File modified time in the container filesystem.

For #1: Keda could inject a KEDA_STATE_URI=https://<keda-service>:<port>/api/.... KEDA_STATE_TIMEOUT=5m. The app is expected to curl the url every 5 minutes if trigger is not active but busy. This will reset the last-active timer in Keda for the deployment

For #2: The app is expected to touch a file KEDA_STATE_FILE=/run/application/is_busy every timeout period if trigger is not active, but the app is busy. keda will exec stat the file and reset last active timer if needed. This requires stat in the container.

This should allow most applications (from functions to bash scripts) to easily signal out to Keda that they are busy.
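
As a rough illustration of the file-based option, the two sides of the proposed protocol could look like the sketch below. Nothing here exists in KEDA: the function names are hypothetical, the path stands in for the proposed KEDA_STATE_FILE, and the checker is shown as a local stat call rather than an exec into the container.

```python
import time
from pathlib import Path
from typing import Optional


def mark_busy(state_file: Path) -> None:
    """Worker side: refresh the file's modification time, like `touch`."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.touch()


def is_busy(state_file: Path, timeout_seconds: float,
            now: Optional[float] = None) -> bool:
    """Checker side: busy if the file was touched within the timeout."""
    now = time.time() if now is None else now
    try:
        mtime = state_file.stat().st_mtime
    except FileNotFoundError:
        return False  # never touched, so the worker never reported being busy
    return now - mtime <= timeout_seconds
```

A worker would call `mark_busy` more often than the timeout while it processes a long message, and the scaler would reset its last-active timer whenever `is_busy` returns True.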

For #1: Keda could inject a KEDA_STATE_URI=https://<keda-service>:<port>/api/.... KEDA_STATE_TIMEOUT=5m. The app is expected to curl the url every 5 minutes if trigger is not active but busy. This will reset the last-active timer in Keda for the deployment

Does the app have to initiate it or does KEDA initiate it? Personally I'd prefer not to change my app because we choose to use KEDA. Having a "busy" endpoint next to (or even same as) my health endpoint would be a better fit which can be consumed.

This is the approach we went (having the scaler check an instance if it's busy). It follows the normal health/ready check pattern. Also reduces race type conditions as it's a live check when the scaler is looking to scale down a set.

We could do that too. My only reservation is about expecting the pod to have an HTTP endpoint. Since all KEDA applications are meant to process polling-based events, I'm not sure most pods would need to expose an HTTP endpoint, and depending on the framework you're using, it might not be simple to add one to a simple queue or topic consumer app.

Since KEDA doesn't require deployments to expose a service, we will need KEDA to port-forward from each pod to query the http endpoint inside.

To me option #1 above is if you want to push your state into KEDA and option #2 is if you want KEDA to poll your state, but a file is simpler than exposing an http endpoint.

option #1 and option #2 smell, a lot.

#1 makes your data very stale and introduces easy race conditions from when keda last got status and when it's terminating pods (where the pod might have picked up a new job by then).

#2 requires workers to implement the status file correctly or it's worse than #1 (same race condition issue plus all the overhead of requiring an exec stat). It still doesn't stop race conditions completely since it's a disconnected status system.

KEDA could create a headless service for the deployment, which would give a DNS entry for every pod so it can connect directly to an endpoint instead of port-forwarding.

I still think it's the best option, for an optional feature, even though it requires the worker to listen on a socket.

I'm not sure how exposing an endpoint resolves the race condition though. KEDA could query the endpoint and get "not busy", then start terminating the pod while the pod has picked up a new work item in the meantime. To eliminate that race condition we'd have to re-implement the SIGTERM, SIGKILL approach of Kubernetes, where we inform a pod that we want to shut it down and the pod says "yes or no". If it says "yes, I'm ready to be terminated", then it's on your application to not accept any new tasks until it's terminated. This is quite a bit more involved than just an "is busy" endpoint.

What are the benefits of creating a service per-pod vs port-forward a pod while checking its status? Also what are the benefits of requiring every queue consumer to implement an HTTP interface to signal busy state as opposed to touch $KEDA_STATE_FILE or curl $KEDA_STATE_URI on a timer while processing a queue message that you expect to take a long time given that both mechanisms are susceptible to the same type of race condition?

If the problem we're trying to solve is KEDA terminating a 3 hour job that's 2 hours into processing, then those race conditions are not applicable.

If the problem is KEDA terminating a pod that has received a work item between the time of KEDA checking the "busy" status and terminating the pod, then we need the pod to cooperate in the shutdown process, which is more involved.
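
The cooperative shutdown described here can be sketched as a small handshake in which answering "yes" also stops the worker from taking new work, which is what closes the check-then-terminate race. The `Worker` class and its method names are hypothetical and only illustrate the idea within a single process.

```python
import threading


class Worker:
    """Hypothetical worker that cooperates with a scaler during shutdown."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._busy = False
        self._draining = False

    def try_start_job(self) -> bool:
        """Called before pulling a work item; refused once shutdown was granted."""
        with self._lock:
            if self._draining or self._busy:
                return False
            self._busy = True
            return True

    def finish_job(self) -> None:
        with self._lock:
            self._busy = False

    def request_shutdown(self) -> bool:
        """Scaler asks "may I terminate you?"; a yes is binding on the worker."""
        with self._lock:
            if self._busy:
                return False
            self._draining = True  # from now on no new work is accepted
            return True


worker = Worker()
worker.try_start_job()
print(worker.request_shutdown())  # refused while a job is in flight
worker.finish_job()
print(worker.request_shutdown())  # granted once idle
print(worker.try_start_job())     # new work is refused after the grant
```

Because the busy check and the switch to draining happen under the same lock, there is no window in which the worker reports "not busy" and then picks up an item.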

@ahmelsayed sorry for resurrecting an old thread. Isn't this exactly what liveness probes do? Are you suggesting using them or implementing this as something brand new? I am playing with pod disruption budgets to see if the HPA respects them. If that's the case, we can have the pod clear the label when it is not "busy" and let the HPA scale it down.

Is this still open? I see https://github.com/kedacore/keda/pull/322 merged which gives external checkpoint functionality, was wondering if these 2 issues are different.

Yes, the only gap now is documentation. I think the two recommendations are:

  1. Use the pod's preStop lifecycle hook to delay termination until your application has finished its in-flight work and can shut down cleanly on SIGTERM, which lets Kubernetes know it's OK to remove the pod. The upside of this approach is that you can still use standard deployments / concurrent containers. The downside is that if a job delays scale-down for a long time (say hours), the Kubernetes API will report the pod's status as "Terminating", which may be confusing to operators.
  2. Use the new jobs functionality to consume each long running task as a job. The upside is that there's a single atomic job per long running operation, and its status will just be "Running". The downside is that it requires you to architect your app so that the container pulls only a single message and then terminates.
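
For the first recommendation, the application side might look like the following sketch: the SIGTERM handler only sets a flag, so the job in flight finishes and no new one is started. `GracefulWorker` is a hypothetical name, and the sketch assumes the pod's terminationGracePeriodSeconds is set long enough to cover the job, since Kubernetes sends SIGKILL once that period expires.

```python
import signal
from typing import Callable, Iterable, List


class GracefulWorker:
    """Hypothetical worker that drains on SIGTERM instead of dying mid-job."""

    def __init__(self) -> None:
        self.stop_requested = False

    def install(self) -> None:
        """Register the handler; call this from the main thread at startup."""
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Only record the request; the job in flight keeps running.
        self.stop_requested = True

    def run(self, jobs: Iterable[str], handle: Callable[[str], None]) -> List[str]:
        """Process jobs until the source is empty or a stop was requested."""
        done: List[str] = []
        for job in jobs:
            if self.stop_requested:
                break  # do not pick up new work after SIGTERM
            handle(job)
            done.append(job)
        return done
```

The second recommendation corresponds to calling `run` with a source that yields a single message and then exiting, so each container handles exactly one long running operation.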