Caching happens in-memory and does not seem to be configurable to use any other back-end.
Allow for custom cache-backends. Either out-of-box for common ones (e.g. Redis) or simply provide some custom hooks/a class that can be used to configure the back-end (init, get, set, etc. the usual actions).
It's not very viable to store large amounts of data in-memory, and it's also not viable if you wanna cache something for a longer period of time (e.g. hours, days, weeks) since it clears on restart.
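To make the request concrete, here is a minimal sketch of what such a hook could look like. Everything in it is hypothetical: `CacheBackend`, `InMemoryBackend`, `FileBackend`, and the `ttl_seconds` parameter are illustrative names, not part of Prefect. The file-backed variant only shows that values can outlive a restart; a Redis-backed class would implement the same two methods.

```python
import hashlib
import pickle
import time
from pathlib import Path
from typing import Any, Dict, Optional, Protocol, Tuple


class CacheBackend(Protocol):
    """Hypothetical interface a swappable cache back-end would implement."""

    def get(self, key: str) -> Optional[Any]: ...

    def set(self, key: str, value: Any, ttl_seconds: Optional[float] = None) -> None: ...


def _expiry(ttl_seconds: Optional[float]) -> Optional[float]:
    # Absolute expiry timestamp, or None for "never expires".
    return None if ttl_seconds is None else time.time() + ttl_seconds


def _expired(expires_at: Optional[float]) -> bool:
    return expires_at is not None and time.time() >= expires_at


class InMemoryBackend:
    """Plain dict storage; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[Any, Optional[float]]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if _expired(expires_at):
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: Optional[float] = None) -> None:
        self._data[key] = (value, _expiry(ttl_seconds))


class FileBackend:
    """Pickles each entry to its own file, so entries survive a restart."""

    def __init__(self, directory: str) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so arbitrary strings map to safe file names.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self._dir / f"{digest}.pkl"

    def get(self, key: str) -> Optional[Any]:
        path = self._path(key)
        if not path.exists():
            return None
        with path.open("rb") as fh:
            value, expires_at = pickle.load(fh)
        if _expired(expires_at):
            path.unlink()
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: Optional[float] = None) -> None:
        with self._path(key).open("wb") as fh:
            pickle.dump((value, _expiry(ttl_seconds)), fh)
```

Note that in this sketch a cache miss and a cached `None` both come back as `None`; a real interface would probably want a sentinel or a separate `exists` check.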
Hi @fgblomqvist! Great points. Memory management could definitely be improved. We have something in-flight internally that may solve this problem in another way, so let me know what you think about this:
We started design talks internally for a "target API" (inspired somewhat by Make) which would store return values from tasks to a configurable persistent location such as a local file, an S3/GCS key, or any arbitrary persistence option (perhaps Redis). Since the serialized location will be known to the task not only in-memory (like the current cache implementation) but at task definition (for example, configured like @task(target=LocalFile('~/.prefect/results/{run_id}/{task_id}'))), a task or dependent task can pick up and deserialize a task's results from disk, whether in an entirely new Python process or within the same process that serialized it but threw away the Python reference to save memory.
This is basically a merger of the prefect "Cache" and prefect "ResultHandler" interfaces that effectively makes the cache persistent.
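As a rough illustration of the idea (a toy sketch, not the design we are building): the `LocalFile` class below borrows its name from the example above, but its methods and the `run_with_target` helper are made up for this comment. The point is only that a templated location known at definition time lets any process check for, read, or write a task's result.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


class LocalFile:
    """Toy target: a templated file path that holds one pickled result."""

    def __init__(self, template: str) -> None:
        self.template = template

    def path(self, **context: str) -> Path:
        # Fill in placeholders such as {run_id} and {task_id}, then expand "~".
        return Path(self.template.format(**context)).expanduser()

    def exists(self, **context: str) -> bool:
        return self.path(**context).exists()

    def write(self, value: Any, **context: str) -> None:
        path = self.path(**context)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as fh:
            pickle.dump(value, fh)

    def read(self, **context: str) -> Any:
        with self.path(**context).open("rb") as fh:
            return pickle.load(fh)


def run_with_target(fn: Callable[[], Any], target: LocalFile, **context: str) -> Any:
    """Hypothetical helper: skip the computation if the target already exists."""
    if target.exists(**context):
        return target.read(**context)
    value = fn()
    target.write(value, **context)
    return value
```

Calling `run_with_target(compute, LocalFile('~/.prefect/results/{run_id}/{task_id}'), run_id='r1', task_id='t1')` twice would run `compute` once and read from disk the second time, even from a different process.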
Do you think something like this would solve your problem? Since we are actively developing this internally now, I wanted to get your thoughts on it.
If not, we could work to independently pursue the memory management problem as written. As you mention, today the cache is trapped in memory as Python objects on the in-memory collection of State objects in prefect.context.caches. I imagine this is about greedily getting rid of any in-memory State._result.value and abstracting prefect.context.caches so that it can use a different backend than just a Python dict (as you mention, one that implements a common cache interface like .get, .set so it's swappable is best).
Thanks for the lengthy answer!
Interesting solution. While that would certainly solve the memory issue, it seems perhaps a tad too fine-grained to configure the exact storage on a task-per-task basis. I could imagine a simpler option being to set something like target_external=True (but with a better name) which would tell the flow running the task to write/read the result somewhere else than in-memory. Where exactly could be configured on a flow level (e.g. similarly to how you described it above). Now, you could just offer both options. What I had in mind was mainly the re-usability of the tasks in question (e.g. what if the task is set up to use S3 and you want to use it in a flow that just shouldn't have S3 access?).
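Roughly what I mean, as a sketch only: the `Task` and `Flow` classes here are stand-ins I made up, not Prefect's, and `persist_result` is a placeholder for the better-named `target_external` flag. The task only says whether its result should live outside memory; the flow decides where.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, MutableMapping, Optional


@dataclass
class Task:
    name: str
    fn: Callable[[], Any]
    # Placeholder for "target_external=True": the task opts in,
    # but does not know or care which storage is used.
    persist_result: bool = False


@dataclass
class Flow:
    name: str
    # Any dict-like store; a wrapper around S3 or Redis could be passed here.
    result_store: Optional[MutableMapping[str, Any]] = None
    tasks: List[Task] = field(default_factory=list)
    _memory: Dict[str, Any] = field(default_factory=dict)

    def run(self) -> Dict[str, Any]:
        results: Dict[str, Any] = {}
        for task in self.tasks:
            external = task.persist_result and self.result_store is not None
            store = self.result_store if external else self._memory
            key = f"{self.name}/{task.name}"
            if key not in store:
                store[key] = task.fn()
            results[task.name] = store[key]
        return results
```

The same `Task` object can then be reused in a flow with an S3-backed store and in one with no external store at all, where it just falls back to memory.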
I'm not a workflow wizard by any means and I know you might be designing with Cloud in mind, but these are my 2 cents at least :)
As for whether this stuff would solve my problem: I think so? Assuming there will still be some cache settings on the task (e.g. for how long to persist it) then yeah that sorta works. It would also go hand-in-hand with allowing the flow to set up the target, since then you could have separate caches/storages on a per-flow basis (which might be good in some situations).
@fgblomqvist Does the _new_ results interface that was introduced in 0.11.0 satisfy some of the points outlined here?
https://docs.prefect.io/core/concepts/results.html#how-to-configure-task-result-persistence
https://docs.prefect.io/core/idioms/targets.html
Looks like it has improved a bit :)
From the looks of it, it seems to satisfy all the things I mentioned. Feels very flexible which is great. While I am unfortunately no longer working on a project that uses Prefect, I'm sure this stuff will come in handy for others (and perhaps I'll be using Prefect at some point again). I think this issue can be closed.