We are missing a caching layer for Querier.
There are multiple design choices we need to make:
AC:
My plan is to start a yolo PoC for this while reusing the awesome code that @tomwilkie created for Cortex: https://sourcegraph.com/github.com/cortexproject/cortex@1d0ff216199e43b7b221774b5cd56936e7d22440/-/blob/pkg/querier/frontend/frontend.go#L103 I hope I can just import it and "run" it =D, but I will probably bump into import issues; we will see.
Initial thoughts? Feedback? This issue is mostly for tracking; a proper proposal will come after a short spike.
Initial thoughts:
Should it be built into Querier, or be a separate proxy?
Built-in as a first step, as we might want to make it more complex in the future, plus it already alters a query a bit (chops it, aligns it) - e.g. mixed caching of results for the Query API and Store API. We can always produce a proxy-like component in the future.
Should we use a memcached backend? Should we support any others?
I have had a very good experience with Memcached so far; we used it everywhere. Again, there will be dependency hell and code scope creep if we allow ANY backend, so we need to be careful.
How should we structure cache items?
:man_shrugging: Need to dive into the Cortex and Trickster caches.
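Pending the dive into Cortex and Trickster, one plausible shape is to key cached results by tenant, query expression, aligned bucket start, and step, hashed to stay within memcached key limits. A hedged sketch; the layout and the `cacheKey` helper are purely illustrative:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// cacheKey derives a short, deterministic key from the query identity.
// Hashing keeps arbitrary-length PromQL expressions within backend key
// limits. All field choices here are hypothetical.
func cacheKey(tenant, query string, bucketStart, step int64) string {
	raw := fmt.Sprintf("%s\x00%s\x00%d\x00%d", tenant, query, bucketStart, step)
	h := sha256.Sum256([]byte(raw))
	return fmt.Sprintf("qr:%x", h[:8])
}

func main() {
	fmt.Println(cacheKey("tenant-a", `sum(rate(http_requests_total[5m]))`, 86400, 60))
}
```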
Should we cache Query API results or actually Store API results?
Results are an easy win for now, but I feel like something in the middle (caching PromQL evaluations) might be better. I think we should start with Query API results, benchmark, and iterate. It is also worth syncing with the Cortex folks on this - they are solving the same problem.
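Caching Query API results usually implies splitting a long range into fixed-size sub-ranges (Cortex splits by day), so that partial cache hits are possible and only missing pieces go upstream. A rough sketch of the splitting step, using plain second timestamps; `splitByInterval` is a hypothetical helper:

```go
package main

import "fmt"

// splitByInterval splits a [start, end) range into sub-ranges of at most
// `interval` width, so each piece can be cached and fetched independently.
// Illustrative sketch of the idea, not the actual Cortex implementation.
func splitByInterval(start, end, interval int64) [][2]int64 {
	var out [][2]int64
	for s := start; s < end; s += interval {
		e := s + interval
		if e > end {
			e = end
		}
		out = append(out, [2]int64{s, e})
	}
	return out
}

func main() {
	day := int64(24 * 3600)
	// A range a little over two days long yields two full-day pieces
	// plus one short remainder.
	for _, r := range splitByInterval(0, 2*day+100, day) {
		fmt.Println(r[0], r[1])
	}
}
```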
Should we do it near the Query API, or on the federated Querier as well?
Query API for now, as we care about caching results as a first step.
Should we just NOT do it and leave this fully to Trickster: https://github.com/Comcast/trickster?
IMO, no, as Trickster is not working well for users, mostly because it lacks an understanding of the partial-response strategies Querier allows. We would also be forced to use results caching only.
From my observation the two most popular external network/cluster cache protocols right now are memcached and Redis. Both have support for self-hosting and cloud providers offer them as a service.
I would suggest sticking to external caching of data that could be shared between multiple query instances.
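One way to get shared external caching without dragging every client library into scope is a small cache interface: a memcached or Redis client would satisfy it and let multiple Querier instances share results. A hypothetical sketch, with an in-memory stand-in so the example runs without a real server; the `Cache` interface and `mapCache` names are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Cache is a minimal interface that both an in-process map and an external
// memcached/Redis client could satisfy. Only the external backends allow
// sharing between query instances.
type Cache interface {
	Set(key string, value []byte) error
	Get(key string) ([]byte, error)
}

// ErrMiss signals a cache miss, regardless of backend.
var ErrMiss = errors.New("cache miss")

// mapCache is an in-memory stand-in used here so the example runs
// without a memcached server.
type mapCache struct {
	mu sync.Mutex
	m  map[string][]byte
}

func newMapCache() *mapCache { return &mapCache{m: map[string][]byte{}} }

func (c *mapCache) Set(key string, value []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = value
	return nil
}

func (c *mapCache) Get(key string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[key]
	if !ok {
		return nil, ErrMiss
	}
	return v, nil
}

func main() {
	var c Cache = newMapCache()
	c.Set("qr:abc", []byte(`{"status":"success"}`))
	v, err := c.Get("qr:abc")
	fmt.Println(string(v), err == nil)
}
```

Restricting support to one or two backends behind such an interface keeps the dependency surface small while leaving room to add others later.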
Trickster works reasonably OK. There is a `next` branch that should in theory improve a number of things. But, I agree, it doesn't really understand the data model, plus as a "dumb" cache, it can't do cache eviction.

Not bad in terms of deps I guess.
The work to embed caching in Querier is potentially no longer needed, as you can run the Cortex query-frontend on top of any Prometheus query-range API (: You can learn more about this in the meetup video here
It's definitely the way to go; we have already started to run this in production with Thanos (: We are now discussing the possibility of moving query-frontend to a separate, neutral project: here
This will have many benefits, e.g. it will allow us to properly document this and recommend using it. We can also definitely discuss the possibility of embedding this logic inside Querier, but that would be much easier if query-frontend were a separate project (dependencies).