Thanos: Action Plan for Thanos `query-frontend` (Response Caching, Splitting queries)

Created on 17 Apr 2020 · 26 Comments · Source: thanos-io/thanos

We need to add a new query-frontend subcommand, as agreed in https://thanos.io/proposals/202004_embedd_cortex_frontend.md/

This issue will track its progress. We have a few obstacles to solve first:

Work Plan

(See bottom of the list)

Help wanted for any of those tasks!

Labels: feature request / improvement, help wanted


All 26 comments

~Maybe good for Community Bridge? :thinking:~

Starting this now as we need this to unblock us internally

I am looking into some refactoring needed in Cortex that will also enable https://github.com/cortexproject/cortex/issues/2178 and maybe Loki caching https://grafana.com/blog/2020/02/05/new-feature-in-loki-v1.3-the-query-frontend/

Do you know @pracucci , @owen-d if Loki is still interested in reusing this code?

I haven't taken a look yet, but I know @cyriltovena is currently working on the Loki results cache implementation.

Could you elaborate on the following? I'm not sure what it means.

Potentially on the way - separate from queryrange only

I mean that we have cases for caching more than just queryrange requests, but that's out of scope.

I talked to @cyriltovena offline and got nice feedback :hugs:

Hey @bwplotka! I am interested in contributing to this proposal! While this seems like quite a lot of work, I would like to try this subpart -

Refactor IndexCacheConfig to generic cache config so we can reuse. Make it implement Cortex cache.Cache interface.

As I am quite new to the Thanos community, if you could provide some resources to get started with this, it would be really helpful 🤓.

Hey @bwplotka! I am interested in contributing to this proposal! While this seems like quite a lot of work, I would like to try this subpart -

Refactor IndexCacheConfig to generic cache config so we can reuse. Make it implement Cortex cache.Cache interface.

As I am quite new to the Thanos community, if you could provide some resources to get started with this, it would be really helpful 🤓.

In #2532 I have just introduced a new Cache interface (similar, but not quite the same as the one in Cortex), used in the caching bucket. My plan is to eventually reuse it for IndexCache as well, but that's not currently a high priority for me. It would, however, be good to build on top of the same Cache interface.

@bwplotka regarding your first action item:

Refactor IndexCacheConfig to generic cache config so we can reuse. Make it implement Cortex cache.Cache interface.

Is there a reason to use Cortex's Cache interface? The new interface I've introduced in #2532 is similar, but not the same. Cortex:

type Cache interface {
    Store(ctx context.Context, key []string, buf [][]byte)
    Fetch(ctx context.Context, keys []string) (found []string, bufs [][]byte, missing []string)
    Stop()
}

New interface in Thanos:

type Cache interface {
    Store(ctx context.Context, data map[string][]byte, ttl time.Duration)
    Fetch(ctx context.Context, keys []string) (found map[string][]byte, missing []string)
}

As you can see, it uses maps instead of slices to pass values around, which imho makes it easier to understand and work with. It also adds a ttl parameter when storing values, as some caches support that (e.g. the memcached client in Thanos). It also gets rid of the Stop method, which is an implementation leak more than a necessary part of the interface.

If unification is important for us, we can of course move to the Cortex interface, or improve (imho) the cache interface and implementations in Cortex to look more like this new one :)

/cc @pracucci

One more thing I wanted to mention. The Cache interface introduced in #2532 explicitly states that the Store method may retain (!) the values passed in the data field. This makes it possible to optimize cache implementations, and the client knows to be careful about making copies when needed.

While it's not "safe" from a usage point of view, making defensive copies inside Cache implementations would be quite inefficient in cases where they are not needed.

Is there a reason to use Cortex's Cache interface?

The Thanos query-frontend will be based on the Cortex one, but will use Thanos caching clients instead of the Cortex ones (which makes sense). To make this possible, we need the Thanos caching clients to (also) implement the same interface as Cortex.

I would suggest designing the Thanos cache interface considering Thanos use cases and not worrying too much about Cortex compatibility. I'm not sure it's a wise move to change the cache interface in Cortex (to be discussed), but we can always write a simple wrapper to make the Thanos cache Cortex-compatible.

Agree with @pracucci, we can always create adapters if needed (:

I ended up not having time for this last week or this week. So I'm considering putting this up as a last-minute Community Bridge project (:

@bwplotka Just applied on Community Bridge

Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€—
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Still valid.


Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€—
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Still valid.

Hello :wave:
I can start working on this issue since it seems no one is on it.

I investigated Cortex frontend code recently and found that the core is to port https://github.com/cortexproject/cortex/blob/master/pkg/cortex/modules.go#L372-L388 to Thanos.

Since tripperware is just a function, we can reuse the same code in Thanos and remove the unnecessary Cortex code easily.

The current work plan in this issue is clear, so I am going to work on updating Cortex upstream and implementing cache.Cache interface for Thanos IndexCache first, since these two tasks seem independent.

I opened a draft PR on Cortex to remove global registries and loggers: https://github.com/cortexproject/cortex/pull/2903. Any help or review would be appreciated!

We are talking about splitting queries with @yeya24, and I think we are more and more fans of this approach of splitting queries despite some lack of block awareness. We have many ways to optimize those cases (e.g. cache + singleflight).

Hello πŸ‘‹ Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! πŸ€—
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Work in progress. cc @yeya24

I think the first iteration is almost done. The only thing left is Memcached support.

For the next iteration, what features do we want to implement/have? I can take a look soon. What I can think of:

  1. Cache all endpoints. Loki already does this; it is doable and not much code is needed.
  2. Dynamic split interval (doable now since we already support this in Cortex), but I think we need a way to benchmark downsampled blocks with high split intervals.
  3. Cache instant queries. We talked about caching instant queries with offsets, but I think a more detailed plan for this might be better.
  1. Cache all endpoints. Loki already does this; it is doable and not much code is needed.

This should be on top of our list.

For the rest, we can update the issue with a checklist and keep this issue open to track the progress. WDYT?

  1. Cache all endpoints. Loki already does this; it is doable and not much code is needed.

This should be on top of our list.

We might be blocked by https://github.com/thanos-io/thanos/issues/3119. I can start working on splitting/caching the Series API, but I am not sure how much benefit we can get from caching/splitting all endpoints.

For the rest, we can update the issue with a checklist and keep this issue open to track the progress. WDYT?

Sounds good. But I don't have edit permission for this issue.

About Series API caching: it is not trivial, because the response doesn't tell you which time range the returned list of series belongs to.

This means reusing a cache entry can only be done if the cache item fully covers the requested range.
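That coverage constraint can be stated in a few lines. The sketch below uses hypothetical names (`seriesEntry`, `usable`); the essential point is that without per-series timestamps in the response, partial overlap is never safe to reuse.

```go
package main

import "fmt"

// seriesEntry is a hypothetical cache entry for a Series API response,
// remembering the time range the cached list of series was computed for.
type seriesEntry struct {
	MinT, MaxT int64 // milliseconds; half-open [MinT, MaxT)
	Series     []string
}

// usable reports whether the cached entry can answer a request for
// [reqMinT, reqMaxT): only when the cached range fully covers it, since
// the response carries no information about when each series was present.
func (e seriesEntry) usable(reqMinT, reqMaxT int64) bool {
	return e.MinT <= reqMinT && reqMaxT <= e.MaxT
}

func main() {
	e := seriesEntry{MinT: 0, MaxT: 3_600_000, Series: []string{`up{job="thanos"}`}}
	fmt.Println(e.usable(600_000, 1_200_000)) // fully inside the cached range: true
	fmt.Println(e.usable(600_000, 7_200_000)) // extends past the cached range: false
}
```

Note that even a covering entry may over-report: a series present anywhere in the cached range is returned even if it only existed outside the narrower requested window, which is the "too many series" worst case mentioned later in this thread.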

About Series API caching: it is not trivial, because the response doesn't tell you which time range the returned list of series belongs to.

This means reusing a cache entry can only be done if the cache item fully covers the requested range.

Hi, thank you for the reply. You are right, and I think this is what I forgot to consider; I will open an issue to follow up.

I am curious how Loki solves this; is this what you mentioned in https://github.com/grafana/loki/pull/2202#discussion_r447139926?

As I mentioned in the Loki community meeting, the Series API was never super accurate when used with Grafana dashboards for variables. If we were to use cached results, in the worst case it would return too many series.

The same problem exists for the Labels API, etc.
