Thanos: Response caching for Thanos

Created on 15 Oct 2019 · 17Comments · Source: thanos-io/thanos

This is an umbrella issue for things we might need to do/contribute in order to have smooth Thanos response caching.

As you probably know Cortex query-frontend now supports Prometheus API, so - it can be put on top of Thanos. Still, there are some cases we have to/can improve for Thanos usage:

Must:

[ ] Understanding underlying block structure and downsampling handling. Query frontend splits by day. This does not make sense for our downsampling which helps to run queries for large time ranges. Also split by day when querying 2w block might be not that efficient as most of the index lookup has to be 14x duplicated. However, splitting allows for easier load balancing and distributing the queries across system. This means that we might want maybe a bit smarter query split. Some idea would be:

// thanosQuerySplitIntervalFn returns split interval based on query step. This is based on the fact that
// Thanos in auto-downsampling mode will choose also use step when choosing what resolution to use.
//
// Splitting rationales: 
// Cortex splits by day and it's quite a good ratio of chunks per series to fetch. For 1 day we will have approx 24 chunks/series (30s scrape). Also 
// To max keep 24 chunks per series for 5m resolution it means 10 days split, for 1h resolution 120 days.
var thanosIntervalFn = func(r Request)time.Duration {
    step := time.Duration(r.Step()) * time.Millisecond
    if step > 30 * time.Minute {
        return 120 * day
    }
    if step > 5 * time.Hour {
        return 10 * day
    }
    return day
}

This might be some improvement, but still, there are some edge cases:

Fallback to raw chunks if downsampled blocks are still missing
Downsampled blocks appear only after 40h (5m) and 10d (1h), so this logic might need to consider time as well.
Split by day efficiency might differ across time, depending on how many blocks are touched.

I think there is still no clear idea of what's the best practice there. Ideally, we would split based on the underlying block structure, but there is no easy way to have that info during query planning level.

Maybe treating query splitting and splitting response a separate steps make more sense? Right now response caching relies on splitting as the cache key depends on queried start time. This is to avoid lookup logic in the caching layer.

[ ] Allow setting custom query params to Cortex frontend
- [ ] Auto downsampling has to be used for the above interval based on step logic.
- [x] Partial response must NOT be cached. This can be achieved by simply setting --query.nopartial-response` however we can also add some logic for caching to be avoided if the response has some special header.
- NOTE that currently partial response is not assumed if StoreAPI is down for short time for some reason (INFO not responding, Service discovery not discovering it for some time). We might need to change this to assume partial response in those cases as well to avoid the wrong cached results. The alternative is to invalidate the cache in those cases. Which below point enables.
[ ] Adjust for StoreAPI changes: Expose on Querier some info_hash HTTP API that will be used by cacher to get hash and use it as cache key? How to ensure we are up to date with store API changes? Maybe short TTL is fine?

Could:
~* [ ] Discuss cortex to move query-frontend caching logic out of Cortex project: https://github.com/cortexproject/cortex/issues/1672~

[x] Use it inside Querier directly: https://github.com/thanos-io/thanos/issues/1006 as a library.

Anything I missed? Feedback, ideas, and help wanted (:

query-frontend feature request / improvement help wanted stale

Source

bwplotka

👍16

All 17 comments

Updated this with recent findings and ideas. Design doc and MVP in progress (:

bwplotka on 17 Nov 2019

I feel like we should return a partial response if there were StoreAPI nodes which were added without a discovery mechanism and they were unhealthy i.e. no response to Info(), kind of like static scrape targets in Prometheus. We shouldn't do the same for dynamically detected nodes since they can go away easily. @bwplotka does this logic sound good?

GiedriusS on 20 Jan 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale[bot] on 27 Feb 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

stale[bot] on 28 Mar 2020

We should document Cortex's query-frontend as the recommended solution + how to enable the most usual use-cases with --store-strict.

GiedriusS on 13 Apr 2020

Yes. But there might be even better solution. Issue coming.

On Mon, 13 Apr 2020, 22:47 Giedrius Statkevičius, notifications@github.com
wrote:

We should document Cortex's query-frontend as the recommended solution +
how to enable the most usual use-cases with --store-strict.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/thanos-io/thanos/issues/1651#issuecomment-613113957,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABVA3O7ZJJDD5QR7XNPHG4TRMOB65ANCNFSM4JA27Y7Q
.

bwplotka on 14 Apr 2020

Proposal that should solve your concern @GiedriusS (and mine as well): https://github.com/thanos-io/thanos/pull/2434

bwplotka on 14 Apr 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] on 14 May 2020

Will be done as part of https://thanos.io/proposals/202004_embedd_cortex_frontend.md/

bwplotka on 18 May 2020

stale[bot] on 17 Jun 2020

in progress!

On Wed, 17 Jun 2020 at 19:15, stale[bot] notifications@github.com wrote:

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or
needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we
can always reopen an issue if we need!). Alternatively, use remind command
https://probot.github.io/apps/reminders/ if you wish to be reminded at
some point in future.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/thanos-io/thanos/issues/1651#issuecomment-645538197,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABVA3O5OGHQX4XU5BDWCWHTRXEB2VANCNFSM4JA27Y7Q
.

bwplotka on 18 Jun 2020

🎉1

stale[bot] on 18 Jul 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale[bot] on 25 Jul 2020

https://github.com/thanos-io/thanos/pull/3018 is merged so I think it is time to get back to this issue.

Just opened a pr on the Cortex side to support dynamic split interval https://github.com/cortexproject/cortex/pull/3066. It would be great if anyone can help review it!

@bwplotka Would you mind updating the issue checklist? I think Partial response must NOT be cached. is already done.

yeya24 on 21 Aug 2020

👍1

Updates on this issue:

Auto downsampling has to be used for the above interval based on step logic. and Allow setting custom query params to Cortex frontend should be done.

For the dynamic split interval for downsampled block, sorry I don't really understand how to calculate 10d, 120d, 30m and 5h here, can you please explain it? Since we already support getting max_source_resolution directly, we can use it here directly I think.

var thanosIntervalFn = func(r Request)time.Duration {
    step := time.Duration(r.Step()) * time.Millisecond
    if step > 30 * time.Minute {
        return 120 * day
    }
    if step > 5 * time.Hour {
        return 10 * day
    }
    return day
}

Fallback to raw chunks if downsampled blocks are still missing

How does query frontend know that downsampled blocks are missing? I think we don't need to deal with this on query frontend.

Downsampled blocks appear only after 40h (5m) and 10d (1h), so this logic might need to consider time as well.

This is something we talked about offline, but it is not very clear. How does query frontend know that there are downsampled blocks in bucket? I think it is not doable.

yeya24 on 4 Sep 2020

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale[bot] on 3 Nov 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale[bot] on 20 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings