Thanos: Allow setting retention per metric (e.g rule aggregation)

Created on 11 Mar 2019  ·  36 Comments  ·  Source: thanos-io/thanos

In an ideal world, retention is not necessary at the downsampling/raw level, but at the aggregation level.

We need a way to support this in an LTS (long-term storage) system like Thanos.

AC:

  • User can specify certain aggregation to be retained longer than others
  • Design doc (proposal is in place)

It comes down to the fact that you ideally want per-metric retention in the compactor. This is somewhat related to delete_series, as it might involve a block rewrite in edge cases... We need to design this.

Thoughts @improbable-ludwik @brancz @devnev @domgreen

Most helpful comment

This is being worked on in Prometheus; once done there, I would say we can implement it here with the same semantics and configuration.

All 36 comments

There have been discussions on per time-series retention on upstream Prometheus before, I think at least having a discussion with the team is worth it, just to see if there are any insights from back then.

proposal is in place

What do you mean by this? As far as I can tell there is no design written up anywhere, but may very well have missed it.

Off the top of my head, this could be a configuration combining a Prometheus-style label selector with a respective rule for which resolution to keep for how long.

As a whole this is definitely not trivial, but I agree much needed.

Sorry - no, a proposal has to be in place; that's what I meant.

(: A relabel-like config makes sense, but essentially we are talking about a rewrite in the compactor for this, right?

Configuration is a technicality. I'm not entirely sure relabelling would work exactly, but something close to that, probably yes. I agree the compactor is the component that needs to take care of this by rewriting blocks.

With the federation system we have in place now, we have trained users that metrics they want preserved (federated) must use a specific recording-rule-style name so that they get federated. We would love to continue this practice with Thanos and only retain metrics matching a specific format for an extended period of time.

Some offline discussions revealed that users still do it with Thanos, but using federation + Thanos on top.

The point was to shard ingestion and execute recording rules against each sharded ingestion gateway (which runs with minimal retention).  We have some very high-cardinality metrics and centralized ingestion alone was prohibitively expensive.  That prometheus federation layer allows us to compute/ingest aggregates (without retaining the raw metrics).

I think we should aim to allow users to avoid this, and I cannot see an immediate blocker for that, other than a more complex system and queries fetching data with some lag (rule evaluation lag + federated scrape interval).

Just saw these discussion threads on a potential per-metric retention feature for the compactor. Is this feature still wanted on the compactor side?

Actually, we have the same per-metric retention requirement in our business scenario. We are trying to implement a policy-based retention function to replace the compactor's current retention function. Our idea is to provide a policy config file in which users can specify a PromQL-style selector to define the retention time for certain metrics. Below is a sample policy config file:

policies:
  - expr: "{}"
    retentions:
      res-raw: 180d
      res-5m: 240d
      res-1h: 400d
  - expr: "{__name__=\"^go_memstats_.*\"}"
    retentions:
      res-raw: 90d
      res-5m: 180d
      res-1h: 360d
  - expr: "{__name__=\"go_memstats_gc_cpu_fraction\"}"
    retentions:
      res-raw: 200d
      res-5m: 300d
      res-1h: 400d

I would like to know your thoughts on this idea.
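To make the intended behavior of such a policy file concrete, here is a minimal sketch (not actual Thanos code; the "most specific selector wins, falling back to the catch-all {}" precedence is an assumption for illustration) of resolving the retention map for a given metric name:

```python
import re

# Policies from the sample config above, ordered most specific first
# (ordering/precedence semantics are an assumption, not defined in the thread).
# Each entry: (metric-name regex, or None for the catch-all "{}", retentions in days).
POLICIES = [
    (r"go_memstats_gc_cpu_fraction$", {"res-raw": 200, "res-5m": 300, "res-1h": 400}),
    (r"^go_memstats_.*",              {"res-raw": 90,  "res-5m": 180, "res-1h": 360}),
    (None,                            {"res-raw": 180, "res-5m": 240, "res-1h": 400}),
]

def retention_for(metric_name: str) -> dict:
    """Return the retention map of the first matching policy."""
    for pattern, retentions in POLICIES:
        if pattern is None or re.search(pattern, metric_name):
            return retentions
    raise ValueError("no catch-all policy configured")
```

With this precedence, go_memstats_gc_cpu_fraction gets the longest raw retention (200d), other go_memstats_* series get 90d, and everything else falls back to the 180d default.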

Just an FYI - not retaining raw data will lead to problems (you won't be able to zoom into your metrics anymore): https://thanos.io/components/compact.md/#downsampling-resolution-and-retention:

In other words, if you set --retention.resolution-raw less than --retention.resolution-5m and --retention.resolution-1h, you might run into a problem of not being able to "zoom in" to your historical data.
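For reference, these are the compactor's existing global (not per-metric) retention flags; the durations below are just example values, with raw retention kept at least as long as the downsampled retentions so zooming in stays possible:

```shell
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=365d \
  --retention.resolution-5m=365d \
  --retention.resolution-1h=365d
```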

@wogri I think you are using Grafana for visualization. Check out PR https://github.com/grafana/grafana/pull/19121. At the moment this PR is not in a Grafana release, so I use master.
I added three Prometheus datasources, each with a different max_source_resolution parameter. Now I can switch to the best resolution and zoom into my metrics. I also disabled auto-downsampling in the Thanos query component, because it doesn't work very well.
If you are using rates, you should make your interval flexible: at one-hour resolution you should set the time range to two hours.
I think I will write a Thanos PR with some more explanation once Grafana ships a release containing the above PR.
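The point about rates follows from rate() needing at least two samples inside its range; against 1h downsampled data that means a range of 2h or more (http_requests_total is just an illustrative metric name, not from the thread):

```promql
# Fine against raw data (e.g. 30s scrape interval):
rate(http_requests_total[5m])

# Against 1h-resolution downsampled data, the range must cover >= 2 samples:
rate(http_requests_total[2h])
```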

Thanks @Reamer!


@bwplotka any thoughts on this idea to support per metric retention on compactor?

Extra context can be found here: https://github.com/prometheus/prometheus/issues/1381

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This is being worked on in Prometheus; once done there, I would say we can implement it here with the same semantics and configuration.

The current plan is to first tackle https://github.com/thanos-io/thanos/issues/1598 and then try to implement this. Marking this issue as a GSoC project as well.

This is being worked on in Prometheus; once done there, I would say we can implement it here with the same semantics and configuration.

The current decision is that Prometheus will not implement this, and the work has to be done externally first. It would be nice, though, if our work could be reused for vanilla Prometheus as well (as usual).

Hey @bwplotka, I would like to work on this, and it would be very helpful if you could suggest some resources to get started.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

We need that still (:



bump


Hey @bwplotka, I am interested in working on this issue for GSoC. Can you please help me get started by suggesting some resources?

First, we need your proposal in the GSoC system. Have you gone through the official website proposal process? (:


Yes, thanks for asking I just submitted the proposal :)

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

Kind of part of CS


CS?

Community Bridge, sorry (:


@bwplotka this will be very useful for us. Is this finalized to be part of Community Bridge? If so, we will wait; if not, I can contribute.

With the federation system we have in place now, we have trained users that metrics they want preserved (federated) must use a specific recording-rule-style name so that they get federated. We would love to continue this practice with Thanos and only retain metrics matching a specific format for an extended period of time.

We have the same setup (specific recording rule regex to get your metric federated for long-term retention). When switching to Thanos, I settled for the following:

  • Have primary thanos rule instances generating recording rules from the sidecars into the main 15d/30d raw retention bucket, and sending alerts to alertmanager
  • Have a second set of thanos rule instances, call them long-term recording rule instances (only executing record: rules, not alert:) generating duplicate recording rules pushed into a long-term retention bucket (I picked raw 1y)

As a result, my Federation Prometheus pollers are replaced with a Thanos Query instance which is pointing to the long-term rule and bucket.

I haven't seen any problematic performance impact, but there are probably extra calls to the sidecar with this solution.

If you have CPU/memory capacity to burn, I think you could set up recording rules plus Thanos rule/store/query daemons arranged like the repeaters/relays I described above, to copy specific metrics into buckets with different retention/compactor settings.
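The two-tier ruler setup described above can be sketched as two thanos rule invocations writing to different object-storage buckets (hostnames, file paths, and bucket configs below are illustrative assumptions, not from the thread):

```shell
# Primary ruler: all record:/alert: rules, ships blocks to the short-retention
# (15d/30d raw) bucket and sends alerts to Alertmanager.
thanos rule \
  --data-dir=/var/thanos/rule-primary \
  --rule-file=/etc/thanos/rules/*.yaml \
  --query=query.example.svc:9090 \
  --alertmanagers.url=http://alertmanager.example.svc:9093 \
  --objstore.config-file=bucket-30d.yml

# Long-term ruler: only record: rules (no alert: rules), writing duplicate
# recording-rule output to a bucket whose compactor keeps raw data ~1y.
thanos rule \
  --data-dir=/var/thanos/rule-longterm \
  --rule-file=/etc/thanos/rules-longterm/*.yaml \
  --query=query.example.svc:9090 \
  --objstore.config-file=bucket-1y.yml
```

A Thanos Query instance pointed at the long-term ruler and its bucket then replaces the federation Prometheus pollers.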


Still needed! :)


Still valid... :)


Still needed.

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

Still valid :)
