Thanos: Allow setting retention per metric (e.g rule aggregation)

Created on 11 Mar 2019  ·  36 Comments  ·  Source: thanos-io/thanos

In an ideal world, retention is not necessary at the downsampling/raw level, but at the aggregation level.

We need a way to support this in an LTS (long-term storage) system like Thanos.

AC:

  • User can specify certain aggregation to be retained longer than others
  • Design doc (proposal is in place)

It comes down to the fact that you ideally want per-metric retention in the compactor. This is somewhat related to delete_series, as it might involve a block rewrite in edge cases... We need to design this.

Thoughts @improbable-ludwik @brancz @devnev @domgreen

Most helpful comment

This is being worked on in Prometheus; once done there, I would say we can implement it here with the same semantics and configuration.

All 36 comments

There have been discussions on per time-series retention on upstream Prometheus before, I think at least having a discussion with the team is worth it, just to see if there are any insights from back then.

proposal is in place

What do you mean by this? As far as I can tell there is no design written up anywhere, but may very well have missed it.

Off the top of my head, this could be a configuration combining a Prometheus-style label selector with a respective rule for which resolution to keep for how long.

As a whole this is definitely not trivial, but I agree much needed.

Sorry - no, a proposal has to be in place; that's what I meant.

(: A relabel-like config makes sense, but essentially we are talking about a rewrite in the compactor for this, right?

Configuration is a technicality. I'm not entirely sure relabelling would work exactly, but something close to that, probably yes. I agree the compactor is the component that needs to take care of this by rewriting blocks.

With the federation system we have in place now, we have trained users that metrics they want preserved (federated) must use a specific recording-rule-style name so that they get federated. We would love to continue this practice with Thanos and only retain metrics matching a specific format for an extended period of time.

Some offline discussions revealed that users still do it with Thanos, but using federation + Thanos on top.

The point was to shard ingestion and execute recording rules against each sharded ingestion gateway (which runs with minimal retention).  We have some very high-cardinality metrics and centralized ingestion alone was prohibitively expensive.  That prometheus federation layer allows us to compute/ingest aggregates (without retaining the raw metrics).

I think we should aim to allow users to avoid this, and I cannot see an immediate blocker for that, other than a more complex system and queries fetching data with some lag (rule evaluation lag + federated scrape interval).

Just saw these discussion threads on a potential per-metric retention feature for the compactor. Is this feature still wanted on the compactor side?

Actually, we have the same per-metric retention requirement in our business scenario. We are trying to implement a policy-based retention function to replace the compactor's current retention function. Our idea is to provide a policy config file in which users can specify a PromQL-style selector to define the retention time for certain metrics. Below is a sample policy config file:

policies:
  - expr: "{}"
    retentions:
      res-raw: 180d
      res-5m: 240d
      res-1h: 400d
  - expr: "{__name__=\"^go_memstats_.*\"}"
    retentions:
      res-raw: 90d
      res-5m: 180d
      res-1h: 360d
  - expr: "{__name__=\"go_memstats_gc_cpu_fraction\"}"
    retentions:
      res-raw: 200d
      res-5m: 300d
      res-1h: 400d

I would like to know your thoughts on this idea.
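To make the intended behavior of such a policy file concrete, here is a minimal sketch (not actual Thanos code; the "most specific selector wins, falling back to the catch-all {}" precedence is an assumption for illustration) of resolving the retention map for a given metric name:

```python
import re

# Policies from the sample config above, ordered most specific first
# (ordering/precedence semantics are an assumption, not defined in the thread).
# Each entry: (metric-name regex, or None for the catch-all "{}", retentions in days).
POLICIES = [
    (r"go_memstats_gc_cpu_fraction$", {"res-raw": 200, "res-5m": 300, "res-1h": 400}),
    (r"^go_memstats_.*",              {"res-raw": 90,  "res-5m": 180, "res-1h": 360}),
    (None,                            {"res-raw": 180, "res-5m": 240, "res-1h": 400}),
]

def retention_for(metric_name: str) -> dict:
    """Return the retention map of the first matching policy."""
    for pattern, retentions in POLICIES:
        if pattern is None or re.search(pattern, metric_name):
            return retentions
    raise ValueError("no catch-all policy configured")
```

With this precedence, go_memstats_gc_cpu_fraction gets the longest raw retention (200d), other go_memstats_* series get 90d, and everything else falls back to the 180d default.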

Just an FYI - not retaining raw data will lead to problems (you won't be able to zoom into your metrics anymore): https://thanos.io/components/compact.md/#downsampling-resolution-and-retention:

In other words, if you set --retention.resolution-raw less than --retention.resolution-5m and --retention.resolution-1h, you might run into a problem of not being able to "zoom in" to your historical data.
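For reference, these are the compactor's existing global (not per-metric) retention flags; the durations below are just example values, with raw retention kept at least as long as the downsampled retentions so zooming in stays possible:

```shell
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=365d \
  --retention.resolution-5m=365d \
  --retention.resolution-1h=365d
```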

@wogri I think you are using Grafana for visualization. Check out PR https://github.com/grafana/grafana/pull/19121. At the moment this PR is not in a Grafana release, so I use master.
I added three Prometheus datasources, each with a different max_source_resolution parameter. Now I can switch to the best resolution and zoom into my metrics. I also disabled auto-downsampling in the Thanos query component, because it doesn't work very well.
If you are using rates, you should make your interval flexible: at one-hour resolution you should set the time range to two hours.
I think I will write a Thanos PR with some more explanation once Grafana ships a release containing the above PR.
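The point about rates follows from rate() needing at least two samples inside its range; against 1h downsampled data that means a range of 2h or more (http_requests_total is just an illustrative metric name, not from the thread):

```promql
# Fine against raw data (e.g. 30s scrape interval):
rate(http_requests_total[5m])

# Against 1h-resolution downsampled data, the range must cover >= 2 samples:
rate(http_requests_total[2h])
```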

Thanks @Reamer!


@bwplotka any thoughts on this idea to support per metric retention on compactor?

Extra context can be found here: https://github.com/prometheus/prometheus/issues/1381

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This is being worked on in Prometheus; once done there, I would say we can implement it here with the same semantics and configuration.

The current plan is to first tackle https://github.com/thanos-io/thanos/issues/1598 and then try to implement this. Marking this issue as a GSoC project as well.

This is being worked on in Prometheus; once done there, I would say we can implement it here with the same semantics and configuration.

The current decision is that Prometheus will not implement this, and the work has to be done externally first. It would be nice, though, if our work could be reused for vanilla Prometheus as well (as usual).

Hey @bwplotka, I would like to work on this, and it would be very helpful if you could suggest some resources to get started.

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

We need that still (:



bump


Hey @bwplotka, I am interested in working on this issue for GSoC. Can you please help me get started by suggesting some resources?

First, we need your proposal in the GSoC system. Have you gone through the official website proposal process? (:


Yes, thanks for asking I just submitted the proposal :)

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

Kind of part of CS


CS?

Community Bridge, sorry (:


@bwplotka this will be very useful for us. Is this finalized to be part of Community Bridge? If so, we will wait; if not, I can contribute.

With the federation system we have in place now, we have trained users that metrics they want preserved (federated) must use a specific recording-rule-style name so that they get federated. We would love to continue this practice with Thanos and only retain metrics matching a specific format for an extended period of time.

We have the same setup (specific recording rule regex to get your metric federated for long-term retention). When switching to Thanos, I settled for the following:

  • Have primary thanos rule instances generating recording rules from the sidecars into the main 15d/30d raw retention bucket, and sending alerts to alertmanager
  • Have a second set of thanos rule instances, call them long-term recording rule instances (only executing record: rules, not alert:) generating duplicate recording rules pushed into a long-term retention bucket (I picked raw 1y)

As a result, my Federation Prometheus pollers are replaced with a Thanos Query instance which is pointing to the long-term rule and bucket.

I haven't seen any problematic performance impact, but there are probably extra calls to the sidecar with this solution.

If you have CPU/memory capacity to burn, I think you could set up recording rules plus Thanos rule/store/query daemons arranged like the repeaters/relays I described above, to copy specific metrics into buckets with different retention/compactor settings.
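The two-tier ruler setup described above can be sketched as two thanos rule invocations writing to different object-storage buckets (hostnames, file paths, and bucket configs below are illustrative assumptions, not from the thread):

```shell
# Primary ruler: all record:/alert: rules, ships blocks to the short-retention
# (15d/30d raw) bucket and sends alerts to Alertmanager.
thanos rule \
  --data-dir=/var/thanos/rule-primary \
  --rule-file=/etc/thanos/rules/*.yaml \
  --query=query.example.svc:9090 \
  --alertmanagers.url=http://alertmanager.example.svc:9093 \
  --objstore.config-file=bucket-30d.yml

# Long-term ruler: only record: rules (no alert: rules), writing duplicate
# recording-rule output to a bucket whose compactor keeps raw data ~1y.
thanos rule \
  --data-dir=/var/thanos/rule-longterm \
  --rule-file=/etc/thanos/rules-longterm/*.yaml \
  --query=query.example.svc:9090 \
  --objstore.config-file=bucket-1y.yml
```

A Thanos Query instance pointed at the long-term ruler and its bucket then replaces the federation Prometheus pollers.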


Still needed! :)


Still valid... :)


Still needed.

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

Still valid :)
