Thanos: receive: Create proposal for backfilling on remote write

Created on 13 May 2020  ·  11 Comments  ·  Source: thanos-io/thanos

As per our discussion here we decided to enable Remote Write backfilling.

Thoughts so far:

  • Use cases: clusters that lag behind (currently only 2h of lag is allowed, same as in Cortex), clock skew between clusters, forgotten metrics, batch jobs with old datasets, monitoring of remote sites, artificial data.
  • Something inefficient is OK for a start. We can aim for rare cases only at first, and support continuous backfilling someday.
  • The easiest approach would be to just open a new TSDB for this case.
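
As a rough illustration of the "open a new TSDB" idea, the sketch below splits incoming samples into those inside the accepted window and those that would have to go to a separate backfill store. Only the 2h limit comes from this issue; the `Sample` layout, the `split_by_age` name and the two-way split are assumptions made for illustration, not how Thanos receive is implemented.

```python
from typing import Iterable, List, Tuple

# Hypothetical sample layout: (series name, timestamp in ms, value).
Sample = Tuple[str, int, float]

ACCEPT_WINDOW_MS = 2 * 60 * 60 * 1000  # the 2h limit mentioned in the issue


def split_by_age(samples: Iterable[Sample], now_ms: int,
                 window_ms: int = ACCEPT_WINDOW_MS
                 ) -> Tuple[List[Sample], List[Sample]]:
    """Return (in_window, too_old).

    Samples in too_old are the ones that would need a separate TSDB.
    """
    in_window: List[Sample] = []
    too_old: List[Sample] = []
    for sample in samples:
        _, ts_ms, _ = sample
        if now_ms - ts_ms <= window_ms:
            in_window.append(sample)
        else:
            too_old.append(sample)
    return in_window, too_old
```

A real design would also have to decide how blocks from the separate store get compacted and queried, which this sketch does not address.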

Help wanted, but the topic is extremely difficult. The design must be agreed on a priori.

cc @gouthamve @pracucci @brancz @richiH, @tomwilkie @squat

receive hard feature request / improvement help wanted stale

All 11 comments

As the discussion in https://github.com/prometheus/prometheus/issues/535 evolved into something quite complex (I see it isn't so easy; initially I thought this was a Prometheus feature that I just didn't know about :smile:), would this enhancement be a Thanos ad hoc solution to backfill historical data into Prometheus?

I'm building an analytical application that reads historical usage data from customers to do some calculations and identify potential optimizations based on this historical set of metrics. We're writing some exporters, and we wouldn't mind writing directly in any Prometheus file format, or pushing to a Thanos endpoint that can backfill data, but I see no docs on how to do this "manually". Are there any references that can help me understand how to do this? Something using gRPC client streaming would be very nice :heart:! But this is an implementation detail that we can handle ourselves.

Anyway, just to be in sync with the community priorities: we can live without backfilling for the next few months, as we're currently developing features and researching solutions, but can I expect the ETA to be the end of this year at the latest? If you need, I can help implement it once you've decided on the architecture!

And thanks for proactively opening this issue. I think this is a really good way to start solving the backfill problem, which the community wants solved so much, judging by the number of reactions and discussions in related threads.

Nice! Welcome to the community :wave:

So actually you might be interested in those discussions:

We can't promise anything, but it looks like some backfilling options are coming pretty soon! (: Currently it's possible, but it requires some magic in Go (:

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks!
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Closing for now as promised, let us know if you need this to be reopened!

This use case is very important for us since even some normal amount of upstream metrics batch ingest will easily result in metrics with over an hour but within 2 hours of lag. The linked discussion in Cortex is very helpful, but ultimately doesn't help until it is implemented here in Thanos :) Can we have this issue re-opened so it doesn't get forgotten?

This has also been discussed at the Prometheus dev summit. Once Prometheus implements it, we should probably follow the same strategy in Thanos. I believe it's unlikely that that will happen in the receive component, but that's a detail I believe.

Hello. Not quite sure if this issue is exactly the right one, but coming from https://github.com/thanos-io/thanos/issues/2490 I have a use case. Please tell me if there is a ticket that fits better.

We are currently implementing anomaly detection with Thanos. It works well so far, but it is quite a complex query which needs data from up to 4 weeks ago. Due to the complexity, it is more readable and performs better when calculations are reused through intermediate recording rules. Of course we do not have data from 4 weeks ago yet, because we just started to write the recording rules. Therefore we would need to wait a full 4 weeks to be sure that the rules are working properly.

With backfilling we might be able to retrospectively calculate the recording rules and see the result immediately.
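
To make the idea concrete, here is a minimal sketch of what retroactive evaluation amounts to: stepping through the past at the rule group's interval and evaluating the rule expression at each step. The `evaluate` callback is hypothetical and stands in for an instant query of the rule expression at a given time; nothing here is an existing Prometheus or Thanos API.

```python
from typing import Callable, Iterator, Tuple


def backfill_rule(evaluate: Callable[[int], float], start_s: int, end_s: int,
                  interval_s: int = 30) -> Iterator[Tuple[int, float]]:
    """Yield (timestamp, value) for every evaluation step in [start_s, end_s].

    evaluate is a hypothetical callback that returns the value of the rule
    expression at the given Unix timestamp (in seconds).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    ts = start_s
    while ts <= end_s:
        yield ts, evaluate(ts)
        ts += interval_s
```

With the 30s interval used in the rule files of this comment, 4 weeks of history comes to about 80,640 evaluations per rule, so a real implementation would likely use range queries rather than one instant query per step.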

Of course we can inline most of the queries and skip the intermediate recording rules, but this requires a lot of resources. It is also impossible when using a counter which does not have a rate recording rule yet, because something like avg_over_time(rate(foobar[5m])[1w]) does not work.


Here is an example rule file:

The version without intermediate recording rules

groups:
  - name: anomaly-detection-1m
    interval: 30s
    rules:
    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 166h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 1w)
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 334h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 2w)
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 502h)
            + avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])
            - avg_over_time(rule_action:wafsc_evaluations:rate1m[1w] offset 3w)
            , "offset", "3w", "", "")
        ) without (offset)

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m
          - rule_action:wafsc_evaluations:rate1m:seasonal_prediction
        ) / stddev_over_time(rule_action:wafsc_evaluations:rate1m[1w])

The optimized version

groups:
  - name: anomaly-detection-1m
    interval: 30s
    rules:
    - record: rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
      expr: avg_over_time(rule_action:wafsc_evaluations:rate1m[1w])

    - record: rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w
      expr: stddev_over_time(rule_action:wafsc_evaluations:rate1m[1w])

    - record: rule_action:wafsc_evaluations:rate1m:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m -
          rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
        ) / rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction
      expr: >
        quantile(0.5,
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 166h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 1w
            , "offset", "1w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 334h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 2w
            , "offset", "2w", "", "")
          or
          label_replace(
            avg_over_time(rule_action:wafsc_evaluations:rate1m[4h] offset 502h)
            + rule_action:wafsc_evaluations:rate1m:avg_over_time_1w
            - rule_action:wafsc_evaluations:rate1m:avg_over_time_1w offset 3w
            , "offset", "3w", "", "")
        ) without (offset)

    - record: rule_action:wafsc_evaluations:rate1m:seasonal_prediction:z_score
      expr: >
        (
          rule_action:wafsc_evaluations:rate1m
          - rule_action:wafsc_evaluations:rate1m:seasonal_prediction
        ) / rule_action:wafsc_evaluations:rate1m:stddev_over_time_1w
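
For readers unfamiliar with the pattern, the arithmetic behind these rules can be sketched outside PromQL. The offsets 166h, 334h and 502h are 1, 2 and 3 weeks minus 2h, so each 4h window is centered on the same moment k weeks ago. Each candidate prediction is that 4h average plus the difference between the current 1w average and the 1w average from k weeks ago, and quantile(0.5, ...) takes the median of the candidates. The function names and sample values below are illustrative assumptions only.

```python
from statistics import median
from typing import Sequence

HOURS_PER_WEEK = 7 * 24


def window_offset_hours(weeks_back: int, window_hours: int = 4) -> int:
    """Offset that centers a window of window_hours on the point weeks_back weeks ago."""
    return weeks_back * HOURS_PER_WEEK - window_hours // 2


def seasonal_prediction(avg_4h_past: Sequence[float], avg_1w_now: float,
                        avg_1w_past: Sequence[float]) -> float:
    """Median of the per-week candidates, mirroring quantile(0.5, ...)."""
    candidates = [past_4h + avg_1w_now - past_1w
                  for past_4h, past_1w in zip(avg_4h_past, avg_1w_past)]
    return median(candidates)


def z_score(current: float, predicted: float, stddev_1w: float) -> float:
    """Deviation of the current value from the prediction, in standard deviations."""
    return (current - predicted) / stddev_1w
```

For example, with 4h averages of 10, 12 and 20 from one, two and three weeks ago, a current 1w average of 11, and past 1w averages of 10, 11 and 9, the candidates are 11, 12 and 22, so the prediction is 12.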

There is already work on retroactive rule evaluation happening in Prometheus. Once that's figured out there, we'll probably implement the same mechanism in Thanos. We look at backfilling data more as a way of retrofitting non-Prometheus data into the system. Retroactive rule evaluation may make use of the same infrastructure, but should be a first-class feature, at least eventually.

For those who found this issue via search: thanks to the work of @bwplotka and @dipack95 on the Prometheus side, it is now possible to import custom data in Prometheus text format into Thanos via the import command of https://github.com/sepich/thanos-kit/.
(It currently reads all the imported data into memory; hopefully the feature will be implemented better in an upstream version. Unfortunately we need this "yesterday", so we can't wait anymore ;)
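
As a hedged sketch of preparing such input, the helper below renders samples as lines in the Prometheus text exposition format, where the optional trailing timestamp is in milliseconds since the epoch. The function names and the sample metric are made up for illustration; the exact input the import command expects should be checked against the thanos-kit README.

```python
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: Dict[str, str], value: float, ts_ms: int) -> str:
    """Render one sample as: name{label="value",...} value timestamp_ms"""
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    selector = f"{name}{{{label_str}}}" if label_str else name
    return f"{selector} {value} {ts_ms}"
```

Writing one such line per historical sample, each with its original timestamp, produces a file in the text format described above.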
