Influxdb: [feature request] auto choose retention policies based on timestamp when querying

Created on 21 May 2015 · 22 comments · Source: influxdata/influxdb

Currently, SELECT uses the default retention policy when none is specified. It would be better if SELECT automatically chose the retention policy for a series based on the query's time range when the statement does not specify one. This would simplify dashboard tools, and it is what Graphite already does.

Having to change the retention policy whenever we want older data is tedious, because it means editing the dashboard definition.

When we select from a series, we do not care which retention policies it has; we just want the data points.
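
To make the request concrete, here is a minimal sketch (the database, retention policy, and measurement names are hypothetical). Today a dashboard has to name the retention policy and switch it as the time range grows:

select mean(value) from mydb."one_week".cpu where time > now() - 1h group by time(1m)
select mean(value) from mydb."one_year".cpu where time > now() - 90d group by time(1d)

The request is that a bare select mean(value) from cpu resolve to whichever retention policy covers the queried time range.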

Labels: RFC, area/queries, kind/feature-request


All 22 comments

A time range alone is not sufficient to identify which retention policy is desired. There is nothing to prevent two series with identical measurement name and tag sets from existing in separate retention policies with overlapping time ranges. Therefore it is not possible for the system to know which series is intended if the retention policy is not provided.

A workaround for now is to keep all data for a given dashboard in the same retention policy, though that does mean maintaining multiple dashboards (one per retention policy).
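
To see why the time range alone is ambiguous, here is a minimal sketch in the influx CLI (all names hypothetical). Nothing prevents the same measurement and tag set from holding different points at the same timestamp in two retention policies:

INSERT INTO "one_week" cpu,host=server01 value=1 1432202000000000000
INSERT INTO "one_year" cpu,host=server01 value=2 1432202000000000000

A query over that timestamp that omits the retention policy has no way to know which of the two series is meant.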

@beckettsean, from our POV this is certainly a fairly important feature, and I don't think your workaround really works. Let me give an example of the problem we have: we capture some data every second. Let's say I/O blocks out (which is in Telegraf). You need data at 1-second granularity for some types of troubleshooting, but in most cases on a graph it would be crazy to worry about 1-second data.

Let's imagine a common case: a dashboard showing all metrics per server. It might default to showing the last hour (3,600 data points per server); per day that is 86,400 points, per month more than 2.5 million. Per hour and perhaps per day will just work, but nobody in their right mind would attempt to keep metrics at 1-second granularity over a year and then graph them. While InfluxDB can downsample them, it is going to have to pull a crazy number of points from disk for that query (and in the real world there would likely be more than one server on a graph; we also have plans to store data in some cases at deltas of a few microseconds). We also have a basic disk space problem: we are already capturing many hundreds of GB of 1s and 10s metrics per day.

The sane pattern is to keep 1-second data for 24 hours, 1-minute for a week, 5-minute for a month, and hourly for a year (or something similar). This is how just about every other system (Graphite, Ganglia, etc.) handles it. We can sort of do this with a continuous query in InfluxDB, copying the downsampled data to a new database (although we have to delete the 1-second data manually). The problem is that we now have a Grafana problem: we can only query either downsampled or original data from a single graph. This means that when a user looks at a 1-hour graph (1-second granularity) and then zooms out to see the last month, we have to change the database, which Grafana does not support.
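
For reference, here is a minimal sketch of that tiered pattern using retention policies within a single database instead of a separate database (the database name telemetry, the RP names, and the durations are all hypothetical); with RPs, the 1-second data also ages out automatically rather than needing manual deletes:

CREATE RETENTION POLICY "raw_1s" ON telemetry DURATION 24h REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "agg_1m" ON telemetry DURATION 7d REPLICATION 1
CREATE RETENTION POLICY "agg_1h" ON telemetry DURATION 365d REPLICATION 1
CREATE CONTINUOUS QUERY cq_1m ON telemetry BEGIN SELECT mean(value) AS value INTO telemetry."agg_1m".:MEASUREMENT FROM telemetry."raw_1s"./.*/ GROUP BY time(1m), * END

(with an analogous cq_1h writing into "agg_1h"). Even then, the Grafana problem just described remains: each graph still has to name exactly one retention policy.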

There are three ways to approach this:

  1. Teach Grafana about this concept (preferably somehow auto-learning that downsampled data exists in this other place, although more realistically defining that in Grafana)
  2. Provide a single "view" inside InfluxDB that merges the various levels of data
  3. Provide a way to down-sample data in InfluxDB after a certain period of time, sort of like a continuous down-sampling job.

My personal preference would be (3), but I suspect that's a non-starter architecturally (although if you would be willing to accept it as an option, we might be able to find somebody to work on it and send you a PR). That leaves us with (1) or (2). This ticket strikes me as asking for (1). Do you think it's best to attack this via (1), or to track an issue more like (2) (for the InfluxDB project)?

cc @sebito91, @wrigtim

A different way of solving this is to build a proxy between Grafana and InfluxDB (we need this anyway to check user access). Parse out the GROUP BY, measurement name, and aggregate of the incoming request at the proxy, and apply a rule to change the measurement name (prepend a retention policy or a custom string fitting your data structure). Send this query to InfluxDB instead of the original. I think this is the most practical solution at the moment.

@PaulKuiper, funnily enough that's exactly how we plan to achieve this (we also have the ACL problem).

Have you already worked on this? We may build this and open source it... or use somebody else's if it's already out there.

Attached is a Python file (in .txt format, since otherwise I could not upload it), which you can use as a simple proxy between Grafana and InfluxDB.

It can greatly increase zoom speed. It assumes that the following continuous queries are present for the measurement called "metric":

metric.1s.max
metric.1m.max
metric.1h.max
metric.1d.max
metric.1h.mean
......

Point your "data source" to port 3004 (or whatever you choose) instead of port 8086 in Grafana.
The proxy will now change your query transparantly by choosing a different table when zooming out.
select max(value) from "metric" where time > x1 and time < x2 group by time(12h)
becomes:
select max(value) from "metric.1h.max" where time > x1 and time < x2 group by time(12h)

poxy.txt

+1

+1

+1!!
Any feedback on @PaulKuiper's workaround?

I'll update it for InfluxDB 0.10 sometime this month.

+1

:+1:

I've done some work on @PaulKuiper's proxy to make it work with 0.10 and put it here:
https://github.com/Lupul/influxdb-grafana-rp-proxy

With version 0.10 it's normal to have several values in each measurement.
In the proxy README the CQs are written for only one value (called value):

CREATE CONTINUOUS QUERY graphite_cq_10sec  ON graphite BEGIN SELECT mean(value) as value INTO graphite."10sec".:MEASUREMENT  FROM graphite."default"./.*/ GROUP BY time(10s), * END

Any ideas on how to handle measurements with several values?

I was thinking of some batch processing with Kapacitor that obtains all the values for each measurement and creates the appropriate CQs.

I have made a small script to autogenerate RPs and CQs: https://gist.github.com/f4b6f5c8f6c2a51c3f60

@adrianlzt, in CQs each tag or field must be explicitly named. It is possible to use SELECT * to return all columns in an ad hoc query, but not in a CQ, since SELECT * carries no aggregation function.

So, to downsample multiple fields, the CQ would look something like this:

CREATE CONTINUOUS QUERY graphite_cq_10sec  
ON graphite BEGIN 
SELECT mean(value) as value, last(value) as last, mean(value_23) as value_23, top(field19, 1) as top
INTO graphite."10sec".:MEASUREMENT  
FROM graphite."default"./.*/ 
GROUP BY time(10s), * 
END

The GROUP BY * clause means that each downsampled value would be stored in a series with the same tag set as the original series. So, while the tags aren't explicitly queried, they will still be part of the downsampled series. Without the GROUP BY * clause above, all tags would be lost during downsampling. It is possible to name explicit tags in the GROUP BY, and then only those tags would be preserved.
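
Concretely, reusing the database and RP names from the CQ above and assuming a hypothetical series tagged with host and region, the two forms differ like this (they are alternatives, not meant to run together):

CREATE CONTINUOUS QUERY cq_keep_all ON graphite BEGIN SELECT mean(value) AS value INTO graphite."10sec".:MEASUREMENT FROM graphite."default"./.*/ GROUP BY time(10s), * END
CREATE CONTINUOUS QUERY cq_keep_host ON graphite BEGIN SELECT mean(value) AS value INTO graphite."10sec".:MEASUREMENT FROM graphite."default"./.*/ GROUP BY time(10s), host END

The first preserves both host and region on the downsampled series; the second keeps only host, so points from different regions on the same host collapse into one series.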

Will this be fixed in upcoming versions? It is hard to maintain downsampling across multiple databases with lots of series with multiple values.


Follow https://github.com/influxdata/influxdb/issues/5750, which is the relevant issue.


Sean Beckett, Director of Support and Professional Services, InfluxDB

+1 for this issue. Without good data downsampling/rollup, moving from Whisper is likely to present problems for Graphite-style graphs over long time periods. Ideally there needs to be a way to set a default rollup and retention policy for all metrics of a specific type.

+1

Coming from the old RRD world, this is obviously a big shift in mentality. I fully agree with https://github.com/influxdata/influxdb/issues/2625#issuecomment-159263547, option (3).

I have never used Graphite, but I have read about it several times, and I liked how you can set different retention policies per metric if desired; otherwise you get the default downsampling automatically.

In my beginner's opinion, the InfluxDB approach is awkward and hard to maintain.

I believe users want a time series database that is efficient, fast, and requires little maintenance. Managing RPs and CQs with complex InfluxQL queries isn't obvious...

My comments may be irrelevant; I am still learning and reading the mailing list and GitHub issues to figure out how to set up downsampling. I am starting to wonder why I waited so long for InfluxDB instead of just using Dixon's Graphite.

Still, InfluxDB is quick to spin up and play with in the tutorial, but it gets more complicated when you really want to do something with it.

Looking through old issues, I found this one. It seems related to #6910.

Agreed. I think this one can basically be closed as a duplicate of #6910.

I'm going to close this in favor of #6910. If a new issue gets created for this, it will be mentioned in that issue.
