I spoke briefly with @mfreed on the slack channel yesterday regarding timescaledb and data retention. I wanted to open an issue here as a place to discuss this issue. It turns out that there has already been some thought put into this by the influxdb folks, and rather than trying to poorly paraphrase their work I'd like to just quote from influxdata/influxdb#7198:
The database should support more intelligent rollups and querying of
aggregated data. Currently, the only way to rollup data is through
manually setting up continuous queries and then manually modifying the
select statements to query that data which requires the user to know
which retention policies exist rather than it being discovered
automatically.
[...]
It should be simple for an administrator to setup rollups for an entire
database and users should not need knowledge of the rollups for them to
automatically start using them. Using rollups should be automatic and
performant.
Timescaledb suffers from the same basic problem, but doubly so because there is not (yet) any continuous-query functionality (although I have high hopes for #14).
The documentation alludes to data retention in the context of drop_chunks, but that's only half the issue. A typical data retention policy would include the rollup of historical data via some sort of aggregation function in order to reduce disk utilization (as in #285) and in order to increase query performance by reducing the number of rows involved in queries spanning longer time intervals. It's not terribly difficult to set this up via a scheduled job; for example, I'm using something like the following for the values table in this schema:
BEGIN;
INSERT INTO :dst_table
SELECT id, time_bucket(:'rollup_interval', tstamp) as tstamp_bucket,
name, avg(value) as value
FROM :src_table
WHERE age(current_timestamp, tstamp) > :'rollup_period'
GROUP BY tstamp_bucket, id, name
ORDER BY tstamp_bucket asc
ON CONFLICT DO NOTHING;
SELECT drop_chunks(interval :'rollup_period', :'src_table');
END;
I might use it like this:
psql -t --echo-all -v src_table=values -v dst_table=values_1min \
-v 'rollup_interval=1 min' -v 'rollup_period=1 day' \
-h stats.example.com -U postgres collectd
But while it works, there are some problems with this solution.
Every table in the database for which this sort of rollup is desired requires a distinct query.
Anything that was previously making queries against the values table now needs to be modified to also include the values_1min table. This gets linearly more complicated as one adds additional retention classes.
This is further complicated by the fact that SQL doesn't really have a good mechanism for concatenating tables. While there is the UNION/UNION ALL statement for concatenating queries, it yields large, unwieldy chunks of repeated code. What one really wants is SELECT * FROM table1 CONCAT table2 CONCAT table3.
The use of drop_chunks to remove rows can be problematic, particularly if the initial chunk time interval was large w/r/t the frequency with which one wants to roll up the data.
In an ideal world TimescaleDB would have first-class support for retention classes. In this model, I wouldn't need the above operation at all, and client code wouldn't need to know about any additional tables. Instead, timescaledb would take care of rolling up historical data, and queries spanning multiple retention classes would Just Work and would use whatever resolution was available for the given interval.
We at AgriConnect, a startup who build IoT-based solution for automatic farm, are looking forward to replace InfluxDB with TimescaleDB. But the missing of this functionality is the only roadblock that stop us from doing so.
In one of our products, we are running a time-series in an ARM board (BeagleBone), but InfluxDB is so a RAM hog.
From my perspective, ideally, combine the features/scenarios from TimescaleDB and pipelineDB together will be the best solution for my case. Imagining that pipelinedb features do JIT aggr and output raws from stream by rules, TimescaleDB for storage, usage and retention....but I don't know when exactly I can see the version can be run as extension for pipelinedb.
Is there any future plans for this functionality?
Can only wait for PipelineDB guys.
Correct me if I am wrong but I think the new add_drop_chunks_policy() and add_reorder_policy() functions should solve the data retention part of this issue.
Note: they are only available for the Enterprise license holders.
This is a feature I've used before in the Whisper database. It would be nice to have similar support in Timescale (the underlying implementation could simply be a continuous aggregate).
https://graphite.readthedocs.io/en/latest/whisper.html#rollup-aggregation
https://graphite.readthedocs.io/en/latest/whisper.html#multi-archive-storage-and-retrieval-behavior
so i guess this is enterprise only?
Hi, is there any way we can configure data to decrease resolution over time as a retention policy instead of dropping whole chunk of data?
Ah, haven't commented here in a while, but continuous aggregates as of v1.3 offer a lot of this function, and real-time aggregates in 1.7 allow for the transparent UNION between a cagg table and raw view.
Also, as of 1.7, data retention (and data reordering) have been moved to the community license (vs. enterprise).
@larsks realize this was from a while ago, but do these solve your initial asks?
Closing this out for now; I believe real-time aggregates supports this between raw and one level of aggregation. (We will leave to future if we introduce real-time merging between multiple aggregate classes.)
Most helpful comment
From my perspective, ideally, combine the features/scenarios from TimescaleDB and pipelineDB together will be the best solution for my case. Imagining that pipelinedb features do JIT aggr and output raws from stream by rules, TimescaleDB for storage, usage and retention....but I don't know when exactly I can see the version can be run as extension for pipelinedb.