Timescaledb: Proposal: support for exporting data before deleting it via drop_chunks

Created on 26 Jun 2018  Â·  16Comments  Â·  Source: timescale/timescaledb

As of now, TimescaleDB's drop_chunks provides an easy to use interface to delete the chunks that are entirely before a given time. Note that this is not about deleting all the data (rows) before the given time. Rather, drop_chunks allows deleting the chunks whose time window is before the specified point (i.e., based on the intervals that can be configured during hypertable creation).

This is useful performance-wise, as deleting chunks are basically just deleting an entire file from disk, while deleting individual rows goes through the entire MVCC process and needs to be later garbage collected through VACUUMing.

However, a common use-case is to dump such data into some kind of cold storage before deleting. For this, there is currently no easy-to-use API in Timescale. This can be achieved manually using _timescaledb_catalog.chunk table. The following is a proposal to integrate such functionality in Timescale API by adding the following two functions:

  • show_chunks - takes a hypertable along with a time indicator and returns the list of chunk-tables representing chunks that ended before the given time. Note that drop_chunks would use show_chunks to figure out which chunks need to be dropped. This means that the user could use show_chunks to find out which chunks would be affected by drop_chunks (as a sanity check or for some other reason).
  • export_chunks - takes the parameters above along with a file format string and exports CSV(s) into the location specified by format string

Below are some use-cases along with the resulting SQL:

  • Export the data older than a week in one csv file without deleting it
SELECT export_chunks(test_table, now() - interval ’1 week’, 
‘/tmp/%hypertable%_%timestamp%.csv’,
 csv_per_chunk=FALSE);
  • Export the data older then one week in csv files (one per chunk) and delete it
BEGIN;
SELECT export_chunks(test_table, now() - interval ’1 week’, 
‘/tmp/%hypertable%_%chunk%.csv’);
SELECT drop_chunks(now() - interval ’1 week’, ‘test_table’);
COMMIT;

And this is the exposed API:

--hypertable_name - name of the timescale table
--older_than - a time indicator constraint defining the end of 
--  time range window from which chunks are selected 
--returns - a column containing chunk tables that satisfy the constraint.
FUNCTION show_chunks (
     hypertable_name REGCLASS,
     older_than INTERVAL = NULL,
)  RETURNS SETOF REGCLASS

--hypertable_name - name of the timescale table
--older_than - a time indicator constraint defining the end of
--  time range window from which chunks are selected 
--format - formatting of file name(s) for the output.
--  The value needs to be an absolute file path 
--  with the following variables supported:
--  %hypertable_name%, %chunk_name%, %chunk_starttime%, %chunk_endtime%, 
--  %epoch%, %timestamp%
--overwrite - if true, will overwrite existing files
--csv_per_chunk - if true, will produce one csv file per chunk table, 
--  otherwise, will produce a single csv file for all of the data. 
--  Note that format should have a chunk identifying variable 
--  if and only if csv_per_chunk is set to true
FUNCTION export_chunks(
     hypertable_name REGCLASS,
     older_than INTERVAL = NULL,
     format TEXT,
     overwrite BOOLEAN=FALSE,
     csv_per_chunk BOOLEAN=TRUE
)  RETURNS VOID

Note that export_chunks would perform a server side copy.
Client side copy (and other desired business specific export functionality) can be achieved with the help of a simple python script and show_chunks function.
As an example, this is how chunks could be exported to separate files.

import psycopg2
import os

conn = psycopg2.connect("dbname=postgres user=myuser")

cur = conn.cursor()
cur.execute("SELECT show_chunks(test_table, now() - interval ’1 week’);")
records = cur.fetchall();
for record in records:
    chunk = record[0]
    path = "/tmp/{hypertable}/{chunk}_dump.csv"
                                    .format(hypertable="test_table",chunk=chunk)

    if (os.path.isfile(path)):
        exit("file %s exists" % path)

    if not os.path.exists(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
        # optionally remove already exported files 
    with open(path, 'w+') as dest:
        cur.copy_expert("COPY {chunk} TO \'{path}\' WITH CSV HEADER"
                                    .format(chunk=chunk, path=path), dest)
cur.execute("SELECT drop_chunks(now() - interval ’1 week’, 'test_table')")
conn.commit()
cur.close()
conn.close()
Data Retention community-request enhancement feedback-wanted

Most helpful comment

Support for a workflow like this would be very helpful:

  1. Backup data older than e.g., 7 days, i.e., similar to "drop_chunks()" but instead of dropping them - back them up
  2. Drop the same data that was just backed up, i.e., data older than 7 days - the "drop_chunks()" would do this
  3. At a later time, when needed, restore one or more of the backed up chunks

So this way, I could e.g., create daily backups of data that is older than 7 days and limit the active database to just 7 days worth of data. Then, when I need to look at older data for troubleshooting / analysis / or something else, I can restore one or more of the daily chunk backups to another instance of TimescaleDB to work with as needed.

All 16 comments

Without having much of substance to add, I'll say this looks like a suitable & desirable feature as described.

This actually looks really good - you could use show_chunks to do data aging to tablespaces hosted on slower disk (or even to other servers via foreign tables) as well.

This can be really useful for me, and I have an additional requirement for a newer_than filter!

@eduardotsj interesting, do you mind providing more detail on the use case for your newer_than filter?

This is my main use case to the export chunks with a newer_than option:

  • We deploy our solution on premises and when we need to setup a good dev DB and mainly when we have to do complex bug analysis we need to bring a DB copy to our office over poor network connection. Our current DB is already about 250GB so it's impossible to download on viable time. So we dump the reference tables (excluding schema _timescaledb_internal) and then dump the newest chunks individually. We get the list using this query:
    SELECT chunk_id, chunk_table, to_timestamp(lower(ranges[1])/1000000) as range_start, to_timestamp(upper(ranges[1])/1000000) as range_end
    FROM chunk_relation_size('{Hypertable}') where to_timestamp(upper(ranges[1])/1000000) > now();
    For debuging old data usually is not relevant.

@eduardotsj excellent, that makes perfect sense. Thank you for the additional context. I'll mark you down as being interested in this feature, and update you as we have more progress.

Export the data older then one week in csv files (one per chunk) and delete it

What happens if changes are made to the chunk while it's being exported? Would updates be lost when calling drop_chunks?

I'd only use this if drop_chunks supported an AS OF SYSTEM TIME clause or some way to guarantee any updates during the export are not lost.

Related to #285, #350, #563, #642.

@alanhamlett We are now actively working on this feature. Regarding what happens when changes are made to drop_chunks, we are thinking of support transactional semantics, so any changes that occur on a chunk actively being dropped will be rejected. How does this approach sound?

That would work for me, since the application layer could handle any write failures gracefully.

Wonderful, thank you for the feedback @alanhamlett! Will reping here once I have more updates.

Not implemented yet, so re-opening.

@dianasaur323 is there an estimate timeline for this feature? Thanks!

Support for a workflow like this would be very helpful:

  1. Backup data older than e.g., 7 days, i.e., similar to "drop_chunks()" but instead of dropping them - back them up
  2. Drop the same data that was just backed up, i.e., data older than 7 days - the "drop_chunks()" would do this
  3. At a later time, when needed, restore one or more of the backed up chunks

So this way, I could e.g., create daily backups of data that is older than 7 days and limit the active database to just 7 days worth of data. Then, when I need to look at older data for troubleshooting / analysis / or something else, I can restore one or more of the daily chunk backups to another instance of TimescaleDB to work with as needed.

@pv97 apologies for taking a while to respond. we are actually working on a different feature in this release, so no specific timeline yet. For now, you'll have to go with a more manual approach.

In 2.0 you will be able to implement this with a user-defined action.

Was this page helpful?
0 / 5 - 0 ratings