Timescaledb: Proposal: support for exporting data before deleting it via drop_chunks

Created on 26 Jun 2018 · 16Comments · Source: timescale/timescaledb

As of now, TimescaleDB's drop_chunks provides an easy to use interface to delete the chunks that are entirely before a given time. Note that this is not about deleting all the data (rows) before the given time. Rather, drop_chunks allows deleting the chunks whose time window is before the specified point (i.e., based on the intervals that can be configured during hypertable creation).

This is useful performance-wise, as deleting chunks are basically just deleting an entire file from disk, while deleting individual rows goes through the entire MVCC process and needs to be later garbage collected through VACUUMing.

However, a common use-case is to dump such data into some kind of cold storage before deleting. For this, there is currently no easy-to-use API in Timescale. This can be achieved manually using _timescaledb_catalog.chunk table. The following is a proposal to integrate such functionality in Timescale API by adding the following two functions:

show_chunks - takes a hypertable along with a time indicator and returns the list of chunk-tables representing chunks that ended before the given time. Note that drop_chunks would use show_chunks to figure out which chunks need to be dropped. This means that the user could use show_chunks to find out which chunks would be affected by drop_chunks (as a sanity check or for some other reason).
export_chunks - takes the parameters above along with a file format string and exports CSV(s) into the location specified by format string

Below are some use-cases along with the resulting SQL:

Export the data older than a week in one csv file without deleting it

SELECT export_chunks(test_table, now() - interval ’1 week’, 
‘/tmp/%hypertable%_%timestamp%.csv’,
 csv_per_chunk=FALSE);

Export the data older then one week in csv files (one per chunk) and delete it

BEGIN;
SELECT export_chunks(test_table, now() - interval ’1 week’, 
‘/tmp/%hypertable%_%chunk%.csv’);
SELECT drop_chunks(now() - interval ’1 week’, ‘test_table’);
COMMIT;

And this is the exposed API:

--hypertable_name - name of the timescale table
--older_than - a time indicator constraint defining the end of 
--  time range window from which chunks are selected 
--returns - a column containing chunk tables that satisfy the constraint.
FUNCTION show_chunks (
     hypertable_name REGCLASS,
     older_than INTERVAL = NULL,
)  RETURNS SETOF REGCLASS

--hypertable_name - name of the timescale table
--older_than - a time indicator constraint defining the end of
--  time range window from which chunks are selected 
--format - formatting of file name(s) for the output.
--  The value needs to be an absolute file path 
--  with the following variables supported:
--  %hypertable_name%, %chunk_name%, %chunk_starttime%, %chunk_endtime%, 
--  %epoch%, %timestamp%
--overwrite - if true, will overwrite existing files
--csv_per_chunk - if true, will produce one csv file per chunk table, 
--  otherwise, will produce a single csv file for all of the data. 
--  Note that format should have a chunk identifying variable 
--  if and only if csv_per_chunk is set to true
FUNCTION export_chunks(
     hypertable_name REGCLASS,
     older_than INTERVAL = NULL,
     format TEXT,
     overwrite BOOLEAN=FALSE,
     csv_per_chunk BOOLEAN=TRUE
)  RETURNS VOID

Note that export_chunks would perform a server side copy.
Client side copy (and other desired business specific export functionality) can be achieved with the help of a simple python script and show_chunks function.
As an example, this is how chunks could be exported to separate files.

import psycopg2
import os

conn = psycopg2.connect("dbname=postgres user=myuser")

cur = conn.cursor()
cur.execute("SELECT show_chunks(test_table, now() - interval ’1 week’);")
records = cur.fetchall();
for record in records:
    chunk = record[0]
    path = "/tmp/{hypertable}/{chunk}_dump.csv"
                                    .format(hypertable="test_table",chunk=chunk)

    if (os.path.isfile(path)):
        exit("file %s exists" % path)

    if not os.path.exists(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
        # optionally remove already exported files 
    with open(path, 'w+') as dest:
        cur.copy_expert("COPY {chunk} TO \'{path}\' WITH CSV HEADER"
                                    .format(chunk=chunk, path=path), dest)
cur.execute("SELECT drop_chunks(now() - interval ’1 week’, 'test_table')")
conn.commit()
cur.close()
conn.close()

Data Retention community-request enhancement feedback-wanted

Source

Ngalstyan4

👍7

Most helpful comment

Support for a workflow like this would be very helpful:

Backup data older than e.g., 7 days, i.e., similar to "drop_chunks()" but instead of dropping them - back them up
Drop the same data that was just backed up, i.e., data older than 7 days - the "drop_chunks()" would do this
At a later time, when needed, restore one or more of the backed up chunks

So this way, I could e.g., create daily backups of data that is older than 7 days and limit the active database to just 7 days worth of data. Then, when I need to look at older data for troubleshooting / analysis / or something else, I can restore one or more of the daily chunk backups to another instance of TimescaleDB to work with as needed.

buzz1000 on 22 Apr 2019

👍2

All 16 comments

Without having much of substance to add, I'll say this looks like a suitable & desirable feature as described.

TSheahan on 27 Jun 2018

This actually looks really good - you could use show_chunks to do data aging to tablespaces hosted on slower disk (or even to other servers via foreign tables) as well.

jamessewell on 27 Jun 2018

This can be really useful for me, and I have an additional requirement for a newer_than filter!

eduardotsj on 29 Jun 2018

@eduardotsj interesting, do you mind providing more detail on the use case for your newer_than filter?

dianasaur323 on 14 Aug 2018

This is my main use case to the export chunks with a newer_than option:

We deploy our solution on premises and when we need to setup a good dev DB and mainly when we have to do complex bug analysis we need to bring a DB copy to our office over poor network connection. Our current DB is already about 250GB so it's impossible to download on viable time. So we dump the reference tables (excluding schema _timescaledb_internal) and then dump the newest chunks individually. We get the list using this query:
SELECT chunk_id, chunk_table, to_timestamp(lower(ranges[1])/1000000) as range_start, to_timestamp(upper(ranges[1])/1000000) as range_end
FROM chunk_relation_size('{Hypertable}') where to_timestamp(upper(ranges[1])/1000000) > now();
For debuging old data usually is not relevant.

eduardotsj on 17 Aug 2018

@eduardotsj excellent, that makes perfect sense. Thank you for the additional context. I'll mark you down as being interested in this feature, and update you as we have more progress.

dianasaur323 on 18 Aug 2018

Export the data older then one week in csv files (one per chunk) and delete it

What happens if changes are made to the chunk while it's being exported? Would updates be lost when calling drop_chunks?

I'd only use this if drop_chunks supported an AS OF SYSTEM TIME clause or some way to guarantee any updates during the export are not lost.

alanhamlett on 20 Sep 2018

Related to #285, #350, #563, #642.

alanhamlett on 23 Sep 2018

@alanhamlett We are now actively working on this feature. Regarding what happens when changes are made to drop_chunks, we are thinking of support transactional semantics, so any changes that occur on a chunk actively being dropped will be rejected. How does this approach sound?

dianasaur323 on 5 Oct 2018

👍1

That would work for me, since the application layer could handle any write failures gracefully.

alanhamlett on 5 Oct 2018

Wonderful, thank you for the feedback @alanhamlett! Will reping here once I have more updates.

dianasaur323 on 5 Oct 2018

Not implemented yet, so re-opening.

dianasaur323 on 30 Nov 2018

👍1

@dianasaur323 is there an estimate timeline for this feature? Thanks!

pv97 on 7 Jan 2019

👍1

Support for a workflow like this would be very helpful:

Backup data older than e.g., 7 days, i.e., similar to "drop_chunks()" but instead of dropping them - back them up
Drop the same data that was just backed up, i.e., data older than 7 days - the "drop_chunks()" would do this
At a later time, when needed, restore one or more of the backed up chunks

buzz1000 on 22 Apr 2019

👍2

@pv97 apologies for taking a while to respond. we are actually working on a different feature in this release, so no specific timeline yet. For now, you'll have to go with a more manual approach.

dianasaur323 on 11 May 2019

In 2.0 you will be able to implement this with a user-defined action.