Timescaledb: manual compression of a table

Created on 1 Dec 2019 · 11Comments · Source: timescale/timescaledb

According to the following document, it is possible to manually compress only individual chunks (see Manual Compression section)

https://docs.timescale.com/latest/using-timescaledb/compression

It would be nice to have a function to compress all uncompressed chunks. It shall exclude last chunk, which might not be full yet.

investigate question

Source

wrobell

👍1

Most helpful comment

I have just discovered today

SELECT compress_chunk(i) from show_chunks('conditions', newer_than, older_than) i;

From: https://docs.timescale.com/latest/using-timescaledb/compression

This is what I was looking for, so closing.

wrobell on 8 Feb 2020

👍2

All 11 comments

Hi @wrobell Can you say more about how the automated policy approach doesn't work for you? I'll have the effect of compressing all uncompressed chunks other than the last K based on your policy.

mfreed on 1 Dec 2019

@mfreed I should be bit more specific, indeed.

My use case is as follows

Create TimescaleDB database.
Create table with compression enabled and chunk set to 24h.
Set automatic chunk compression policy to 24h.
Import few years (or months) of archival data into the table.

Now, I need to wait 24h, I assume, for the compression of chunks of archival data to happen. Sometimes more time is required, but I have not nailed it why.

After import of a large dataset into a table, it would be nice to trigger the compression of the chunks immediately.

wrobell on 1 Dec 2019

So yes, you are correct, the limits here are around what we can do with compressed chunks (right now a compressed chunk can not be written too until it is decompressed) so... if we accept that as a current limitation, perhaps you might consider bringing the time to full for your chunks to something less than 24 hours to be able to compress faster? If you were to bring this down to 12 hours now you are cutting your wait to compress time in half? In terms of the time that it takes to complete the compression process, this may be variable based on resource consumption (do you have enough CPU etc.).

bboule on 3 Dec 2019

Thanks for the suggestions.

To put the beaviour of automatic chunk compression policy into perspective. I am importing about 120 million rows within 6 minutes on my laptop (and could be faster with better disk). Then I need to wait for hours to reduce 10GB dataset to 0.5GB.

If you think that the execution of automatic compression policies could be improved in the context of data import, please let me know. I will create new ticket and close this one as won't fix.

wrobell on 4 Dec 2019

So I think getting the data into the compression process (done via policy ) faster and automatically will help, but I wonder about the resources required to do this (compression tends to be CPU hungry) and so I would temper any expectations against the available resources (just something to be mindful of as you run your test). Let us know how it goes and we can go from there!!

bboule on 4 Dec 2019

Hi @wrobell from your above message, just want to clarify something if there is any confusion: the 24h corresponds to timestamp time, not wall clock time.

mfreed on 5 Dec 2019

Does it mean that if you fill first chunk and start creating 2nd one, then first chunk shall be compressed immediately?

wrobell on 5 Dec 2019

@bboule The CPU question is interesting one. Some other systems will compress data on the fly, i.e. HDF5. Then you trade CPU usage for IO saving. With TimescaleDB, I assume, the data is saved uncompressed, then read uncompressed to save it again, compressed this time? Or is it more advanced under the hood?

wrobell on 5 Dec 2019

Let me provide an example as well.

Create database

createdb test-1
psql -f db-init.sql test-1   # see below for the db-init.sql file

Run the data-load.py script. Below, I provide two versions of the script. First using PostgreSQL COPY command and second INSERT statements. To run a script

python3 data-load.py postgresql://localhost/test-1

Monitor chunks compression with

select now(), hypertable_name, chunk_name, compression_status from timescaledb_information.compressed_chunk_stats;

After few hours of monitoring - none of the chunks are compressed.

The db-init.sql script

create extension timescaledb;

create table data (
  time timestamptz not null,
  v1 float,
  v2 float,
  v3 float
);

select create_hypertable('data', 'time', chunk_time_interval => '1day'::interval);
alter table data set (timescaledb.compress, timescaledb.compress_orderby = 'time');
select add_compress_chunks_policy('data', '1day'::interval);

The data-load.py script (COPY version)

import asyncpg
import asyncio
import numpy as np
import sys
from datetime import datetime

async def load_data(db_uri):
    conn = await asyncpg.connect(db_uri)
    try:
        copy = conn.copy_records_to_table

        end = datetime.utcnow().timestamp()
        start = end - 48 * 3600
        tv = np.arange(start * 1e6, end * 1e6, 1e6).astype('datetime64[us]')

        n = len(tv)
        v1 = np.random.random(n)
        v2 = np.random.random(n)
        v3 = np.random.random(n)

        data = zip(tv.tolist(), v1, v2, v3)
        await copy('data', columns=('time', 'v1', 'v2', 'v3'), records=data)

    finally:
        await conn.close()

db_uri = sys.argv[1]
asyncio.run(load_data(db_uri))

The data-load.py script (INSERT version)

import asyncpg
import asyncio
import numpy as np
import sys
from datetime import datetime

SQL = "insert into data (time, v1, v2, v3) values ($1, $2, $3, $4)"

async def load_data(db_uri):
    conn = await asyncpg.connect(db_uri)
    try:
        end = datetime.utcnow().timestamp()
        start = end - 48 * 3600
        tv = np.arange(start * 1e6, end * 1e6, 1e6).astype('datetime64[us]')

        n = len(tv)
        v1 = np.random.random(n)
        v2 = np.random.random(n)
        v3 = np.random.random(n)

        data = zip(tv.tolist(), v1, v2, v3)
        await conn.executemany(SQL, data)

    finally:
        await conn.close()

db_uri = sys.argv[1]
asyncio.run(load_data(db_uri))

wrobell on 5 Dec 2019

I have just discovered today

SELECT compress_chunk(i) from show_chunks('conditions', newer_than, older_than) i;

From: https://docs.timescale.com/latest/using-timescaledb/compression

This is what I was looking for, so closing.

wrobell on 8 Feb 2020

👍2

Cool snippet @wrobell this almost could be in the official docs!

jflambert on 11 May 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Create Index on Hypertable without using the time_column_name

thaxy · 4Comments

Unable to load timescaledb-1.3.2.dll when creating an EXTENSION

AnderssonDavid · 4Comments

Automatic rollup table selection

tkurki · 3Comments

Timescale installer is not finding openssl

arifainchtein · 4Comments

postgresql (timescaledb) - sliding window - finding mean value of temp variable over every 5 seconds.

Manjureka · 4Comments