According to the following document, it is possible to manually compress only individual chunks (see Manual Compression section)
https://docs.timescale.com/latest/using-timescaledb/compression
It would be nice to have a function to compress all uncompressed chunks. It shall exclude last chunk, which might not be full yet.
Hi @wrobell Can you say more about how the automated policy approach doesn't work for you? I'll have the effect of compressing all uncompressed chunks other than the last K based on your policy.
@mfreed I should be bit more specific, indeed.
My use case is as follows
Now, I need to wait 24h, I assume, for the compression of chunks of archival data to happen. Sometimes more time is required, but I have not nailed it why.
After import of a large dataset into a table, it would be nice to trigger the compression of the chunks immediately.
So yes, you are correct, the limits here are around what we can do with compressed chunks (right now a compressed chunk can not be written too until it is decompressed) so... if we accept that as a current limitation, perhaps you might consider bringing the time to full for your chunks to something less than 24 hours to be able to compress faster? If you were to bring this down to 12 hours now you are cutting your wait to compress time in half? In terms of the time that it takes to complete the compression process, this may be variable based on resource consumption (do you have enough CPU etc.).
Thanks for the suggestions.
To put the beaviour of automatic chunk compression policy into perspective. I am importing about 120 million rows within 6 minutes on my laptop (and could be faster with better disk). Then I need to wait for hours to reduce 10GB dataset to 0.5GB.
If you think that the execution of automatic compression policies could be improved in the context of data import, please let me know. I will create new ticket and close this one as won't fix.
So I think getting the data into the compression process (done via policy ) faster and automatically will help, but I wonder about the resources required to do this (compression tends to be CPU hungry) and so I would temper any expectations against the available resources (just something to be mindful of as you run your test). Let us know how it goes and we can go from there!!
Hi @wrobell from your above message, just want to clarify something if there is any confusion: the 24h corresponds to timestamp time, not wall clock time.
Does it mean that if you fill first chunk and start creating 2nd one, then first chunk shall be compressed immediately?
@bboule The CPU question is interesting one. Some other systems will compress data on the fly, i.e. HDF5. Then you trade CPU usage for IO saving. With TimescaleDB, I assume, the data is saved uncompressed, then read uncompressed to save it again, compressed this time? Or is it more advanced under the hood?
Let me provide an example as well.
createdb test-1
psql -f db-init.sql test-1 # see below for the db-init.sql file
data-load.py script. Below, I provide two versions of the script. First using PostgreSQL COPY command and second INSERT statements. To run a scriptpython3 data-load.py postgresql://localhost/test-1
select now(), hypertable_name, chunk_name, compression_status from timescaledb_information.compressed_chunk_stats;
The db-init.sql script
create extension timescaledb;
create table data (
time timestamptz not null,
v1 float,
v2 float,
v3 float
);
select create_hypertable('data', 'time', chunk_time_interval => '1day'::interval);
alter table data set (timescaledb.compress, timescaledb.compress_orderby = 'time');
select add_compress_chunks_policy('data', '1day'::interval);
The data-load.py script (COPY version)
import asyncpg
import asyncio
import numpy as np
import sys
from datetime import datetime
async def load_data(db_uri):
conn = await asyncpg.connect(db_uri)
try:
copy = conn.copy_records_to_table
end = datetime.utcnow().timestamp()
start = end - 48 * 3600
tv = np.arange(start * 1e6, end * 1e6, 1e6).astype('datetime64[us]')
n = len(tv)
v1 = np.random.random(n)
v2 = np.random.random(n)
v3 = np.random.random(n)
data = zip(tv.tolist(), v1, v2, v3)
await copy('data', columns=('time', 'v1', 'v2', 'v3'), records=data)
finally:
await conn.close()
db_uri = sys.argv[1]
asyncio.run(load_data(db_uri))
The data-load.py script (INSERT version)
import asyncpg
import asyncio
import numpy as np
import sys
from datetime import datetime
SQL = "insert into data (time, v1, v2, v3) values ($1, $2, $3, $4)"
async def load_data(db_uri):
conn = await asyncpg.connect(db_uri)
try:
end = datetime.utcnow().timestamp()
start = end - 48 * 3600
tv = np.arange(start * 1e6, end * 1e6, 1e6).astype('datetime64[us]')
n = len(tv)
v1 = np.random.random(n)
v2 = np.random.random(n)
v3 = np.random.random(n)
data = zip(tv.tolist(), v1, v2, v3)
await conn.executemany(SQL, data)
finally:
await conn.close()
db_uri = sys.argv[1]
asyncio.run(load_data(db_uri))
I have just discovered today
SELECT compress_chunk(i) from show_chunks('conditions', newer_than, older_than) i;
From: https://docs.timescale.com/latest/using-timescaledb/compression
This is what I was looking for, so closing.
Cool snippet @wrobell this almost could be in the official docs!
Most helpful comment
I have just discovered today
From: https://docs.timescale.com/latest/using-timescaledb/compression
This is what I was looking for, so closing.