ClickHouse: Is it possible to compress old data partitions?

Created on 24 Dec 2019 · 6 comments · Source: ClickHouse/ClickHouse

Hello,

I am using a MergeTree table partitioned by date.

Is there a way to compress old partitions?

For example, suppose the data spans two years:

| 2019 | 2020 |
| -------- | --------- |
| Compressed | Uncompressed |

The 2019 partition would be compressed, while the newly arriving 2020 data would stay uncompressed.

This is because, next year, the 2019 data will likely not be read as often as the 2020 data.

Thank you.
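
For reference, a minimal sketch of the kind of table described in the question (a MergeTree partitioned by date); the table and column names here are hypothetical:

-- Hypothetical example only: a MergeTree table partitioned by month of the date column.
create table events
(event_date Date, user_id UInt64, payload String)
Engine=MergeTree
partition by toYYYYMM(event_date)
order by (event_date, user_id);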

question


All 6 comments

What do you mean by "is uncompressed"?
Without LZ4 or ZSTD data compression? There is no such feature. Compression is set at the server / column level.

Uncompressed data will not make the system faster: LZ4 is a CPU-lightweight compression method, and it saves a lot of disk I/O, so LZ4 compression usually makes a system faster.
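
As a hedged illustration of the "column level" point (this sketch is not part of the original answer, and the table and column names are hypothetical): the codec is chosen per column in the table definition, using the same CODEC(...) syntax that appears in the test table further down, so a heavy, rarely-read column could use ZSTD while the rest stay on the LZ4 default.

-- Hypothetical sketch: compression is declared per column, not per partition.
create table example
(d Date,
 heavy String Codec(ZSTD),   -- stronger compression, more CPU
 light UInt64 Codec(LZ4))    -- CPU-light, same idea as the default
Engine=MergeTree
partition by toYYYYMM(d)
order by d;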

@den-crane
Yes, by "uncompressed" I mean without LZ4 or ZSTD.
Thank you for your advice.

Also, could I get a more detailed benchmark comparing total read speed (decompression + disk read) between LZ4 and uncompressed data?

By default, data is compressed with LZ4:
https://clickhouse.yandex/docs/en/operations/server_settings/settings/#compression
https://www.altinity.com/blog/2017/11/21/compression-in-clickhouse

Let's test with RANDOM numbers.

Nothing to compress.

create table compressionTest
(A Int64, col_lz4 Int64 Codec(LZ4), col_none Int64 Codec(NONE)) 
Engine=MergeTree order by A;

insert into compressionTest select number,  cityHash64(number), cityHash64(number) 
from numbers(100000000);

SELECT
    formatReadableSize(sum(column_bytes_on_disk)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    table,
    column
FROM system.parts_columns
WHERE (active = 1) AND (table LIKE '%compressionTest%')
GROUP BY
    table,
    column

┌─compressed─┬─uncompressed─┬─table───────────┬─column───┐
│ 97.80 MiB  │ 762.94 MiB   │ compressionTest │ A        │
│ 766.51 MiB │ 762.94 MiB   │ compressionTest │ col_lz4  │
│ 763.51 MiB │ 762.94 MiB   │ compressionTest │ col_none │
└────────────┴──────────────┴─────────────────┴──────────┘


select count() from compressionTest prewhere col_lz4 = 666;
Elapsed: 0.055 sec. Processed 100.00 million rows, 800.00 MB (1.81 billion rows/s., 14.46 GB/s.)

select count() from compressionTest prewhere col_none = 666;
Elapsed: 0.062 sec. Processed 100.00 million rows, 800.00 MB (1.61 billion rows/s., 12.87 GB/s.)

LZ4 is faster! 1.81 billion rows/s vs. 1.61 billion rows/s.

Add some duplicates

insert into compressionTest select number,  cityHash64(number), cityHash64(number) 
from numbers(100000000);

insert into compressionTest select number+5,  cityHash64(number), cityHash64(number) 
from numbers(100000000);

insert into compressionTest select number+13,  cityHash64(number), cityHash64(number) 
from numbers(100000000);

optimize table compressionTest final;

┌─compressed─┬─uncompressed─┬─table───────────┬─column───┐
│ 101.33 MiB │ 2.98 GiB     │ compressionTest │ A        │
│ 1.59 GiB   │ 2.98 GiB     │ compressionTest │ col_lz4  │
│ 2.98 GiB   │ 2.98 GiB     │ compressionTest │ col_none │
└────────────┴──────────────┴─────────────────┴──────────┘

select count() from compressionTest prewhere col_lz4 = 666;
Elapsed: 0.139 sec. Processed 400.00 million rows, 3.20 GB (2.88 billion rows/s., 23.07 GB/s.)

select count() from compressionTest prewhere col_none = 666;
Elapsed: 0.239 sec. Processed 400.00 million rows, 3.20 GB (1.67 billion rows/s., 13.37 GB/s.)

LZ4 is significantly faster! 2.88 billion rows/s vs. 1.67 billion rows/s.

@achimbab, do you have any more questions about the compressed/uncompressed scenario? If everything is clear, please close the issue.

@den-crane
Thank you very much.

@qoega
I will close this issue.

@den-crane
FYI, I am sharing my test results.
Regarding CPU: the slower the disk, the more worthwhile compression becomes.
Regarding disk: compression gives a significant space saving.
But the default codec is the best in my case.

| compression method | size on disk | rows/s |
| ------------------- | ------------ | ------------ |
| default | 766.34 MiB | 3.89 million |
| lz4 | 1.59 GiB | 2.26 billion |
| none | 2.98 GiB | 2.34 billion |

Specifications

ClickHouse server version 19.19.1 revision 54430

Intel(R) Xeon(R) Silver 4110
256 GB memory
NVMe SSD

Nothing to compress.

achimbab.host :) create table compressionTest
:-] (A Int64, col_lz4 Int64 Codec(LZ4), col_none Int64 Codec(NONE))
:-] Engine=MergeTree order by A;

CREATE TABLE compressionTest
(
    `A` Int64,
    `col_lz4` Int64 CODEC(LZ4),
    `col_none` Int64 CODEC(NONE)
)
ENGINE = MergeTree
ORDER BY A

Ok.

0 rows in set. Elapsed: 0.015 sec.

achimbab.host :) insert into compressionTest select number,  cityHash64(number), cityHash64(number)
:-] from numbers(100000000);

INSERT INTO compressionTest SELECT
    number,
    cityHash64(number),
    cityHash64(number)
FROM numbers(100000000)

Ok.

0 rows in set. Elapsed: 6.661 sec. Processed 100.01 million rows, 800.06 MB (15.01 million rows/s., 120.11 MB/s.)

achimbab.host :) SELECT
:-]     formatReadableSize(sum(column_bytes_on_disk)) AS compressed,
:-]     formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
:-]     table,
:-]     column
:-] FROM system.parts_columns
:-] WHERE (active = 1) AND (table LIKE '%compressionTest%')
:-] GROUP BY
:-]     table,
:-]     column

SELECT
    formatReadableSize(sum(column_bytes_on_disk)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    table,
    column
FROM system.parts_columns
WHERE (active = 1) AND (table LIKE '%compressionTest%')
GROUP BY
    table,
    column

┌─compressed─┬─uncompressed─┬─table───────────┬─column───┐
│ 382.19 MiB │ 762.94 MiB   │ compressionTest │ A        │
│ 766.51 MiB │ 762.94 MiB   │ compressionTest │ col_lz4  │
│ 763.51 MiB │ 762.94 MiB   │ compressionTest │ col_none │
└────────────┴──────────────┴─────────────────┴──────────┘

3 rows in set. Elapsed: 0.003 sec.

achimbab.host :) select count() from compressionTest prewhere col_lz4 = 666;

SELECT count()
FROM compressionTest
PREWHERE col_lz4 = 666

┌─count()─┐
│       0 │
└─────────┘

1 rows in set. Elapsed: 0.049 sec. Processed 100.00 million rows, 800.00 MB (2.05 billion rows/s., 16.43 GB/s.)

achimbab.host :) select count() from compressionTest prewhere col_none = 666;

SELECT count()
FROM compressionTest
PREWHERE col_none = 666

┌─count()─┐
│       0 │
└─────────┘

1 rows in set. Elapsed: 0.055 sec. Processed 100.00 million rows, 800.00 MB (1.80 billion rows/s., 14.44 GB/s.)

Add some duplicates

achimbab.host :) insert into compressionTest select number,  cityHash64(number), cityHash64(number)
:-] from numbers(100000000);

INSERT INTO compressionTest SELECT
    number,
    cityHash64(number),
    cityHash64(number)
FROM numbers(100000000)

Ok.

0 rows in set. Elapsed: 6.763 sec. Processed 100.01 million rows, 800.06 MB (14.79 million rows/s., 118.30 MB/s.)

achimbab.host :) insert into compressionTest select number+5,  cityHash64(number), cityHash64(number)
:-] from numbers(100000000);

INSERT INTO compressionTest SELECT
    number + 5,
    cityHash64(number),
    cityHash64(number)
FROM numbers(100000000)

Ok.

0 rows in set. Elapsed: 6.732 sec. Processed 100.01 million rows, 800.06 MB (14.86 million rows/s., 118.84 MB/s.)

achimbab.host :) insert into compressionTest select number+13,  cityHash64(number), cityHash64(number)
:-] from numbers(100000000);

INSERT INTO compressionTest SELECT
    number + 13,
    cityHash64(number),
    cityHash64(number)
FROM numbers(100000000)

Ok.

0 rows in set. Elapsed: 6.760 sec. Processed 100.01 million rows, 800.06 MB (14.79 million rows/s., 118.36 MB/s.)

achimbab.host :) optimize table compressionTest final;

OPTIMIZE TABLE compressionTest FINAL

Ok.

0 rows in set. Elapsed: 43.683 sec.

achimbab.host :) SELECT
:-]     formatReadableSize(sum(column_bytes_on_disk)) AS compressed,
:-]     formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
:-]     table,
:-]     column
:-] FROM system.parts_columns
:-] WHERE (active = 1) AND (table LIKE '%compressionTest%')
:-] GROUP BY
:-]     table,
:-]     column

SELECT
    formatReadableSize(sum(column_bytes_on_disk)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    table,
    column
FROM system.parts_columns
WHERE (active = 1) AND (table LIKE '%compressionTest%')
GROUP BY
    table,
    column

┌─compressed─┬─uncompressed─┬─table───────────┬─column───┐
│ 766.34 MiB │ 2.98 GiB     │ compressionTest │ A        │
│ 1.59 GiB   │ 2.98 GiB     │ compressionTest │ col_lz4  │
│ 2.98 GiB   │ 2.98 GiB     │ compressionTest │ col_none │
└────────────┴──────────────┴─────────────────┴──────────┘

3 rows in set. Elapsed: 0.004 sec.

achimbab.host :) select count() from compressionTest prewhere col_lz4 = 666;

SELECT count()
FROM compressionTest
PREWHERE col_lz4 = 666

┌─count()─┐
│       0 │
└─────────┘

1 rows in set. Elapsed: 0.177 sec. Processed 400.00 million rows, 3.20 GB (2.26 billion rows/s., 18.05 GB/s.)

achimbab.host :) select count() from compressionTest prewhere col_none = 666;

SELECT count()
FROM compressionTest
PREWHERE col_none = 666

┌─count()─┐
│       0 │
└─────────┘

1 rows in set. Elapsed: 0.171 sec. Processed 400.00 million rows, 3.20 GB (2.34 billion rows/s., 18.69 GB/s.)