Suppose we have a Domain column of String type in a MergeTree table. The column contains a small number (up to 100) of unique domains (60 bytes each). ClickHouse compresses such a column in blocks with sizes in the range min_compress_block_size (currently 64 KB) to max_compress_block_size (currently 1 MB).
Currently ClickHouse compresses String column blocks as-is, without any pre-processing, which may result in poor compression ratios for blocks containing low-cardinality values.
8K rows containing 100 unique domains with an average size of 60 bytes may be pre-processed into a 100*60 = 6 KB dictionary plus 8K x 1 byte of dictionary indexes before compression. The dictionary may be sorted before compression in order to achieve an even better compression ratio.
ClickHouse may estimate the cardinality of each String column block before compression and pre-process the block as described above if it contains a low number of unique values.
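A minimal sketch of this per-block pre-processing (illustrative C++ only, not actual ClickHouse code; the limit of fewer than 256 unique values and the one-byte indexes are assumptions taken from the figures above):

// Illustrative sketch, not ClickHouse code: if a block of strings has fewer
// than 256 unique values, replace it with a sorted dictionary plus one byte
// of dictionary index per row; otherwise leave the block as-is.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct DictEncodedBlock
{
    std::vector<std::string> dictionary; // sorted unique values
    std::vector<uint8_t> indexes;        // one index per row
};

bool tryDictEncode(const std::vector<std::string> & block, DictEncodedBlock & out)
{
    std::map<std::string, uint8_t> positions; // std::map keeps the keys sorted
    for (const auto & s : block)
        if (positions.emplace(s, 0).second && positions.size() > 255)
            return false; // cardinality too high, compress the block as-is

    uint8_t next = 0;
    for (auto & [value, index] : positions)
    {
        index = next++;
        out.dictionary.push_back(value);
    }

    out.indexes.reserve(block.size());
    for (const auto & s : block)
        out.indexes.push_back(positions[s]);
    return true;
}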
As an additional benefit, such compression should improve SELECT query speed, since:
1) column expressions may be matched against dictionary values without decompressing the dictionary indexes;
2) matching against dictionary indexes should be faster than matching against the values themselves, since dictionary indexes are much smaller than the corresponding string values.
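For point 2, a hypothetical filter over such a block could evaluate the predicate once per dictionary entry and then only scan the byte indexes (a sketch under the same assumed block layout as above, not actual ClickHouse code):

// Sketch: the predicate runs at most once per dictionary entry (e.g. 100
// times), not once per row (e.g. 8192 times); the per-row work is a lookup
// in a small bitmap indexed by the byte-sized dictionary index.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

std::vector<size_t> filterRows(
    const std::vector<std::string> & dictionary,
    const std::vector<uint8_t> & indexes,
    const std::function<bool(const std::string &)> & predicate)
{
    std::vector<bool> matches(dictionary.size());
    for (size_t i = 0; i < dictionary.size(); ++i)
        matches[i] = predicate(dictionary[i]);

    std::vector<size_t> matchingRows;
    for (size_t row = 0; row < indexes.size(); ++row)
        if (matches[indexes[row]])
            matchingRows.push_back(row);
    return matchingRows;
}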
See a simple program for evaluating the potential win when converting string column blocks into dictionary-indexed int column blocks: https://play.golang.org/p/zLK9z6FMOK .
It generates count strings containing uniqCount unique values with an average string length of avgLen. Then it constructs three blocks from these strings and reports their sizes: plain; noDict (standard gzip is used for the compression, results for zstd are similar); dict. Here are a few sample outputs from the program:
count=8192, uniqCount=100, avgLen=60
plain size: 530653
noDict size: 27748, rate: 19.12
dict size: 12518, rate: 42.39
Improvement 221.665%
count=8192, uniqCount=256, avgLen=15
plain size: 152949
noDict size: 22609, rate: 6.76
dict size: 12105, rate: 12.64
Improvement 186.774%
The following case emulates enum usage via String columns, where enum contains 5 unique values per block with the average size of 10 bytes:
count=8192, uniqCount=5, avgLen=10
plain size: 108010
noDict size: 5226, rate: 20.67
dict size: 3085, rate: 35.01
Improvement 169.400%
A single unique value in the block:
count=8192, uniqCount=1, avgLen=10
plain size: 90112
noDict size: 220, rate: 409.60
dict size: 54, rate: 1668.74
Improvement 407.407%
Every case shows a clear win in compression ratio when the number of unique strings in the block is less than 256.
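For reference, the comparison the program reports can be sketched as follows (this is not the linked Go program; zlib's compress2() stands in for gzip, and the dictionary/index layout from the sketch above is assumed):

// Sketch of the size comparison: "noDict" compresses the concatenated rows
// as-is, "dict" compresses the concatenated dictionary followed by the
// one-byte-per-row indexes.
#include <cstdint>
#include <string>
#include <vector>
#include <zlib.h>

size_t compressedSize(const std::string & data)
{
    uLongf dstLen = compressBound(data.size());
    std::vector<Bytef> dst(dstLen);
    compress2(dst.data(), &dstLen,
              reinterpret_cast<const Bytef *>(data.data()), data.size(),
              Z_DEFAULT_COMPRESSION);
    return dstLen;
}

size_t noDictSize(const std::vector<std::string> & rows)
{
    std::string plain;
    for (const auto & r : rows)
        plain += r;
    return compressedSize(plain);
}

size_t dictSize(const std::vector<std::string> & dictionary, const std::vector<uint8_t> & indexes)
{
    std::string buf;
    for (const auto & d : dictionary)
        buf += d;
    buf.append(indexes.begin(), indexes.end());
    return compressedSize(buf);
}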
Sorry, but it sounds like reinventing the wheel. If you don't know the properties of the strings - cardinality, entropy, etc. - the best universal way of making strings smaller is called "compression" :) Most compression algorithms use an approach similar to what you describe. ClickHouse uses LZ4 (or optionally zstd), which in the common case deals well with compressing such strings. You can adjust the compression settings if the defaults don't satisfy you.
ClickHouse can't know the properties of your data ahead of time: maybe after a million identical strings you will start putting random data there. If you, as a database administrator, know the properties of your data, you can use an enum or a dict BEFORE putting the data into ClickHouse, and you will get the effect you expect.
I've just checked how ClickHouse compresses such strings - in my case I have a message field in one of our tables. That is a String field which currently has 25 different values of length 9-25 chars. In about 50% of rows that field is empty.
The same table also has an enum field status which can have 3 different values and can be used as a reference.
For a dataset of about 700 mln rows I have the following result:
SELECT
    name,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    (sum(data_compressed_bytes) * 100) / sum(data_uncompressed_bytes) AS compress_ratio
FROM system.columns
WHERE (table = 'table') AND (name IN ('message', 'status', 'logdate'))
GROUP BY name
ORDER BY sum(data_compressed_bytes) ASC
┌─name────┬─compressed─┬─uncompressed─┬─────compress_ratio─┐
│ status  │ 38.89 MiB  │ 658.34 MiB   │  5.907979847191937 │
│ message │ 141.37 MiB │ 6.07 GiB     │ 2.2743705225173136 │
│ logdate │ 2.10 GiB   │ 2.57 GiB     │  81.77627195011733 │
└─────────┴────────────┴──────────────┴────────────────────┘
So yes, as you could expect, the enum is smaller. But the compression ratio of strings with low cardinality is really good.
And the size of the compressed message column is negligible compared, for example, with a timestamp column (I've added the logdate column stats as a reference; it is a timestamp field).
> Sorry, but it sounds like reinventing the wheel. If you don't know the properties of the strings - cardinality, entropy, etc. - the best universal way of making strings smaller is called "compression" :) Most compression algorithms use an approach similar to what you describe. ClickHouse uses LZ4 (or optionally zstd), which in the common case deals well with compressing such strings. You can adjust the compression settings if the defaults don't satisfy you.

The proposed transformation doesn't substitute for compression - it is applied before compression in order to improve the overall compression ratio. The simulation shows significant gains, which may result in both space savings and performance improvements for SELECT queries.
> ClickHouse can't know the properties of your data ahead of time: maybe after a million identical strings you will start putting random data there.

The proposed transformation is applied independently to each column block, not to the whole column. The case with identical strings followed by random data is handled naturally, since each block is encoded according to its own data properties.
> If you, as a database administrator, know the properties of your data, you can use an enum or a dict BEFORE putting the data into ClickHouse, and you will get the effect you expect.

Enum and dictionary-based columns aren't flexible enough, since they require additional housekeeping from the user - either altering the enum column with new values or adding new entries to the dictionary.
> I've just checked how ClickHouse compresses such strings - in my case I have a message field in one of our tables. That is a String field which currently has 25 different values of length 9-25 chars.

The simulation shows a 2x better compression ratio for the case with 25 unique values of 9-25 chars each, if these values are randomly distributed in a column block of 8192 items - the recommended block size for MergeTree tables:
count=8192, uniqCount=25, avgLen=16
plain size: 161094
noDict size: 11598, rate: 13.89
dict size: 5466, rate: 29.47
Improvement 212.184%
I think it would perhaps be an easier and more general solution to implement support for zstd dictionaries.
The merge phase could train a dictionary for each column and part, which would then be used for compression and decompression. See the API: https://github.com/facebook/zstd/blob/dev/lib/dictBuilder/zdict.h#L54
Decompression using dictionaries should be faster as well: https://github.com/facebook/zstd#the-case-for-small-data-compression
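For illustration, a rough sketch of that idea with zstd's public C API (error handling trimmed; the 64 KiB dictionary capacity and compression level 3 are arbitrary placeholders, not suggested defaults):

// Rough sketch: train one dictionary per column and part from its block
// samples (ZDICT), then reuse it to compress every block of that part.
#include <string>
#include <vector>
#include <zdict.h>
#include <zstd.h>

std::vector<char> trainPartDictionary(const std::vector<std::string> & blockSamples)
{
    std::string samples;             // all samples concatenated back to back
    std::vector<size_t> sampleSizes; // size of each sample
    for (const auto & block : blockSamples)
    {
        samples += block;
        sampleSizes.push_back(block.size());
    }

    std::vector<char> dict(64 * 1024); // dictionary capacity (placeholder)
    size_t dictSize = ZDICT_trainFromBuffer(
        dict.data(), dict.size(),
        samples.data(), sampleSizes.data(),
        static_cast<unsigned>(sampleSizes.size()));
    if (ZDICT_isError(dictSize))
        return {};
    dict.resize(dictSize);
    return dict;
}

std::vector<char> compressBlockWithDict(const std::string & block, const std::vector<char> & dict)
{
    ZSTD_CCtx * cctx = ZSTD_createCCtx();
    std::vector<char> dst(ZSTD_compressBound(block.size()));
    size_t compressed = ZSTD_compress_usingDict(
        cctx, dst.data(), dst.size(), block.data(), block.size(),
        dict.data(), dict.size(), /* compressionLevel */ 3);
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(compressed))
        return {};
    dst.resize(compressed);
    return dst;
}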
Separate dictionary for each block? Wouldn't it take twice the time?
What about an auto-expanding enum that adds new values automatically when they are encountered? MapD does that.
FYI, LocustDB implemented dictionary encoding for string column blocks with low cardinality. This gave a 50x speedup compared to ClickHouse on the following query over the New York taxi rides dataset:
SELECT
pickup_ntaname,
count()
FROM trips
GROUP BY pickup_ntaname
See dictionary encoding results at https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/2/ for details.
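The reason dictionary encoding helps such a query is that GROUP BY can aggregate over small integer codes instead of hashing strings; a sketch under the block layout assumed in the earlier examples (illustrative only, not LocustDB or ClickHouse code):

// Sketch: counting rows per dictionary code needs only an array of counters
// indexed by the one-byte code - no string hashing or comparisons per row.
#include <cstdint>
#include <vector>

std::vector<uint64_t> countByCode(const std::vector<uint8_t> & indexes, size_t dictionarySize)
{
    std::vector<uint64_t> counts(dictionarySize, 0);
    for (uint8_t code : indexes)
        ++counts[code];
    return counts; // counts[i] corresponds to dictionary[i]
}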
@KochetovNicolai is doing this task in the https://github.com/yandex/ClickHouse/tree/data-type-with-dictionary branch.
Hi @KochetovNicolai, do you have an estimate of when this feature is going to be ready?
Is the branch already testable?
PR #2830
@KochetovNicolai, @alexey-milovidov, as I understand it, this feature has already been implemented.
Solved.
My compliments to @valyala for raising this issue and of course to @alexey-milovidov, @KochetovNicolai and the others who contributed to its implementation. Although this is solved, I have opened a discussion on its performance and a possibly more efficient alternative implementation using bit-stuffed pointers (#4796) and binary operations. At a higher level, I am investigating the possibility of avoiding secondary indexes by maintaining 'symbol tables', i.e. dictionaries that can be linked to associations/tuples (see also #3573) using 2D/3D dimension keys related to table, column and row. That said, I also foresee the involvement and use of ClickHouse dictionaries in such an implementation as an upper layer...