Suppose we have a Domain column of String type in a MergeTree table. The column contains a small number (up to 100) of unique domains (60 bytes each). ClickHouse compresses such a column in blocks with sizes in the range min_compress_block_size (currently 64 KB) to max_compress_block_size (currently 1 MB).
Currently ClickHouse compresses String column blocks as-is, without any pre-processing, which may result in poor compression ratios for blocks containing low-cardinality values.
8K rows containing 100 unique domains with an average size of 60 bytes may be pre-processed into a 100*60 = 6 KB dictionary plus 8K x 1 byte of dictionary indexes before compression. The dictionary may be sorted before compression in order to achieve an even better compression ratio.
ClickHouse may estimate the cardinality of each String column block before compression and pre-process the block as described above if it contains a low number of unique values.
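A minimal sketch of this per-block pre-processing (illustrative C++ only, not actual ClickHouse code; the limit of fewer than 256 unique values and the one-byte indexes are assumptions taken from the figures above):

// Illustrative sketch, not ClickHouse code: if a block of strings has fewer
// than 256 unique values, replace it with a sorted dictionary plus one byte
// of dictionary index per row; otherwise leave the block as-is.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct DictEncodedBlock
{
    std::vector<std::string> dictionary; // sorted unique values
    std::vector<uint8_t> indexes;        // one index per row
};

bool tryDictEncode(const std::vector<std::string> & block, DictEncodedBlock & out)
{
    std::map<std::string, uint8_t> positions; // std::map keeps the keys sorted
    for (const auto & s : block)
        if (positions.emplace(s, 0).second && positions.size() > 255)
            return false; // cardinality too high, compress the block as-is

    uint8_t next = 0;
    for (auto & [value, index] : positions)
    {
        index = next++;
        out.dictionary.push_back(value);
    }

    out.indexes.reserve(block.size());
    for (const auto & s : block)
        out.indexes.push_back(positions[s]);
    return true;
}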
As an additional benefit, such compression should improve SELECT query speed, since:
1) column expressions may be matched against dictionary values without decompressing the dictionary indexes;
2) matching against dictionary indexes should be faster than matching against the values themselves, since dictionary indexes are much smaller than the corresponding string values.
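For point 2, a hypothetical filter over such a block could evaluate the predicate once per dictionary entry and then only scan the byte indexes (a sketch under the same assumed block layout as above, not actual ClickHouse code):

// Sketch: the predicate runs at most once per dictionary entry (e.g. 100
// times), not once per row (e.g. 8192 times); the per-row work is a lookup
// in a small bitmap indexed by the byte-sized dictionary index.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

std::vector<size_t> filterRows(
    const std::vector<std::string> & dictionary,
    const std::vector<uint8_t> & indexes,
    const std::function<bool(const std::string &)> & predicate)
{
    std::vector<bool> matches(dictionary.size());
    for (size_t i = 0; i < dictionary.size(); ++i)
        matches[i] = predicate(dictionary[i]);

    std::vector<size_t> matchingRows;
    for (size_t row = 0; row < indexes.size(); ++row)
        if (matches[indexes[row]])
            matchingRows.push_back(row);
    return matchingRows;
}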
See a simple program for evaluating the potential win when converting string column blocks into dictionary-indexed int column blocks: https://play.golang.org/p/zLK9z6FMOK .
It generates count strings containing uniqCount unique values with an average string length of avgLen. Then it constructs three blocks from these strings and reports their sizes: plain; noDict (standard gzip is used for the compression, results for zstd are similar); dict. Here are a few sample outputs from the program:
count=8192, uniqCount=100, avgLen=60
plain size: 530653
noDict size: 27748, rate: 19.12
dict size: 12518, rate: 42.39
Improvement 221.665%
count=8192, uniqCount=256, avgLen=15
plain size: 152949
noDict size: 22609, rate: 6.76
dict size: 12105, rate: 12.64
Improvement 186.774%
The following case emulates enum usage via String columns, where enum contains 5 unique values per block with the average size of 10 bytes:
count=8192, uniqCount=5, avgLen=10
plain size: 108010
noDict size: 5226, rate: 20.67
dict size: 3085, rate: 35.01
Improvement 169.400%
A single unique value in the block:
count=8192, uniqCount=1, avgLen=10
plain size: 90112
noDict size: 220, rate: 409.60
dict size: 54, rate: 1668.74
Improvement 407.407%
Every case shows a clear win in compression ratio when the number of unique strings in the block is less than 256.
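For reference, the comparison the program reports can be sketched as follows (this is not the linked Go program; zlib's compress2() stands in for gzip, and the dictionary/index layout from the sketch above is assumed):

// Sketch of the size comparison: "noDict" compresses the concatenated rows
// as-is, "dict" compresses the concatenated dictionary followed by the
// one-byte-per-row indexes.
#include <cstdint>
#include <string>
#include <vector>
#include <zlib.h>

size_t compressedSize(const std::string & data)
{
    uLongf dstLen = compressBound(data.size());
    std::vector<Bytef> dst(dstLen);
    compress2(dst.data(), &dstLen,
              reinterpret_cast<const Bytef *>(data.data()), data.size(),
              Z_DEFAULT_COMPRESSION);
    return dstLen;
}

size_t noDictSize(const std::vector<std::string> & rows)
{
    std::string plain;
    for (const auto & r : rows)
        plain += r;
    return compressedSize(plain);
}

size_t dictSize(const std::vector<std::string> & dictionary, const std::vector<uint8_t> & indexes)
{
    std::string buf;
    for (const auto & d : dictionary)
        buf += d;
    buf.append(indexes.begin(), indexes.end());
    return compressedSize(buf);
}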
Sorry, but it sounds like reinventing the wheel. If you don't know the properties of the strings - cardinality, entropy, etc. - the best universal way of making strings smaller is called "compression" :) Most compression algorithms use an approach similar to what you describe. ClickHouse uses LZ4 (or optionally zstd), which in the common case deals well with compressing such strings. You can adjust the compression settings if the defaults don't satisfy you.
ClickHouse can't know the properties of your data ahead of time: maybe after a million identical strings you will start putting random data there. If you, as a database administrator, know the properties of your data, you can use an enum or a dict BEFORE putting the data into ClickHouse, and you will get the effect you expect.
I've just checked how ClickHouse compresses such strings - in my case I have a message field in one of our tables. That is a String field which currently has 25 different values of length 9-25 chars. In about 50% of rows that field is empty.
The same table also has an enum field status which can have 3 different values and can be used as a reference.
For a dataset of about 700 mln rows I have the following result:
SELECT
    name,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    (sum(data_compressed_bytes) * 100) / sum(data_uncompressed_bytes) AS compress_ratio
FROM system.columns
WHERE (table = 'table') AND (name IN ('message', 'status', 'logdate'))
GROUP BY name
ORDER BY sum(data_compressed_bytes) ASC
┌─name────┬─compressed─┬─uncompressed─┬─────compress_ratio─┐
│ status  │ 38.89 MiB  │ 658.34 MiB   │  5.907979847191937 │
│ message │ 141.37 MiB │ 6.07 GiB     │ 2.2743705225173136 │
│ logdate │ 2.10 GiB   │ 2.57 GiB     │  81.77627195011733 │
└─────────┴────────────┴──────────────┴────────────────────┘
So yes, as you could expect, the enum is smaller. But the compression ratio of strings with low cardinality is really good.
And the size of the compressed message column is negligible compared, for example, with a timestamp column (I've added the logdate column stats as a reference; it is a timestamp field).
> Sorry, but it sounds like reinventing the wheel. If you don't know the properties of the strings - cardinality, entropy, etc. - the best universal way of making strings smaller is called "compression" :) Most compression algorithms use an approach similar to what you describe. ClickHouse uses LZ4 (or optionally zstd), which in the common case deals well with compressing such strings. You can adjust the compression settings if the defaults don't satisfy you.

The proposed transformation doesn't substitute for compression - it is applied before compression in order to improve the overall compression ratio. The simulation shows significant gains, which may result in both space savings and performance improvements for SELECT queries.
> ClickHouse can't know the properties of your data ahead of time: maybe after a million identical strings you will start putting random data there.

The proposed transformation is applied independently to each column block, not to the whole column. The case with identical strings followed by random data is handled naturally, since each block is encoded according to its own data properties.
> If you, as a database administrator, know the properties of your data, you can use an enum or a dict BEFORE putting the data into ClickHouse, and you will get the effect you expect.

Enum and dictionary-based columns aren't flexible enough, since they require additional housekeeping from the user - either altering the enum column with new values or adding new entries to the dictionary.
> I've just checked how ClickHouse compresses such strings - in my case I have a message field in one of our tables. That is a String field which currently has 25 different values of length 9-25 chars.

The simulation shows a 2x better compression ratio for the case with 25 unique values of 9-25 chars each, if these values are randomly distributed in a column block of 8192 items - the recommended block size for MergeTree tables:
count=8192, uniqCount=25, avgLen=16
plain size: 161094
noDict size: 11598, rate: 13.89
dict size: 5466, rate: 29.47
Improvement 212.184%
I think it would perhaps be an easier and more general solution to implement support for zstd dictionaries.
The merge phase could train a dictionary for each column and part, which would then be used for compression and decompression. See the API: https://github.com/facebook/zstd/blob/dev/lib/dictBuilder/zdict.h#L54
Decompression using dictionaries should be faster as well: https://github.com/facebook/zstd#the-case-for-small-data-compression
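For illustration, a rough sketch of that idea with zstd's public C API (error handling trimmed; the 64 KiB dictionary capacity and compression level 3 are arbitrary placeholders, not suggested defaults):

// Rough sketch: train one dictionary per column and part from its block
// samples (ZDICT), then reuse it to compress every block of that part.
#include <string>
#include <vector>
#include <zdict.h>
#include <zstd.h>

std::vector<char> trainPartDictionary(const std::vector<std::string> & blockSamples)
{
    std::string samples;             // all samples concatenated back to back
    std::vector<size_t> sampleSizes; // size of each sample
    for (const auto & block : blockSamples)
    {
        samples += block;
        sampleSizes.push_back(block.size());
    }

    std::vector<char> dict(64 * 1024); // dictionary capacity (placeholder)
    size_t dictSize = ZDICT_trainFromBuffer(
        dict.data(), dict.size(),
        samples.data(), sampleSizes.data(),
        static_cast<unsigned>(sampleSizes.size()));
    if (ZDICT_isError(dictSize))
        return {};
    dict.resize(dictSize);
    return dict;
}

std::vector<char> compressBlockWithDict(const std::string & block, const std::vector<char> & dict)
{
    ZSTD_CCtx * cctx = ZSTD_createCCtx();
    std::vector<char> dst(ZSTD_compressBound(block.size()));
    size_t compressed = ZSTD_compress_usingDict(
        cctx, dst.data(), dst.size(), block.data(), block.size(),
        dict.data(), dict.size(), /* compressionLevel */ 3);
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(compressed))
        return {};
    dst.resize(compressed);
    return dst;
}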
Separate dictionary for each block? Wouldn't it take twice the time?
What about an auto-expanding enum that adds new values automatically when they are encountered? MapD does that.
FYI, LocustDB implemented dictionary encoding for string column blocks with low cardinality. This gave a 50x speedup compared to ClickHouse on the following query over the New York taxi rides dataset:
SELECT
pickup_ntaname,
count()
FROM trips
GROUP BY pickup_ntaname
See dictionary encoding results at https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/2/ for details.
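The reason dictionary encoding helps such a query is that GROUP BY can aggregate over small integer codes instead of hashing strings; a sketch under the block layout assumed in the earlier examples (illustrative only, not LocustDB or ClickHouse code):

// Sketch: counting rows per dictionary code needs only an array of counters
// indexed by the one-byte code - no string hashing or comparisons per row.
#include <cstdint>
#include <vector>

std::vector<uint64_t> countByCode(const std::vector<uint8_t> & indexes, size_t dictionarySize)
{
    std::vector<uint64_t> counts(dictionarySize, 0);
    for (uint8_t code : indexes)
        ++counts[code];
    return counts; // counts[i] corresponds to dictionary[i]
}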
@KochetovNicolai is doing this task in the https://github.com/yandex/ClickHouse/tree/data-type-with-dictionary branch.
Hi @KochetovNicolai, do you have an estimate of when this feature is going to be ready?
Is the branch already testable?
PR #2830
@KochetovNicolai, @alexey-milovidov, as I understand it, this feature has already been implemented.
Solved.
My compliments to @valyala for raising this issue and of course to @alexey-milovidov, @KochetovNicolai and the others who contributed to its implementation. Although this is solved, I have opened a discussion on its performance and a possibly more efficient alternative implementation using bit-stuffed pointers (#4796) and binary operations. At a higher level, I am investigating the possibility of avoiding secondary indexes by maintaining 'symbol tables', i.e. dictionaries that can be linked to associations/tuples (see also #3573) using 2D/3D dimension keys related to table, column and row. That said, I also foresee the involvement and use of ClickHouse dictionaries in such an implementation as an upper layer...