Hi,
At ROOT (https://github.com/root-project/) we are trying to use ZSTD to compress buckets of data that are potentially similar to each other. Hence, we thought about creating a dictionary from only the first bucket and re-using it for the rest. This way we would need to save the dictionary only once.
I would like to know whether there is a way to generate a dictionary from a given piece of data. I know that we can use the training feature, but with only one sample I guess that is rather inefficient; we would like to compress and retrieve the dictionary in the same pass, and then re-use this dictionary for the following buckets.
Something like:
ZSTD_compressAndReturnDict(..., &dictBuffer)
In addition, I would like to ask whether you have some estimate of the time required for training compared to compressing, as a function of file size. Something like: for the same file, training usually takes around 0.3 times as long as compressing. If the overhead is not too high, this could be interesting in cases where we just want to focus on a high compression ratio and the buckets are small.
Thanks!
Hi, @fylux.
Zstd accepts both structured and unstructured dictionaries. Unstructured dictionaries are simply a bytestream that zstd will try to find matches in when compressing. Structured dictionaries include three things: a header, some entropy statistics, and unstructured data. See the format specification for more.
Dictionary training is a reduction process that takes many samples, selects the substrings that show up in many samples, collects them into a single blob, and adds the metadata and statistics. This process doesn't really make sense in the context of a single sample. But it's also unnecessary, unless you need the metadata. You can simply pass your first sample (uncompressed) to ZSTD_createCDict().
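For illustration, a minimal sketch of that approach in C, assuming `firstBucket` holds the raw uncompressed content of your first bucket (the helper name is made up):

```c
#include <zstd.h>

/* Hypothetical helper: build a digested dictionary directly from the raw,
 * uncompressed first bucket. No training pass is involved; the content is
 * used as an unstructured dictionary (no header, no entropy statistics). */
ZSTD_CDict* dictFromFirstBucket(const void* firstBucket, size_t firstBucketSize,
                                int compressionLevel)
{
    return ZSTD_createCDict(firstBucket, firstBucketSize, compressionLevel);
}
```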
Hi @felixhandte, thank you very much for taking the time to answer.
So I am not sure I understood what ZSTD_createCDict() does. Given a chunk of data, which of these does it do:
You mentioned that using training for one sample is unnecessary, and it definitely looks like that, but it also looks like the only function that can generate a dictionary from a given piece of data.
Finally, regarding efficiency, do you have an idea of how long training takes?
Thanks!
> we thought about creating a dictionary from only the first bucket and re-using it for the rest
This idea has already been tested, and indeed, it brings some compression benefits.
A big advantage is that there is virtually no runtime cost, because ___there is no training___ : just re-use the first bucket as a starting point for compression and decompression. This property is quite important for an in-line mechanism.
The disadvantage is that the first bucket must really be a good sample, representative of future samples. Also, it won't be able to store any statistics, which could have saved ~100 bytes per block. Not huge, but not negligible on tiny data either.
Anyway, while it's not as powerful as a "real" dictionary, it's generally positive, and it is a lot faster and simpler to set up, so definitely something worth attempting.
How to do that?
As mentioned by @felixhandte , just ZSTD_createCDict() on the content of this first bucket. It will create an unstructured dictionary, that can be re-used any number of times when compressing later buckets, using ZSTD_compress_usingCDict().
The same process will have to be followed on the decompression side, using ZSTD_createDDict() on the first _uncompressed_ bucket, and then ZSTD_decompress_usingDDict().
And that's really all there is.
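For concreteness, here is a minimal sketch of that round trip in C, with the first uncompressed bucket used as the dictionary on both sides (names, the compression level, and the simplified error handling are just illustrative):

```c
#include <assert.h>
#include <zstd.h>

/* Compress one bucket, re-using the (uncompressed) first bucket as dictionary.
 * In real code, create the CDict once and re-use it for every later bucket. */
size_t compressBucket(void* dst, size_t dstCapacity,
                      const void* bucket, size_t bucketSize,
                      const void* firstBucket, size_t firstBucketSize, int level)
{
    ZSTD_CCtx*  cctx  = ZSTD_createCCtx();
    ZSTD_CDict* cdict = ZSTD_createCDict(firstBucket, firstBucketSize, level);
    size_t const cSize = ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                                  bucket, bucketSize, cdict);
    assert(!ZSTD_isError(cSize));
    ZSTD_freeCDict(cdict);
    ZSTD_freeCCtx(cctx);
    return cSize;
}

/* Decompress a bucket with a DDict built from the same uncompressed first bucket. */
size_t decompressBucket(void* dst, size_t dstCapacity,
                        const void* cBucket, size_t cBucketSize,
                        const void* firstBucket, size_t firstBucketSize)
{
    ZSTD_DCtx*  dctx  = ZSTD_createDCtx();
    ZSTD_DDict* ddict = ZSTD_createDDict(firstBucket, firstBucketSize);
    size_t const dSize = ZSTD_decompress_usingDDict(dctx, dst, dstCapacity,
                                                    cBucket, cBucketSize, ddict);
    assert(!ZSTD_isError(dSize));
    ZSTD_freeDDict(ddict);
    ZSTD_freeDCtx(dctx);
    return dSize;
}
```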
Thank you very much for the clarification @Cyan4973
I have been analyzing the behavior of ZSTD_compress_usingCDict() and in some use cases we have been able to improve the compression ratio when the first bucket is representative and the dictionary represents a significant overhead.
However, I have been unpleasantly surprised by the speed of ZSTD_compress_usingCDict(). Since the dictionary is already built, I expected it to be faster than ZSTD_compress(), but I tried it on two files and reusing the dictionary was between 1.5 and 3 times slower.
Any idea why it is slower?
Also, a simple question: how can I know the size of the dictionary? Is ZSTD_sizeof_CDict() fine, or does it report the allocated space?
Thanks!
Tuning the performance with dictionaries is tricky. There's a lot going on under the covers to guess how best to compress your data. Can you tell me a little more about your specific compression setup? To start with:
@felixhandte
I would like to mention that, as is obvious, the time it takes is partially related to the compression ratio: in those cases where the compression ratio with the dictionary is much better, compression is faster, since there are far fewer bytes to write.
When using a dictionary for compression, there are two opposing impacts on speed:
The first impact is present immediately, and mostly depends on the dictionary's effectiveness. The second impact only starts to become noticeable at level 5+, because these levels authorize multiple search steps, in increasing numbers with compression level.
Since you mention using level 5, that's likely what you are experiencing.
But @Cyan4973, on average, for the same compression ratio, shouldn't dictionary compression be faster, since it does not need to create the dictionary at the same time as it compresses?
I will check what happens at lower compression levels.
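For reference, this is roughly the kind of per-level comparison I plan to run (a rough sketch; the helper name, timing method, and lack of error checks are just for illustration):

```c
#include <stdio.h>
#include <time.h>
#include <zstd.h>

/* Compare plain compression vs. dictionary compression of one bucket at a
 * given level; `dict` holds the uncompressed first bucket. Real code should
 * also check the return values with ZSTD_isError(). */
void compareLevel(const void* bucket, size_t bucketSize,
                  const void* dict, size_t dictSize,
                  void* dst, size_t dstCapacity, int level)
{
    ZSTD_CCtx*  cctx  = ZSTD_createCCtx();
    ZSTD_CDict* cdict = ZSTD_createCDict(dict, dictSize, level);

    clock_t t0 = clock();
    size_t plain = ZSTD_compress(dst, dstCapacity, bucket, bucketSize, level);
    clock_t t1 = clock();
    size_t withDict = ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                               bucket, bucketSize, cdict);
    clock_t t2 = clock();

    printf("level %d: plain %zu bytes (%ld ticks), dict %zu bytes (%ld ticks)\n",
           level, plain, (long)(t1 - t0), withDict, (long)(t2 - t1));

    ZSTD_freeCDict(cdict);
    ZSTD_freeCCtx(cctx);
}
```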
Looking back at the scenario presented,
the striking point is that chunk sizes can vary a lot.
This generally means that, although there are some similarities, successive chunks are not subtle variations of each other.
As a consequence, using the first chunk as a dictionary for the others can lead to very different outcomes, since this first chunk can be small or very large.
One piece of good news is that @senhuang42 is working on a patch (#1824) which will reduce the risk of a bad compression ratio when using a small dictionary to compress a large file.
This patch will be present in the next release. It should indirectly improve performance for such scenarios.