Hi,
At ROOT (https://github.com/root-project/) we are trying to use ZSTD to compress buckets of data that are potentially similar to each other. Hence, we thought about creating a dictionary from only the first bucket and re-using it for the rest. This way we would need to save the dictionary only once.
I would like to know whether there is a way to generate a dictionary from a given piece of data. I know that we can use the training feature, but with only one sample I guess that is rather inefficient; we would like to compress and retrieve the dictionary in the same pass, and then re-use this dictionary for the following buckets.
Something like:
ZSTD_compressAndReturnDict(..., &dictBuffer)
In addition, I would like to ask whether you have some estimate of the time required for training compared to compressing, as a function of file size. Something like: for the same file, training usually takes around 0.3 times as long as compressing. If the overhead is not too high, this could be interesting in cases where we just want to focus on a high compression ratio and the buckets are small.
Thanks!
Hi, @fylux.
Zstd accepts both structured and unstructured dictionaries. Unstructured dictionaries are simply a bytestream that zstd will try to find matches in when compressing. Structured dictionaries include three things: a header, some entropy statistics, and unstructured data. See the format specification for more.
Dictionary training is a reduction process that takes many samples, selects the substrings that show up in many samples, collects them into a single blob, and adds the metadata and statistics. This process doesn't really make sense in the context of a single sample. But it's also unnecessary, unless you need the metadata. You can simply pass your first sample (uncompressed) to ZSTD_createCDict().
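For illustration, a minimal sketch of that approach in C, assuming `firstBucket` holds the raw uncompressed content of your first bucket (the helper name is made up):

```c
#include <zstd.h>

/* Hypothetical helper: build a digested dictionary directly from the raw,
 * uncompressed first bucket. No training pass is involved; the content is
 * used as an unstructured dictionary (no header, no entropy statistics). */
ZSTD_CDict* dictFromFirstBucket(const void* firstBucket, size_t firstBucketSize,
                                int compressionLevel)
{
    return ZSTD_createCDict(firstBucket, firstBucketSize, compressionLevel);
}
```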
Hi @felixhandte, thank you very much for taking the time to answer.
So I am not sure I understood what ZSTD_createCDict() does. Given a chunk of data, which of these does it do:
You mentioned that using training for one sample is unnecessary, and it definitely looks like that, but it also looks like the only function that can generate a dictionary from a given piece of data.
Finally, regarding efficiency, do you have an idea of how long training takes?
Thanks!
> we thought about creating a dictionary from only the first bucket and re-using it for the rest
This idea has already been tested, and indeed, it brings some compression benefits.
A big advantage is that there is virtually no runtime cost, because ___there is no training___ : just re-use the first bucket as a starting point for compression and decompression. This property is quite important for an in-line mechanism.
The disadvantage is that the first bucket must really be a good sample, representative of future samples. Also, it won't be able to store any statistics, which could have saved ~100 bytes per block. Not huge, but not negligible on tiny data either.
Anyway, while it's not as powerful as a "real" dictionary, it's generally positive, and it is a lot faster and simpler to set up, so definitely something worth attempting.
How to do that?
As mentioned by @felixhandte , just ZSTD_createCDict() on the content of this first bucket. It will create an unstructured dictionary, that can be re-used any number of times when compressing later buckets, using ZSTD_compress_usingCDict().
The same process will have to be followed on the decompression side, using ZSTD_createDDict() on the first _uncompressed_ bucket, and then ZSTD_decompress_usingDDict().
And that's really all there is.
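For concreteness, here is a minimal sketch of that round trip in C, with the first uncompressed bucket used as the dictionary on both sides (names, the compression level, and the simplified error handling are just illustrative):

```c
#include <assert.h>
#include <zstd.h>

/* Compress one bucket, re-using the (uncompressed) first bucket as dictionary.
 * In real code, create the CDict once and re-use it for every later bucket. */
size_t compressBucket(void* dst, size_t dstCapacity,
                      const void* bucket, size_t bucketSize,
                      const void* firstBucket, size_t firstBucketSize, int level)
{
    ZSTD_CCtx*  cctx  = ZSTD_createCCtx();
    ZSTD_CDict* cdict = ZSTD_createCDict(firstBucket, firstBucketSize, level);
    size_t const cSize = ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                                  bucket, bucketSize, cdict);
    assert(!ZSTD_isError(cSize));
    ZSTD_freeCDict(cdict);
    ZSTD_freeCCtx(cctx);
    return cSize;
}

/* Decompress a bucket with a DDict built from the same uncompressed first bucket. */
size_t decompressBucket(void* dst, size_t dstCapacity,
                        const void* cBucket, size_t cBucketSize,
                        const void* firstBucket, size_t firstBucketSize)
{
    ZSTD_DCtx*  dctx  = ZSTD_createDCtx();
    ZSTD_DDict* ddict = ZSTD_createDDict(firstBucket, firstBucketSize);
    size_t const dSize = ZSTD_decompress_usingDDict(dctx, dst, dstCapacity,
                                                    cBucket, cBucketSize, ddict);
    assert(!ZSTD_isError(dSize));
    ZSTD_freeDDict(ddict);
    ZSTD_freeDCtx(dctx);
    return dSize;
}
```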
Thank you very much for the clarification @Cyan4973
I have been analyzing the behavior of ZSTD_compress_usingCDict() and in some use cases we have been able to improve the compression ratio when the first bucket is representative and the dictionary represents a significant overhead.
However, I have been unpleasantly surprised by the speed of ZSTD_compress_usingCDict(). Since the dictionary is already built, I expected it to be faster than ZSTD_compress(), but I tried it on two files and reusing the dictionary was between 1.5 and 3 times slower.
Any idea why it is slower?
Also, a simple question: how can I know the size of the dictionary? Is ZSTD_sizeof_CDict() fine, or does it report the allocated space?
Thanks!
Tuning the performance with dictionaries is tricky. There's a lot going on under the covers to guess how best to compress your data. Can you tell me a little more about your specific compression setup? To start with:
@felixhandte
I would like to mention that, as is obvious, the time it takes is partially related to the compression ratio: in those cases where the compression ratio with the dictionary is much better, compression is faster, since there are far fewer bytes to write.
When using a dictionary for compression, there are two opposing impacts on speed:
The first impact is present immediately, and mostly depends on the dictionary's effectiveness. The second impact only starts to become noticeable at level 5+, because these levels authorize multiple search steps, in increasing numbers with compression level.
Since you mention using level 5, that's likely what you are experiencing.
But @Cyan4973, on average, for the same compression ratio, shouldn't dictionary compression be faster, since it does not need to create the dictionary at the same time as it compresses?
I will check what happens at lower compression levels.
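For reference, this is roughly the kind of per-level comparison I plan to run (a rough sketch; the helper name, timing method, and lack of error checks are just for illustration):

```c
#include <stdio.h>
#include <time.h>
#include <zstd.h>

/* Compare plain compression vs. dictionary compression of one bucket at a
 * given level; `dict` holds the uncompressed first bucket. Real code should
 * also check the return values with ZSTD_isError(). */
void compareLevel(const void* bucket, size_t bucketSize,
                  const void* dict, size_t dictSize,
                  void* dst, size_t dstCapacity, int level)
{
    ZSTD_CCtx*  cctx  = ZSTD_createCCtx();
    ZSTD_CDict* cdict = ZSTD_createCDict(dict, dictSize, level);

    clock_t t0 = clock();
    size_t plain = ZSTD_compress(dst, dstCapacity, bucket, bucketSize, level);
    clock_t t1 = clock();
    size_t withDict = ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                               bucket, bucketSize, cdict);
    clock_t t2 = clock();

    printf("level %d: plain %zu bytes (%ld ticks), dict %zu bytes (%ld ticks)\n",
           level, plain, (long)(t1 - t0), withDict, (long)(t2 - t1));

    ZSTD_freeCDict(cdict);
    ZSTD_freeCCtx(cctx);
}
```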
Looking back at the scenario presented,
the striking point is that chunk sizes can vary a lot.
This generally means that, although there are some similarities, successive chunks are not subtle variations of each other.
As a consequence, using the first chunk as a dictionary for the others can lead to very different outcomes, since this first chunk can be small or very large.
One piece of good news is that @senhuang42 is working on a patch (#1824) which will reduce the risk of a bad compression ratio when using a small dictionary to compress a large file.
This patch will be present in the next release. It should indirectly improve performance for such scenarios.