Zstd: Why dictionary compression can make the ratio so much worse than without one

Created on 26 Jul 2019 · 4 comments · Source: facebook/zstd

Compressing an XML file of >10 MB with a dictionary trained from dozens of XMLs (the samples include the big XML) results in a compression ratio of about 26%, while compressing it without a dictionary gets a ratio below 10%, both at the default compression level.

I read that dictionary compression works best with many small files, but I'm trying to understand why dictionary compression can make the ratio so much worse than compressing without one. A dictionary is basically a table prepended to the corpus for offsetting into when compressing, right?

It would be great to have some insight into how Zstd uses dictionaries. I'd hope that dictionaries do not hurt compression even in cases where they do not help.
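
For reference, a minimal sketch of how such a comparison might be run against the C API (buffer names and file loading are illustrative, not taken from the original report):

```c
/* Minimal sketch: compare plain vs. dictionary compression of one buffer.
 * `src`/`srcSize` hold the XML content and `dict`/`dictSize` hold a trained
 * dictionary; both are illustrative placeholders. */
#include <stdlib.h>
#include <zstd.h>

static double ratio(const void* src, size_t srcSize,
                    const void* dict, size_t dictSize)
{
    size_t const bound = ZSTD_compressBound(srcSize);
    void* const dst = malloc(bound);
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t csize = 0;

    if (dst != NULL && cctx != NULL) {
        if (dictSize > 0)
            csize = ZSTD_compress_usingDict(cctx, dst, bound, src, srcSize,
                                            dict, dictSize, ZSTD_CLEVEL_DEFAULT);
        else
            csize = ZSTD_compressCCtx(cctx, dst, bound, src, srcSize,
                                      ZSTD_CLEVEL_DEFAULT);
    }
    ZSTD_freeCCtx(cctx);
    free(dst);
    /* Compressed size divided by original size: smaller is better. */
    return (csize == 0 || ZSTD_isError(csize)) ? -1.0
                                               : (double)csize / (double)srcSize;
}
```

Calling `ratio(src, srcSize, NULL, 0)` versus `ratio(src, srcSize, dict, dictSize)` reproduces the plain vs. dictionary comparison described above.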

question

All 4 comments

Indeed, dictionaries are mostly useful for small files.
The reason is, they typically save a few KB, which is negligible for a large multi-MB file, but is _a lot_ for small files of a few KB.

_In theory_, dictionary compression should not be detrimental for large files either. After all, it's some "additional data" that can simply be discarded if deemed not useful.

In practice though, the algorithm uses some decision shortcuts, and kind of "trusts" that the dictionary is being used for a "suitable" use case: small data, similar to the training stage. When that's not the case, it makes bad choices.

We could improve the situation, especially as we have new capabilities now, allowing the algorithm to make more educated choices. It's not trivial though, and still needs to be done.
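
For context, the "suitable" use case looks roughly like the sketch below: one dictionary digested once and reused across many small, similar documents (the `Doc` type, `docs` array and output handling are illustrative placeholders, not part of the original discussion):

```c
#include <stddef.h>
#include <zstd.h>

typedef struct { const void* ptr; size_t size; } Doc;

void compress_many_small_docs(const void* dict, size_t dictSize,
                              const Doc* docs, size_t nbDocs,
                              void* dst, size_t dstCapacity)
{
    /* Digest the dictionary once; its setup cost is amortized over all docs. */
    ZSTD_CDict* const cdict = ZSTD_createCDict(dict, dictSize, ZSTD_CLEVEL_DEFAULT);
    ZSTD_CCtx*  const cctx  = ZSTD_createCCtx();

    for (size_t i = 0; i < nbDocs; i++) {
        size_t const csize = ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                                      docs[i].ptr, docs[i].size,
                                                      cdict);
        (void)csize;  /* store the result or check ZSTD_isError(csize) as needed */
    }

    ZSTD_freeCCtx(cctx);
    ZSTD_freeCDict(cdict);
}
```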

Thanks @Cyan4973, could you give a concrete example of when a "bad choice" can hurt the ratio so much? Is there a big difference depending on whether we offset into the dictionary or not?

And it sounds like the improvement will not be done any time soon?

There are multiple sources. An important one is the size of the structures used for search operations during compression.

Whenever possible, we try to "adapt" the amount of resources needed depending on compression level and input size: it doesn't make sense to reserve and initialize many MB of memory to compress a few KB. But on the other hand, using a few KB of resources to compress MB of input will seriously limit search capabilities, translating into a worse compression ratio. So it's a balancing act.
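
This sizing can be observed directly. A small sketch using the experimental `ZSTD_getCParams()` helper (it requires `ZSTD_STATIC_LINKING_ONLY`; the specific sizes below are purely illustrative) prints the search-structure parameters zstd would pick at the same level for different estimated input sizes:

```c
#define ZSTD_STATIC_LINKING_ONLY
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Same level and dictionary size, different estimated input sizes. */
    ZSTD_compressionParameters const small =
        ZSTD_getCParams(ZSTD_CLEVEL_DEFAULT, 10 * 1024, 100 * 1024);   /* ~10 KB input */
    ZSTD_compressionParameters const large =
        ZSTD_getCParams(ZSTD_CLEVEL_DEFAULT, 10u << 20, 100 * 1024);   /* ~10 MB input */

    printf("small input: windowLog=%u hashLog=%u chainLog=%u\n",
           small.windowLog, small.hashLog, small.chainLog);
    printf("large input: windowLog=%u hashLog=%u chainLog=%u\n",
           large.windowLog, large.hashLog, large.chainLog);
    return 0;
}
```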

In "normal" compression, the heuristic err towards "large size", because it helps compression ratio.
In "dictionary compression", the heuristic err towards "small size", because it helps speed and memory, and it's supposed to be the target scenario.

Now, we can adapt and correct initial decisions during later stages, but sometimes we don't have the information (streaming mode with no srcSize hint), sometimes we don't have the CPU budget (fast modes), and sometimes flexibility is pinned down by choices made in the early stages of the decision tree (copy-dictionary-tables mode, for example).
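
For reference, the srcSize hint mentioned above can be supplied explicitly when streaming. A minimal sketch (buffer names are illustrative, the call assumes `out` is large enough for a one-shot flush, and whether the hint actually changes structure sizing depends on the version and mode, as described above):

```c
#include <zstd.h>

size_t compress_with_hint(ZSTD_CCtx* cctx,
                          const void* dict, size_t dictSize,
                          ZSTD_outBuffer* out, ZSTD_inBuffer* in,
                          unsigned long long totalSrcSize)
{
    ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);
    ZSTD_CCtx_loadDictionary(cctx, dict, dictSize);
    /* Without this hint, a streaming session cannot know the total input
     * size when it picks its internal structure sizes. */
    ZSTD_CCtx_setPledgedSrcSize(cctx, totalSrcSize);
    return ZSTD_compressStream2(cctx, out, in, ZSTD_e_end);
}
```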

So it's pretty complicated, but that doesn't mean there is no solution. It's just complex enough that we need some hard use case to make it worth the investment, which goes beyond just adding some code: it's also about adding complexity, maintaining it, and the hidden cost of weighing down future evolutions. Here also, a balancing act.

The next release will include a patch by @senhuang42, which limits the impact of such issues, where using a small dictionary to compress a large file results in a worsened compression ratio:

#1824.
