Zstd: would it be possible to add a silesia level 20 in results.csv ?

Created on 30 Dec 2018 · 9 comments · Source: facebook/zstd

level 20 compresses 4% better than level 19 in my similar use case ... maybe a level 19.5 should exist.


The file results.csv from the regression test suite is not intended to show off the performance of Zstandard.

It's merely a reference point, so that we can automatically trap unintended compression ratio changes due to future code modifications.

We understand that level 20 is supposed to compress more than level 19, though by how much depends entirely on the source. Level 20's most important advantage is the increase in window size to 32 MB, which only benefits large sources. Small files (< 8 MB) are expected to benefit too, but only a little, thanks to more thorough searches and parsing.
How would you describe your use case featuring a 4% difference, which is definitely significant?

It's a Windows 10 Python distribution: 68.9 K files, 2.19 GB when uncompressed.

On current hardware, each step takes roughly the same user time (about 10 minutes):

  • download,
  • uncompress,
  • or compress at zstd level 19.

2.19 GB into 68.9 K files => the average file is ~32 KB.
But that's only the average; it says nothing about the distribution.
Maybe a few files are large, while others are tiny.

Assuming that none of these files is larger than 8 MB,
I would expect level 19 to provide pretty good performance, close to the maximum.
Did you test with the latest release, zstd v1.3.8?

There are quite a few big objects: 26 large DLLs over 8 MB each, making 950 MB in total.

Yes, zstd-1.3.8 did improve level 19 quite a bit:

version | level 19 | level 20 | level 21 | level 22
---|---|---|---|---
zstd-1.3.7 | 529,889 KB | 507,810 KB | 496,892 KB |
zstd-1.3.8 | 527,981 KB | 507,196 KB | | 487,088 KB

960 MB into 26 files => average DLL size is ~36 MB.

Then the difference in compression ratio definitely comes from these files.
If you could compress your batch of files without them, I would expect level 19 to achieve a compression ratio close to level 20's.

I don't see any other solution than moving to level 20+.
8 MB is a hard limit for level 19.
We want to guarantee this maximum memory usage for all non --ultra compression levels.

OK. Besides the slowness and huge memory consumption on the compression side, is there any drawback on the decompression side if I use level 20+? I didn't notice big memory consumption during decompression, although I expected some after reading Tino's README: https://github.com/mcmilk/7-Zip-zstd

Maybe you improved this a lot since zstd-1.2.0?

The decompression will need more memory when using the streaming mode (which is the default mode for the CLI).

Decompression memory usage is unaffected when using direct buffer-to-buffer decompression, which is available through the API (such as ZSTD_decompress()).

Another side effect is that decompression speed can be slower, due to higher cache misses,
but this part is pretty well contained due to dedicated prefetching code,
so for the most part, it should be barely noticeable (< 10%).

Thank you for your time, Yann. Best wishes for levels 20 19 !

Thanks! You're welcome!
