Zstd: pzstd compression ratios vs zstd

Created on 20 Jan 2017 · 34 comments · Source: facebook/zstd

We've run some benchmarks on a number of our internal backups and noticed that while pzstd is beautifully fast, it seems to produce worse results:

time zstd -11 X -o X.11
X : 13.37%   (123965807088 => 16572036784 bytes, X.11) 

real    49m32.875s
user    42m6.636s
sys 0m49.172s


time pzstd -11 -p 1 X -o X.11.1thread
X : 13.76%   (123965807088 => 17056707732 bytes, X.11.1thread)

real    42m50.245s
user    40m33.648s
sys 0m44.436s


time pzstd -11 -p 3 X -o X.11.3threads
X : 13.76%   (123965807088 => 17056707732 bytes, X.11.3threads)

real    21m53.584s  <- bottlenecked by the slow hdd
user    58m14.732s
sys 1m0.036s

Is this part of the design or a bug? We also noticed that pzstd -p 1 ran faster than zstd.

question

All 34 comments

That's by design and cannot be fixed/improved.
Also note that "-p" activates additional header information, which also adds some bytes to the resulting files.

The last line should be time pzstd -11 -p -n 3 X -o X.11.3threads I think ...

Thank you. Is the 'penalty' the same on every compression level or does it change with each strategy? How does it behave on the ultra levels?

What does -n do? I cannot find any documentation on that.

Hello,

pzstd -h gives a short summary on the options, there is currently no manpage for it.

These two options are added to the default options of zstd:
-n/--num-threads #: Number of threads to spawn
-p/--pzstd-headers: Write pzstd headers to enable parallel decompression

The 'penalty' comes from the additional headers, which are used for each chunk of data. The chunk size depends on the compression level / window size. The extra header size is 12 bytes per chunk...

Thanks for that explanation. Is it possible to disable the additional headers, if I can live without parallel decompression?

I'm getting a very different output on pzstd -h, which version are you on?

pzstd -V
PZSTD version: 1.1.3.
pzstd -h
Usage:
  pzstd [args] [FILE(s)]
Parallel ZSTD options:
  -p, --processes   #    : number of threads to use for (de)compression (default:8)

I thought I was running version "1.1.2", but it was "1.1.0", sorry about that.
Now I have the same output as you... I didn't think these parameters would change after version 1.0 was released... my fault :(
The option to disable the headers is gone... I can't find it in the current git source of pzstd.

When you are doing some tests with threading and so on, you may also want to check zstdmt... it will do the same as pzstd, but on my system it's always a bit faster. Maybe the same is true on your system.

Ah that's too bad.

Lastly, is there a way to increase the chunk size (to something gigantic like --ultra uses)? That should decrease the overall 'penalty', correct?

Not in pzstd, but zstdmt has an option for that:
$ zstdmt -h
-t N set number of (de)compression threads (def: #cores)
-b N set input chunksize to N KiB (default: auto)

KiB is the wrong unit for zstd, I will change that to MiB ...
And yes, this should make the files smaller...

I'll look into that.
In the meantime, any chance of introducing a --chunksize via pzstd and bringing back --pzstd-headers in the future?

Would be no problem, but that's not my decision :/
Maybe Yann will add this as a feature request.

@Cyan4973 Should I open a new issue for a feature request or rename this one?

pzstd and @mcmilk's zstdmt are very similar, and run at very nearly the same speed (it was slower and used more memory in version 1.1.0). The extra header information is 12 bytes per chunk, and the chunk sizes are always 4 times the window size. The window size ranges from 1 << 19 to 1 << 27 for levels 1 to 22, so the added bytes are negligible for large files.

The penalty is because each chunk is independent, so there can't be matches across chunks. This is the same as zstdmt. I believe that by default zstdmt selects a smaller chunk size, but if you selected a chunk size to be 4 times the window size, the resulting output size should be essentially the same.

Edit: I forgot I hadn't compared the memory usage with the same chunk size, but I believe that it will be similar.
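For a sense of scale, here is a back-of-the-envelope calculation of the 12-byte-per-chunk overhead for the ~124 GB file from the benchmark above, using the window-size endpoints just quoted (the `header_overhead` helper is purely illustrative, not a zstd API):

```python
# Rough estimate of the pzstd header overhead described above:
# 12 extra bytes per chunk, with chunk size = 4 * window size.
# Window sizes are the endpoints quoted above (1 << 19 at level 1,
# 1 << 27 at level 22); intermediate levels fall in between.

FILE_SIZE = 123_965_807_088  # the ~124 GB input from the benchmark
HEADER_BYTES_PER_CHUNK = 12

def header_overhead(file_size, window_size):
    chunk_size = 4 * window_size
    num_chunks = -(-file_size // chunk_size)  # ceiling division
    return num_chunks * HEADER_BYTES_PER_CHUNK

for level, window in ((1, 1 << 19), (22, 1 << 27)):
    overhead = header_overhead(FILE_SIZE, window)
    print(f"level {level:2}: ~{overhead:,} bytes of headers "
          f"({overhead / FILE_SIZE:.6%} of the input)")
```

Even at level 1, the headers add well under a megabyte to a 124 GB input, which is consistent with "negligible for large files" above.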

the resulting output size should be essentially the same.

Do you mean the same between pzstd and zstdmt or between pzstd 1-threaded and zstd?

Between pzstd and zstdmt. There will always be a penalty for multithreaded compression using this method, because it breaks the input up into chunks and compresses each chunk independently in parallel, so it loses any possible matches between each chunk.
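The cross-chunk penalty can be demonstrated with any windowed compressor. The sketch below uses zlib from the Python standard library as a stand-in for zstd (the stdlib has no zstd bindings), compressing a repetitive buffer whole and then in independent chunks:

```python
# Illustration of the chunking penalty described above, using zlib as a
# stand-in for zstd: compressing independent chunks loses matches that
# span chunk boundaries, and each chunk pays its own stream overhead.
import zlib

# Highly repetitive data, so matches span the whole buffer.
data = b"the quick brown fox jumps over the lazy dog. " * 20000

whole = len(zlib.compress(data, 9))

chunk_size = len(data) // 8
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
chunked = sum(len(zlib.compress(c, 9)) for c in chunks)

print(f"whole:   {whole} bytes")
print(f"chunked: {chunked} bytes ({chunked / whole:.2f}x)")
```

The chunked total always comes out larger here; the gap grows as the chunks shrink relative to the compressor's window.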

Adding an option to remove the pzstd headers is possible, but I'm not sure there is a compelling use case where it makes a difference.

I can add a flag to control the chunk size, but the default scales with the compression level, so should be good enough for the majority of use cases.

A chunk size flag would be very useful. If you add it, I'll report back on how / whether our compression ratios with pzstd improved.

An option in the meantime is just to use a higher compression level. This would have a larger chunk size, and also zstd would perform slightly better for each chunk.

So you'd expect the difference in compression ratios between pzstd and zstd to decrease as the compression level and thus the chunk size increases? I'll do some benchmarking on ultra level 22. That'll run for 2 days, so maybe I'll have a number on Monday ;)

I'm not sure how the difference in compression ratios will change as the level increases, but it likely depends on the structure of the data you are compressing. I'm running tests now, and will update this post when they finish in ~20 minutes. But since pzstd buys you extra time, and it might be bottlenecked by IO anyways, you could use pzstd -12 instead of zstd -11 to make up for the lost compression ratio.

Edit: On the file I was testing the difference between the compression ratios was about 0.3% regardless of the compression level.

Thanks. My results won't be in until Monday.
Would you mind testing whether playing with the chunk size has any effect on this?
As I posted initially, I get a difference of 0.4% in the ratios, which means that zstd is 3% smaller/more efficient than pzstd on my data.
Another thing that I noticed is that, as I watch zstd compress, I see that the ratio was getting better and better (i.e., decreasing) until about 10GB data read in, from which point onwards the ratio was only worsening (i.e., increasing). Is there something about huge files (>100GB) that one should be aware of? Would it be more efficient to compress 10 10GB files rather than 1 100GB file?

An increased chunk size should make the difference smaller, but it will also increase memory usage.

AFAIK it is strictly better to compress one 100 GB file with zstd than ten 10 GB files. The compression ratio might be getting worse because the data starting at the 10 GB mark is less compressible.

Ok, results are in (the lvl 22 output is ~32% smaller than zstd lvl 11):

11279522025 X.22.normal_zstd
11492661645 X.22.pzstd

The difference is now 1.9%, which is perfectly fine given the performance win.
If you'd introduce a flag to modify the chunk size, I'd love to do some more testing.

Regarding the compression ratio after 10GB, while I agree that you are most likely correct in suspecting less compressible data, is there any chance it could be something like the dictionary size, or any other metric that might be maxed out after 10GB?

There shouldn't be any degradation of performance after compressing 10 GB. It's always possible that there is a bug that causes a loss of compression ratio, though. If you want to experiment, you can split your data at the point where the compression ratio starts to worsen, and take a second chunk of some size. Then compress each chunk individually and then together as a whole. If the sum of the individually compressed pieces is smaller than the compressed whole, I would be very interested.

If you feel interested, the latest update in the dev branch features a new build target, make zstdmt.

It is similar to pzstd, with the main difference being to provide this functionality in both cli and API format, using C language. The intention is to integrate them both into mainline libzstd and the cli in the near future.

This is still labelled experimental, even though it has endured enough testing now to be expected to work correctly.

Compared to pzstd, the compression ratio takes less of a hit, because the new format uses less header space, and can reference data across sections. Level 22 may even provide (very slightly) better compression than the regular single-threaded version. As you already measured, the difference is fairly small though, so don't expect a "big" impact.

zstdmt is very impressive. On lvl 11, its 13.55% is right in between zstd (13.37%) and pzstd (13.76%):

time zstdmt -11 -T3 X -o X11.T3
X : 13.55%   (123965807088 => 16795988971 bytes, X.11.T3) 

real    21m53.983s
user    54m57.756s
sys 0m54.204s

On lvl 22 however, while being a little bit slower than pzstd (378 minutes vs 407 minutes), its 9.07% outperforms both zstd (9.10%) and pzstd (9.27%) - that is amazing!

Are there any settings that one could tune to improve zstdmt or zstd on lvl11, esp. those that would increase the memory usage?
Also, as I understand it, levels 1 - 19 have about the same decompression speed. What is the expected impact on decompression when using levels 20 - 22?

There are 2 variables which influence the result, though only one is available through command line currently.

The first one controls size of each job.
This is the command-line option -B#. For example, -B50MB will cut the input into independent jobs of 50 MB each.
Each job will be compressed by a dedicated thread.

By default, this value is automatically determined depending on compression level.
There are wide differences. A level 3 job is 4 MB, while a level 22 job is 512 MB by default.
Any input smaller than a single job size will be single threaded.

The second parameter is "overlap factor".
This parameter makes each job borrow data from the previous one, reducing compression losses.
By default, the overlap factor is "windowSize / 8", which captures most of the losses at fast compression levels (<5), but is more "intermediate" at higher compression levels (as you measured at level 11). Level 22 is special: it reloads a full windowSize from previous jobs (128 MB). This is why it can achieve a better compression ratio. But it comes at a CPU cost. Therefore, it's discouraged to set tiny job sizes at level 22, since it would spend most of its time reloading data...
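Putting those figures together (and assuming the 4x-window job-size relation mentioned earlier in the thread, which would put the level 3 window at about 1 MB), the defaults work out as sketched below; `default_overlap` is an illustrative helper, not a zstd API:

```python
# Default zstdmt job size and overlap, per the description above:
# overlap = windowSize / 8 at most levels, but a full window at level 22.
# The 4x-window job-size relation is an assumption carried over from the
# earlier pzstd discussion; windows: ~1 MB (level 3), 128 MB (level 22).

MB = 1024 * 1024

def default_overlap(window_size, level):
    # Level 22 reloads a full window from the previous job;
    # other levels only borrow windowSize / 8.
    return window_size if level == 22 else window_size // 8

for level, window in ((3, 1 * MB), (22, 128 * MB)):
    job_size = 4 * window
    overlap = default_overlap(window, level)
    print(f"level {level:2}: job ~{job_size // MB} MiB, "
          f"overlap {overlap // 1024} KiB")
```

This matches the quoted defaults: a 4 MB job at level 3, a 512 MB job at level 22, and a far larger relative overlap at level 22, which is where its extra CPU cost comes from.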

"overlap factor" is currently not exposed through command line, though it could be in the future if there is a need.

Thanks - so running level 11 with -B512MB should yield a better result, correct?
Would it help level 11 further if one could increase the window size?
Also, would this impact decompression? I would think that might then need more memory, but would it be slower as well (like levels 20 - 22 are)?

Large job sizes tend to help compression ratio a bit, though there are diminishing returns.
Without access to overlap factor, it's the only option to improve ratio of level 11.

Overlap factor is not available on CLI.
It would improve compression ratio, but would also slow down speed.
So it's just another trade-off (for better ratio, it's also possible to increase level to 12 for example).

Decompression speed should remain roughly similar, at all levels and settings.
Decompression remains single-threaded. It's unaware that the data was compressed using multi-threading.

Ok perfect, thank you. I'll play around with job sizes a bit more.
Lastly, and sorry to ask this again, should there be any difference between decompressing data that was compressed at levels 1-19 vs 20-22? I believe I had read somewhere that decompressing lvls 20-22 needs much more memory and runs somewhat slower. Is this still the case after its recent speedup?

Yes, it's still the case.
Multi-threading doesn't have any impact on this.
We have made some progress on decompression speed at levels 20-22, but it's still slower than levels 1-19. Expect a difference of ~20%, though the outcome varies depending on the files, and only a benchmark can measure the exact impact.

Is it fair to use time zstd -d .. -o /dev/null for benchmarking?
If so, I see about 12% less decompression speed:

time zstd -d X.22.T3.zst -o /dev/null
X.22.T3.zst: 123965807088 bytes

real    1m49.732s
user    1m47.072s
sys     0m2.648s


time zstd -d X.11.zstd -o /dev/null
X.11.zstd: 123965807088 bytes

real    1m37.355s
user    1m34.536s
sys     0m2.812s

You can see that the lvl 22 file was smaller (less time spent on sys), but the cpu had a lot more to do, so overall it ran a bit slower.
Are there any 'obvious' improvements for the lvl22 decomp left on the table or can it be assumed that this performance differential will be around for the medium/long term? Any plans regarding multi-threaded decompression?

There are roughly three major factors in decompression speed, as measured in your benchmark.

  1. IO speed. It is possible that decompression runs faster than the OS can load the input file. The way I've gotten around this is to put the file in a RAM backed filesystem.
  2. The back reference distances. Matches further back in the history are more expensive because they are likely not in the cache. There has been work in reducing this overhead, and decompression speed for high compression levels should be improved starting in version 1.1.2.
  3. The compressibility. The mixture of raw literals, Huffman encoded literals, and matches drastically affects the decompression speed.

A 12% decompression speed difference between levels 11 and 22 is "reasonable".
Level 22 has a lot more cache misses to fight than level 11, so with more latency on its plate, it's bound to decompress slower.
That being said, we still have a few ideas to test, hence a future update might reduce this difference a bit more...

Thanks. Both files were fully loaded from RAM, so they were able to decompress at maximum 1-core speed. When retrieving the file from a much slower network backup, lvl 22 outperformed lvl 11 by about 6%.
And just out of curiosity, are there any plans for a multi threaded decompression?

Yes, we do know how to introduce this feature.
That being said, we currently lack a scenario which would benefit from it.
MT decompression is only useful for large files, and such use cases are generally I/O bound.

I agree that it shouldn't have highest priority.

If it's useful, the use case would be decompressing large lvl22-compressed files from a fast SSD into RAM for processing (instead of writing it back to disk). Currently a single core from a top i7 or E3 will not be fast enough to keep up with the data being read from the SSD.

Anyhow, my questions have been answered, so I am closing the ticket. Thanks everyone!
