Zstd: Compression ratios differ between file and stdin

Created on 15 Aug 2019 · 6 Comments · Source: facebook/zstd

The compression ratio for stdin can be worse than for the corresponding file, e.g.

> cat j000 | zstd -14 -f -o a
/*stdin*\            : 16.54%   ( 75885 =>  12549 bytes, a)
> zstd -14 j000
j000                 : 15.51%   ( 75885 =>  11767 bytes, j000.zst)

Is this expected? If so, this should be mentioned in the man page.

Labels: documentation, feature request, question

All 6 comments

This can indeed happen, most notably for small files.

The problem is as follows:

  • When input comes from stdin, zstd has no way to know how much data it will receive. It only discovers the size when the pipe ends.
  • A number of parameters can be optimized based on source size. They affect compression ratio and initialization time. Without this information, zstd is obliged to "bet".
  • For a standard streaming scenario (input coming from stdin), zstd bets that the stream is probably large. When it is not, the bet is lost and some of the parameters are poorly tuned for the actual input. This impacts compression ratio.
  • The exact outcome varies depending on source file, size, and compression level. It is generally within noise level, although for some small files it can become more visible.
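To make the "bet" concrete, here is a minimal Python sketch of size-based parameter selection. It is a toy model, not zstd's code: the size classes (16 KB, 128 KB, 256 KB) are an assumption based on my reading of the libzstd sources and may differ between versions, and the function name `parameter_class` is hypothetical.

```python
from typing import Optional

KB = 1024

# Assumed size classes, smallest first. These thresholds are an
# assumption for illustration, not a documented zstd guarantee.
SIZE_CLASSES = [
    (16 * KB, "<= 16 KB"),
    (128 * KB, "<= 128 KB"),
    (256 * KB, "<= 256 KB"),
]

LARGE = "large (default)"


def parameter_class(src_size: Optional[int]) -> str:
    """Return the parameter class a compressor might pick.

    src_size=None models stdin: the size is unknown, so the
    compressor bets on a large input.
    """
    if src_size is None:
        return LARGE
    for limit, label in SIZE_CLASSES:
        if src_size <= limit:
            return label
    return LARGE


if __name__ == "__main__":
    print("file  (75885 bytes):", parameter_class(75885))
    print("stdin (unknown size):", parameter_class(None))
```

Under these assumptions, the 75885-byte file from the report lands in a small-input class when read as a file, while the same bytes arriving on stdin are treated as a large stream, which is one plausible source of the ratio gap.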

Agreed that this should be mentioned in the documentation.

Thanks for your explanation. Maybe a --size-hint option could help in cases where the difference is significant (like 5-10 %) and matters to the user.

I doubt an additional --size parameter is needed.
Reason:
For a large input stream from stdin, the compression ratio is already optimal, since zstd bets on a large data size.
For small inputs, the difference in compression ratio does not really matter, as it amounts to a few hundred bytes at most. The difference is too small to be significant.

Yes, the absolute difference in bytes for one file is no problem. But the relative difference of 5-10 % can be relevant, e.g. I have some systems with 30 million small zst files and currently I "lost" over 100 GB of SSD space because I called zstd on stdin instead of a file (because it was more convenient in my context).
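For a sense of scale, the numbers from the j000 example at the top of the thread can be extrapolated. This is only a rough sketch: it assumes, hypothetically, that every one of the 30 million files behaves like j000, which real data will not do exactly. The function name `stdin_overhead` is made up for illustration.

```python
def stdin_overhead(stdin_bytes: int, file_bytes: int, n_files: int):
    """Return (extra bytes per file, relative overhead, total extra bytes)."""
    extra = stdin_bytes - file_bytes
    relative = extra / file_bytes
    total = extra * n_files
    return extra, relative, total


if __name__ == "__main__":
    # Compressed sizes reported for j000: 12549 bytes via stdin, 11767 bytes via file.
    extra, relative, total = stdin_overhead(12549, 11767, 30_000_000)
    print(f"extra per file : {extra} bytes")
    print(f"relative       : {relative:.1%}")
    print(f"total          : {total / 1e9:.2f} GB")
```

A few hundred bytes per file is negligible in isolation, but multiplied across tens of millions of files it reaches tens of gigabytes, so the relative overhead is the figure that matters here.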

I understand that, in your context, providing input data through stdin is more convenient.

What about the size of this data? Do you happen to know it nonetheless? Or are you constrained to "roughly guess" it?

With some (justifiable) effort, I know the exact size in bytes as discussed for issue #1726. A rough guess option --stream-size-hint would be nice, too, but I understand that this can come after the --stream-size option.
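When the exact size is known, it can be passed along with the piped data. The sketch below assumes a zstd build that supports `--stream-size=#` (to my knowledge, v1.4.4 and later); on older builds the call will fail. The helper names `zstd_stdin_command` and `compress_via_stdin` are hypothetical.

```python
import shutil
import subprocess
from typing import List


def zstd_stdin_command(size: int, level: int = 14) -> List[str]:
    """Build a zstd command that reads stdin and declares the exact input size."""
    return ["zstd", f"-{level}", f"--stream-size={size}", "-c"]


def compress_via_stdin(data: bytes, level: int = 14) -> bytes:
    """Pipe data to zstd on stdin and return the compressed bytes."""
    cmd = zstd_stdin_command(len(data), level)
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout


if __name__ == "__main__":
    if shutil.which("zstd") is None:
        print("zstd not found on PATH; skipping demo")
    else:
        sample = b"example payload " * 1000
        try:
            compressed = compress_via_stdin(sample)
            print(f"{len(sample)} => {len(compressed)} bytes")
        except subprocess.CalledProcessError:
            # Assumption: failure here most likely means the installed
            # zstd predates the --stream-size option.
            print("zstd rejected the command; --stream-size may be unsupported")
```

The declared size must match the real input exactly. For cases where only an estimate is available, a hint-style option like the one requested above would be the better fit.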
