The compression ratio for stdin can be worse than for the corresponding file, e.g.
> cat j000 | zstd -14 -f -o a
/*stdin*\ : 16.54% ( 75885 => 12549 bytes, a)
> zstd -14 j000
j000 : 15.51% ( 75885 => 11767 bytes, j000.zst)
Is this expected? If so, this should be mentioned in the man page.
This can indeed happen, most notably for small files.
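The problem is as follows:
When compressing a file, zstd can read the file size up front and tune its compression parameters to it.
With stdin, zstd has no way to know how much data it will receive. It will only discover it when the pipe ends.
Compression parameters must be chosen before the first byte is compressed, so zstd is obliged to "bet".
When the size is unknown (as with stdin), zstd will bet that the stream is "probably" "large". When it's not, the bet is lost, and some of the parameters are wrongly optimized. This impacts compression ratio.
Agreed that this should be mentioned in the documentation.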
Thanks for your explanation. Maybe a --size-hint option could help in cases where the difference is significant (like 5-10 %) and matters to the user.
I doubt an additional --size parameter is needed.
Reason:
For a large input stream from stdin, the compression ratio is already optimal, as zstd bets on a large data size.
For a small input, the difference in compression ratio does not really matter, as it can be a few hundred bytes at most. The difference is too small to be significant.
Yes, the absolute difference in bytes for one file is no problem. But the relative difference of 5-10 % can be relevant, e.g. I have some systems with 30 million small zst files and currently I "lost" over 100 GB of SSD space because I called zstd on stdin instead of a file (because it was more convenient in my context).
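To put a rough number on "small per file, large in aggregate", here is a back-of-the-envelope calculation. It assumes every file loses as much as j000 did in the first post, which is only a stand-in: the real files differ in size and content, and the total loss reported above (over 100 GB) is larger than this estimate.

```shell
# Compressed sizes reported in the first post for j000 at level 14.
stdin_bytes=12549   # result when compressing via stdin
file_bytes=11767    # result when compressing the file directly
files=30000000      # "30 million small zst files"

# Assumption: every file pays the same penalty as j000.
extra_per_file=$((stdin_bytes - file_bytes))
extra_total=$((extra_per_file * files))

echo "extra per file: $extra_per_file bytes"
echo "extra in total: $extra_total bytes"
```

Under that assumption the penalty is 782 bytes per file, or about 23.5 GB across 30 million files.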
I understand that, in your context, providing input data through stdin is more convenient.
What about the size of this data? Do you happen to know it nonetheless, or are you constrained to "roughly guess" it?
With some (justifiable) effort, I know the exact size in bytes as discussed for issue #1726. A rough guess option --stream-size-hint would be nice, too, but I understand that this can come after the --stream-size option.
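For illustration, this is roughly how a known size could be passed along with piped input. It is a sketch only: it assumes a zstd build that accepts the --stream-size option discussed here (check your version's man page), and compress_piped_with_size is a made-up helper name, not part of zstd.

```shell
# Hypothetical helper: compress SRC through a pipe into DST while telling
# zstd the exact number of bytes it will receive.
# Assumes the installed zstd supports --stream-size.
compress_piped_with_size() {
    src=$1
    dst=$2
    # "wc -c < file" prints only the byte count; tr strips any padding spaces.
    size=$(wc -c < "$src" | tr -d ' ')
    # The pipe stands in for whatever process produces the data on stdin.
    cat "$src" | zstd -14 -f --stream-size="$size" -o "$dst"
}
```

Called as `compress_piped_with_size j000 a`, this should let zstd choose its parameters for the real input size instead of betting on a large stream. The declared size has to match the bytes actually sent through the pipe.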