Consider the following fragments:
1.
ZSTD_CCtx* compressor = ZSTD_createCCtx();
ZSTD_initCStream(compressor, compression_level);
2.
ZSTD_CCtx* compressor = ZSTD_createCCtx();
ZSTD_CCtx_setParameter(compressor, ZSTD_c_compressionLevel, compression_level);
3.
ZSTD_CCtx* compressor = ZSTD_createCCtx();
ZSTD_parameters params =
ZSTD_getParams(compression_level, ZSTD_CONTENTSIZE_UNKNOWN, 0);
ZSTD_initCStream_advanced(compressor, nullptr, 0, params, ZSTD_CONTENTSIZE_UNKNOWN);
They look equivalent, and the first two in fact are. But the third yields very different compression settings in compression levels 1 and 2, and slightly different compression settings in higher levels.
There are two independent problems here.
The first problem causes a significantly weaker compression for levels 1 and 2 if ZSTD_initCStream_advanced is used to set the compression level. Here is my analysis what goes on (in zstd 1.4.0, but I believe nothing changed later):
ZSTD_initCStream_advanced calls ZSTD_assignParamsToCCtxParams.ZSTD_assignParamsToCCtxParams sets compressionLevel to ZSTD_CLEVEL_DEFAULT (i.e. 3) because it “should not matter, as all cParams are presumed properly defined”.ZSTD_compressStream2 calls ZSTD_getCParamsFromCCtxParams.ZSTD_getCParamsFromCCtxParams initializes cParams from compressionLevel, i.e. ZSTD_CLEVEL_DEFAULT. This includes setting cParams.targetLength to 1.ZSTD_getCParamsFromCCtxParams copies nonzero values from CCtxParams->cParams to cParams.CCtxParams->cParams.targetLength happens to be 0 for compression levels 1 and 2 (see ZSTD_getCParams), so it is not copied, and it is left at 1.ZSTD_disableLiteralsCompression to return 1 instead of 0.The first problem can be worked around by adding an apparently redundant ZSTD_CCtx_setParameter call after ZSTD_initCStream_advanced, thus specifying the compression level twice. I am not sure what should be the proper fix in zstd.
Here is the second problem. If ZSTD_getParams is used (either with ZSTD_initCStream_advanced or with ZSTD_CCtxParams_init_advanced + ZSTD_CCtx_setParametersUsingCCtxParams), it takes an estimatedSrcSize, which can be ZSTD_CONTENTSIZE_UNKNOWN. But if ZSTD_getParams is not used at all, the effect is different than specifying the estimatedSrcSize of ZSTD_CONTENTSIZE_UNKNOWN.
This can be observed if the first ZSTD_compressStream2 call is with ZSTD_e_end (i.e. zstd knows that the input is complete on the first streaming call). In this case compression settings are automatically tuned for the source size. This automatic tuning does not happen when ZSTD_getParams was used with ZSTD_CONTENTSIZE_UNKNOWN, in which case settings remain tuned for an infinite estimated size. I did not investigate further why this happens.
I can prepare a program which demonstrates these issues, but I do not have a standalone version ready (I only have one in the Google internal build system), so maybe this will not be needed if these descriptions are sufficient.
Hi @QrczakMK,
Thanks for the detailed write-up! Let me address the two issues you've identified separately.
In your second issue, you're describing differing behavior between (for example) these two snippets:
ZSTD_CCtx* cctx = ZSTD_createCCtx();
ZSTD_CCtx_reset(cctx, ZSTD_reset_session_and_parameters);
-ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, compressionLevel);
+ZSTD_parameters params = ZSTD_getParams(
+ compressionLevel, ZSTD_CONTENTSIZE_UNKNOWN, 0);
+ZSTD_CCtx_setZstdParams(cctx, params);
ZSTD_compress2(cctx, dst, dstSize, src, srcSize);
I can understand why you feel these ought to do the same thing. In reality though, you are giving different things to the compressor. With the former, you are giving it a desired compression level, asking the compressor to find and use the best internal params to accomplish that goal. In the latter, you are actually deriving values for those compression parameters and telling the compressor to use exactly those parameters (although there is actually a final step of adjustment that still happens when the input becomes known). So this precludes the compressor from internally doing that translation from compression level to compression parameters (through an internal call to ZSTD_getParams()) once it knows the input size.
All of which is to say, using ZSTD_CONTENTSIZE_UNKNOWN in ZSTD_getParams() does not leave the compression parameters unbound, it just selects defensive ones that aren't optimized for a particular input size.
Your first issue looks like a real bug though. Great analysis, by the way! I'm going to dig into this.
Addendum: this has the same problematic behavior as ZSTD_initCStream_advanced (and ZSTD_CCtx_setParameter works around the problem also here).
4.
ZSTD_CCtx* compressor = ZSTD_createCCtx();
ZSTD_parameters params =
ZSTD_getParams(compression_level, ZSTD_CONTENTSIZE_UNKNOWN, 0);
ZSTD_CCtx_params* ctx_params = ZSTD_createCCtxParams();
ZSTD_CCtxParams_init_advanced(ctx_params, params);
ZSTD_CCtx_setParametersUsingCCtxParams(compressor, ctx_params);
ZSTD_freeCCtxParams(ctx_params);
Yup, 4 is equivalent to 3, so that makes sense.
There is no such thing as ZSTD_CCtx_setZstdParams, although a comment in zstd.h claims that there is.
That is just meant to be pseudo code, since there is no public function that does it. I’ll put up a PR to clarify that documentation.
I am sorry but this bug is not fixed. I verified that on master. Please reopen the issue.
If I read the report correctly, there are 2 parts :
The first problem disables literals compression, resulting in worse compression ratio. This is indeed a problem. I'm surprised we did not noticed it, and we should first concentrate on confirming the bug and understanding why this issue evaded our tests. I presume a fix should be relatively straightforward in comparison to the test issue.
For the second "problem", I believe there is a misunderstanding.
The *_advanced() functions were designed as "debug" functions, and their semantic is slightly different from other "normal" compression functions.
In particular, most compression functions will try to adjust the meaning of "compression level" depending on the source size (if this information is accessible), while the *_advanced() functions will not.
In your examples, 1. and 2. are equivalent, because they just set a compression level, which meaning can be adapted by the compressor. But 3. is different, because it pins down compression parameters to some exact values, and these values cannot be adjusted afterwards.
Hence I'm afraid that what you are witnessing is not a bug, but the intended behavior.
Ultimately, this point is moot : *_advanced() functions are now redundant, and should no longer be used: they will be marked as deprecated on reaching v1.5.x serie. Reason is, there is now a much more powerful API available in the stable section, which is a superset of *_advanced(). This is ZSTD_CCtx_setParameter(), in combination with ZSTD_compress2() for one-shot compression, or in combination with ZSTD_compressStream().
Note that, similarly to your example 3., setting compression parameters with some exact values via ZSTD_CCtx_setParameter() will prevent automatic adjustment to take over.
We can probably improve the documentation to explain all this better.
Also : I read that the version used in your test was v1.4.0.
This could make a difference, compared to latest v1.4.3, regarding the first problem (literals compression disabled). To be tested.
The difference caused by the size being known is small, and it seems to manifest only if the first call of ZSTD_compressStream2 is with ZSTD_e_end, and you claim that this is by design, so I agree that this is likely not worth fixing.
The difference from literal compression is significant though: the compressed size can be twice larger (I have not measured the impact on compression speed).
There is likely a lot of code which uses ZSTD_initCStream_advanced() (in order to pass estimatedSrcSize), and which has no incentive for being rewritten to use the more modern API.
I verified with the master branch today that the problem remains there.
Issue confirmed.
It seems we are supposed to have some code to check this scenario, but the test itself was unfortunately buggy and missed it.
Both issues must be fixed.
Thanks for the great detailed description of this bug @QrczakMK .
It made the investigation easy, and allowed a quick fix.
We'll follow with a refactor of the test which was supposed to catch such issue (and missed it).
Thank you! I verified with today’s dev branch that this is fixed.