I have some other questions about dictionaries…
ZSTD_createCDict uses data generated by the cli "zstd --train", right?
Does the compression level used in --train have any influence on the result?
With release 1.3.3, I get crashes after having called ZSTD_createCDict with a compression level >= 14; the crash occurs when I call ZSTD_compress_generic (heap-buffer-overflow in ZSTD_copyCCtx_internal zstd_compress.c:1043). Seems fixed in dev branch.
Hi @jeromectm
the compression level used in --train has a _small_ impact on compression ratio.
The main idea is that training will use the selected level instead of the default one, generating statistics better tuned for the selected level. In our experience, the advantage is small (~1%), but if you know which compression level you are going to use in production, it is still worth taking.
ZSTD_createCDict() used to fail in the presence of specific training sets, either impossible to compress or too easy to compress. These are generally "synthetic" samples, typically hand-made, featuring trivial patterns which are not representative of real data. As these failures were always associated with strange, unrealistic training sets, the problem was fixed only recently.
That being said, ZSTD_createCDict() was supposed to fail, i.e. generate an error code, but not crash (like the heap buffer overflow you describe). I would love to access such a sample, even if it's fixed in the dev branch, just to be sure the bug is perfectly well understood and entirely fixed.
Thanks @Cyan4973, so I will create a single dictionary using my default / typical compression level, and use it whatever level is actually used (otherwise I would have to keep track of which dictionary to use for decompressing).
You can find an asan (address sanitizer) report, the dictionary used, the files used to generate the dictionary, at https://services.ctmdev.com/private-builds/zstdbug .
Here is my code:
mCompressContext=ZSTD_createCCtx();
ZSTD_CCtx_setParameter(mCompressContext, ZSTD_p_nbThreads, 4);
ZSTD_CCtx_setParameter(mCompressContext, ZSTD_p_compressionLevel, 3);
ZSTD_CCtx_setParameter(mCompressContext, ZSTD_p_enableLongDistanceMatching, false);
ZSTD_CCtx_setParameter(mCompressContext, ZSTD_p_compressionLevel, 19);
mCompressDict=ZSTD_createCDict(dictData, dictSize, 19);
ZSTD_CCtx_refCDict(mCompressContext, mCompressDict);
ZSTD_compress_generic_simpleArgs(mCompressContext, outData, outDataLen, &outPos, inData, inDataLen, &inPos, ZSTD_e_end);
Thanks!
I wasn't able to reproduce the failure; my code is here.
Hi @terrelln, the crash does not occur for me either when compressing the rfc44.tar file, but it does when compressing rfc1.txt, which I have added to my report. I did not expect the problem to depend on the data to compress, since it seems to happen in the initialization stage, but it appears to depend on dataSize and/or outSize=ZSTD_compressBound(dataSize).
Thanks for the report @jeromectm, I've reproduced the issue and I'm looking into it.