When creating a custom dictionary I ran into something weird where I get the error above when trying to use it.
It is sort of cobbled together from a 'header' from another file and a custom payload, though I can't see why that would be an issue.
To Reproduce
zstd -D dict-custom file.txt
zstd: error 11 : Allocation error : not enough memory
A 0 byte file is left behind.
Expected behavior
zstd should compress with the dictionary.
Desktop (please complete the following information):
zstd command line interface 64-bits v1.4.5, by Yann Collet
Additional context
As I said, this is a hacked-together dictionary, but AFAICT it should work. If I truncate the dictionary, it works.
I also tried creating a bigger dictionary with zstd itself, and that one works.
Tested : zstd -D dict-custom file.txt
../lib/compress/zstd_compress.c: ZSTD_compress_insertDictionary (dictSize=131252)
../lib/compress/zstd_compress.c:2895: ERROR!: check dictMaxSymbolValue < maxSymbolValue failed, returning ERROR(dictionary_corrupted): dict fse tables don't have all symbols
../lib/compress/zstd_compress.c:3019: ERROR!: forwarding error in ZSTD_checkDictNCount(offcodeNCount, offcodeMaxValue, ((offcodeMax)<(31) ? (offcodeMax) : (31))): Dictionary is corrupted:
../lib/compress/zstd_compress.c:3395: ERROR!: forwarding error in dictID: Dictionary is corrupted: ZSTD_compress_insertDictionary failed
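The traces above point at the dictionary-loading validation. As a simplified sketch (illustrative only, not the exact zstd source; the real check is ZSTD_checkDictNCount in lib/compress/zstd_compress.c), the encoder requires the dictionary's FSE tables to give a non-zero probability to every symbol up to the maximum it may need to emit:

```c
#include <assert.h>

/* Simplified sketch of the check seen failing in the traces:
 * every symbol up to maxSymbolValue must be representable by the
 * dictionary's normalized counts, otherwise the dictionary is
 * rejected as corrupted ("dict fse tables don't have all symbols"). */
static int dict_ncount_ok(const short* normalizedCounter,
                          unsigned dictMaxSymbolValue,
                          unsigned maxSymbolValue)
{
    unsigned s;
    if (dictMaxSymbolValue < maxSymbolValue) return 0;
    for (s = 0; s <= maxSymbolValue; s++) {
        if (normalizedCounter[s] == 0) return 0;  /* symbol has no code */
    }
    return 1;
}
```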
The assembled dictionary does not seem valid.
If you want to generate a valid dictionary from your own content, I recommend using ZDICT_finalizeDictionary(), which is designed for this purpose (though it requires a set of samples).
We could probably improve the error message.
_edit_ : mmmh, that will be more difficult. The only message surfacing is createCDict() == NULL, i.e. a failure to create the dictionary. It's translated into a memory allocation error by default, because the higher layer doesn't know any more than that (only internal traces tell us more).
@Cyan4973 Compare it to this dictionary:
dict-custom-shorter.gz
The only differences are after the repeat codes, plus a shorter dictionary content.
Unless there is an undocumented restriction that offset codes must be able to reference all parts of the dictionary, this should be perfectly valid. Is there such a check?
I am generating my own dictionaries. Since I am only interested in the content of the dictionary for now, I am just slapping the same 'header' in front from a dictionary generated by zstd.
If I read the docs correctly I should be able to replace the dictionary content with anything I want. But I get this error if the content is longer than the dict I grabbed the header from.
A mention of this in the spec is probably all that is needed to fix this (and probably a better error message, like "Invalid dictionary" or something similar).
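For reference, the splicing described above is working against the structured-dictionary layout from the zstd format spec: a little-endian magic number 0xEC30A437, a 4-byte Dictionary_ID, then the entropy tables (the 'header' being reused here), then the raw content. A minimal sanity check of the first 8 bytes, as a sketch:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Magic number that opens a structured zstd dictionary (format spec). */
#define ZSTD_DICT_MAGIC 0xEC30A437u

static uint32_t read_le32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the Dictionary_ID if the buffer starts like a structured
 * dictionary, or 0 otherwise (0 also means "no ID" in the spec).
 * Note this only checks the leading fields, not the entropy tables,
 * which is exactly the part the splicing trick can break. */
static uint32_t dict_id_if_valid(const uint8_t* dict, size_t size)
{
    if (size < 8) return 0;
    if (read_le32(dict) != ZSTD_DICT_MAGIC) return 0;
    return read_le32(dict + 4);
}
```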
Unless there is an undocumented restriction that offset codes must be able to reference all parts of the dictionary, this should be perfectly valid. Is there such a check?
It's not a format restriction,
but there is indeed such a check in the reference encoder.
What probably happens is this scenario:
I don't think it's a spec problem. The spec doesn't forbid that.
It's more an implementation limitation. This situation would not happen when using ZDICT_finalizeDictionary(), because it would notice the size of the dictionary, and increase the range of offset codes accordingly. Consequently, this limitation has never been a problem up to now.
We could probably fix it, in a number of ways, from early discarding to later optimization, depending on how much runtime cpu resources we want to spend on the topic during encoding.
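To illustrate why a longer dictionary content needs a wider offset-code range (based on my reading of the zstd format spec; hedged, not authoritative): a raw match offset O is coded as Offset_Value = O + 3 (values 1-3 are the repeat codes), and the FSE symbol is Offset_Code = floor(log2(Offset_Value)). So covering every offset up to window size + dictionary content size requires offset codes up to:

```c
#include <assert.h>

/* Highest offset code needed to reference any match at distance up to
 * maxOffset, assuming the spec's mapping Offset_Value = offset + 3 and
 * Offset_Code = floor(log2(Offset_Value)). */
static unsigned highest_offset_code(unsigned long maxOffset)
{
    unsigned long offsetValue = maxOffset + 3;
    unsigned code = 0;
    while (offsetValue > 1) { offsetValue >>= 1; code++; }
    return code;  /* floor(log2(maxOffset + 3)) */
}
```

For example, a 128 KB (131072-byte) window alone needs codes up to 17; once the dictionary content pushes the maximum distance past the next power of two, code 18 becomes necessary, which a table copied from a shorter dictionary's header may not cover.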
OK, it seems this is only an issue on the encoding side - the decoder doesn't complain.
It should be easy for us to allow dictionaries with histograms with 0-weighted values on the encoder side. We have the ability to validate tables now, so we can use that while loading dictionaries.
However, I'd generally recommend making all literal lengths + match lengths + offsets <= 128 KB + dictionary content size representable. For higher compression levels the encoder will measure whether it is better to repeat tables (from the dictionary in this case), use the default tables, use RLE, or write its own tables. But faster compression levels use a heuristic that says "if the table can represent all possible symbols and I have less than 1000 sequences then prefer to use the dictionary's tables". So if you don't generate dictionaries with every symbol valid, you will miss this heuristic with the reference encoder.
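The fast-level decision described above can be sketched roughly as follows (illustrative pseudologic, not the exact zstd source; the function and enum names here are made up for the example):

```c
#include <stddef.h>
#include <assert.h>

typedef enum {
    TABLE_REPEAT,        /* reuse the dictionary's tables as-is */
    TABLE_DECIDE_OTHER   /* default tables, RLE, or freshly built tables */
} table_choice_e;

/* Sketch of the fast-level heuristic: reuse the dictionary's tables only
 * when they can represent every possible symbol and the block carries few
 * sequences. A dictionary whose tables omit symbols never qualifies. */
static table_choice_e fast_level_table_choice(int dictTableCoversAllSymbols,
                                              size_t nbSeq)
{
    if (dictTableCoversAllSymbols && nbSeq < 1000)
        return TABLE_REPEAT;
    return TABLE_DECIDE_OTHER;
}
```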