Zstd: Custom dictionary: "error 11 : Allocation error : not enough memory"

Created on 25 May 2020 · 6 comments · Source: facebook/zstd

When creating a custom dictionary, I ran into something weird: I get the error above when trying to use it.

It is sort of cobbled together from a 'header' taken from another file plus a custom payload, though I can't see why that would be an issue.

To Reproduce

  • Gunzip this dictionary: dict-custom.gz
  • Use command: zstd -D dict-custom file.txt
  • zstd returns: zstd: error 11 : Allocation error : not enough memory. A 0-byte file is left behind.

Expected behavior

zstd should compress with the dictionary.

Desktop (please complete the following information):

  • OS: Windows 10, 64 bit.
  • Version: zstd command line interface 64-bits v1.4.5, by Yann Collet
  • Compiler: Released binary

Additional context

As I said this is a hacked together dictionary, but AFAICT it should work. If I truncate the dictionary it works.

I tried creating a bigger dictionary with zstd, but that works.

Label: question

Most helpful comment

It should be easy for us to allow dictionaries whose histograms contain 0-weighted values on the encoder side. We have the ability to validate tables now, so we can use that while loading dictionaries.

However, I'd generally recommend making all literal lengths, match lengths, and offsets up to 128 KB + dictionary content size representable. For higher compression levels, the encoder will measure whether it is better to repeat tables (from the dictionary in this case), use the default tables, use RLE, or write its own tables. But faster compression levels use a heuristic that says "if the table can represent all possible symbols and I have fewer than 1000 sequences, then prefer to use the dictionary's tables". So if you don't generate dictionaries with every symbol valid, you will miss out on this heuristic with the reference encoder.

All 6 comments

Tested: zstd -D dict-custom file.txt

../lib/compress/zstd_compress.c: ZSTD_compress_insertDictionary (dictSize=131252)
../lib/compress/zstd_compress.c:2895: ERROR!: check dictMaxSymbolValue < maxSymbolValue failed, returning ERROR(dictionary_corrupted): dict fse tables don't have all symbols
../lib/compress/zstd_compress.c:3019: ERROR!: forwarding error in ZSTD_checkDictNCount(offcodeNCount, offcodeMaxValue, ((offcodeMax)<(31) ? (offcodeMax) : (31))): Dictionary is corrupted:
../lib/compress/zstd_compress.c:3395: ERROR!: forwarding error in dictID: Dictionary is corrupted: ZSTD_compress_insertDictionary failed

The assembled dictionary does not seem valid.

If you want to generate a valid dictionary from your own content, I recommend using ZDICT_finalizeDictionary(), which is designed for this purpose (though it requires a set of samples).
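A minimal sketch of that approach, assuming the zdict.h API from v1.4.x. The buffer sizes and the toy samples below are placeholders; real usage should pass many genuinely representative samples:

```c
#include <stdio.h>
#include <string.h>
#include <zdict.h>

int main(void)
{
    /* Placeholder payload standing in for the hand-built dictionary content */
    static char customContent[131072];
    memset(customContent, 'a', sizeof customContent);

    /* ZDICT_finalizeDictionary() needs samples to build the entropy tables;
     * these toy samples are purely illustrative */
    static char samples[4][1024];
    size_t sampleSizes[4] = { 1024, 1024, 1024, 1024 };
    memset(samples, 'a', sizeof samples);

    static char dictBuffer[140000];      /* must hold entropy header + content */
    ZDICT_params_t params;
    memset(&params, 0, sizeof params);   /* defaults: level 0, random dictID */

    size_t const dictSize = ZDICT_finalizeDictionary(
        dictBuffer, sizeof dictBuffer,
        customContent, sizeof customContent,
        samples, sampleSizes, 4, params);

    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "error: %s\n", ZDICT_getErrorName(dictSize));
        return 1;
    }
    printf("finalized dictionary: %zu bytes\n", dictSize);
    return 0;
}
```

The key difference from a hand-assembled header is that the entropy tables are regenerated for the actual content size, so every offset code the encoder may need stays representable.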

We could probably improve the error message.
_edit_: hmm, that will be more difficult. The only signal surfacing is createCDict() == NULL, i.e. a failure to create the dictionary. It's translated into a memory allocation error by default, because the higher layer doesn't know any more (only internal traces tell us more).
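A sketch of the caller's view, using the public ZSTD_createCDict() API; the garbage dictionary bytes are a made-up placeholder:

```c
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Dictionary magic number (0xEC30A437, little-endian) followed by
     * zeroed garbage: parsed as a zstd dictionary, then rejected internally */
    unsigned char badDict[32] = { 0x37, 0xA4, 0x30, 0xEC };

    ZSTD_CDict* const cdict = ZSTD_createCDict(badDict, sizeof badDict, 3);
    if (cdict == NULL) {
        /* NULL carries no error code, so a corrupted dictionary is
         * indistinguishable from a genuine out-of-memory failure here;
         * the CLI maps this to "error 11 : Allocation error" */
        fprintf(stderr, "createCDict failed (reason unknown at this layer)\n");
        return 1;
    }
    ZSTD_freeCDict(cdict);
    return 0;
}
```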

@Cyan4973 Compare it to this dictionary:
dict-custom-shorter.gz

The only differences are the bytes after the repeat codes, and a shorter dictionary content.

Unless there is an undocumented restriction that offset codes must be able to reference all parts of the dictionary, this should be perfectly valid. Is there such a check?

I am generating my own dictionaries. Since I am only interested in the content of the dictionary for now, I am just slapping the same 'header', taken from a dictionary generated by zstd, in front of my content.

If I read the docs correctly, I should be able to replace the dictionary content with anything I want. But I get this error if the content is longer than the dictionary I grabbed the header from.

A mention of this in the spec is probably all that is needed to fix this (and probably a better error message - "Invalid dictionary" or something like that).

Unless there is an undocumented restriction that offset codes must be able to reference all parts of the dictionary, this should be perfectly valid. Is there such a check?

It's not a format restriction, but there is indeed such a check in the reference encoder.

What probably happens is this scenario:

  • you define your own custom dictionary, and it's > 128 KB
  • the reference encoder notices that the first block (128 KB) + dictionary can lead to offsets > 256 KB.
  • It checks the offset codes in the dictionary, to ensure that, if such a situation ever happens, it can blindly trust the dictionary's tables to represent the offset. It cannot, because the offsets were created for a different, smaller dictionary (110 KB + 128 KB < 256 KB).
  • Consequently, it fails to load the dictionary (see the sketch after this list).
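A simplified sketch of what that validation looks like, modeled on the ZSTD_checkDictNCount() call visible in the traces above (illustrative, not the actual zstd source):

```c
/* For each FSE table loaded from the dictionary, every symbol the encoder
 * might ever need must have a nonzero normalized count; otherwise the table
 * cannot be blindly trusted, and loading fails with dictionary_corrupted. */
static int checkDictNCount_sketch(const short* normalizedCount,
                                  unsigned dictMaxSymbolValue,
                                  unsigned maxSymbolValueRequired)
{
    unsigned s;
    if (dictMaxSymbolValue < maxSymbolValueRequired)
        return 0;   /* table too small: some required symbol is missing */
    for (s = 0; s <= maxSymbolValueRequired; s++) {
        if (normalizedCount[s] == 0)
            return 0;   /* a required symbol has zero weight */
    }
    return 1;
}
```

With a header built for a 110 KB dictionary glued onto more than 128 KB of content, the required maximum offset code grows past what the original tables cover, so this check fails.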

I don't think it's a spec problem. The spec doesn't forbid that.

It's more of an implementation limitation. This situation would not happen when using ZDICT_finalizeDictionary(), because it would notice the size of the dictionary and increase the range of offset codes accordingly. Consequently, this limitation has never been a problem up to now.

We could probably fix it in a number of ways, from early discarding to later optimization, depending on how much runtime CPU we want to spend on the topic during encoding.

OK, it seems this is only an issue on the encoding side - the decoder doesn't complain.
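A sketch illustrating that asymmetry, assuming the v1.4+ advanced API: loading the same dictionary on a decompression context succeeds, because the decoder only fails if a frame actually references a symbol missing from the tables:

```c
#include <stdio.h>
#include <zstd.h>

/* Returns 1 if the decoder accepts the dictionary, 0 otherwise */
int tryLoadOnDecoder(const void* dict, size_t dictSize)
{
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    size_t const r = ZSTD_DCtx_loadDictionary(dctx, dict, dictSize);
    int const ok = !ZSTD_isError(r);
    if (!ok)
        fprintf(stderr, "decoder rejected dict: %s\n", ZSTD_getErrorName(r));
    ZSTD_freeDCtx(dctx);
    return ok;
}
```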

It should be easy for us to allow dictionaries whose histograms contain 0-weighted values on the encoder side. We have the ability to validate tables now, so we can use that while loading dictionaries.

However, I'd generally recommend making all literal lengths, match lengths, and offsets up to 128 KB + dictionary content size representable. For higher compression levels, the encoder will measure whether it is better to repeat tables (from the dictionary in this case), use the default tables, use RLE, or write its own tables. But faster compression levels use a heuristic that says "if the table can represent all possible symbols and I have fewer than 1000 sequences, then prefer to use the dictionary's tables". So if you don't generate dictionaries with every symbol valid, you will miss out on this heuristic with the reference encoder.
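Paraphrasing that heuristic as a sketch (the names and structure here are illustrative, not the actual zstd internals):

```c
typedef enum { use_dict_tables, use_default_or_new_tables } tableChoice_sketch;

/* Fast-level shortcut described above: reuse the dictionary's tables only
 * when they can encode every possible symbol and the block has too few
 * sequences to justify building fresh tables. */
static tableChoice_sketch
chooseTables_sketch(int dictTableCoversAllSymbols, size_t nbSeq)
{
    if (dictTableCoversAllSymbols && nbSeq < 1000)
        return use_dict_tables;
    return use_default_or_new_tables;   /* measured choice at higher levels */
}
```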
