For example, it is unclear why
default_dict_size = 110 << 10
zstd.train_dictionary(default_dict_size, [bytes(range(256))]*6)
fails with ZstdError: cannot train dict: Src size is incorrect.
More precisely, I have no idea which size is incorrect here, or what a correct one would look like.
The error message is itself arcane; we need more error codes and more messages explaining what exactly is wrong and how to fix it.
What is the language used in the two-line example? Is that Python?
It is: the zstandard library, with some patches (a hardcoded version check removed and two C source files added) to use the latest zstd. The tests mostly pass; the failures look benign, such as changes in error messages and in the hashes of resulting streams.
How does one work with zstd-v1.4.3-win64 correctly? I have .zst files but cannot decode them; I downloaded the program, but I don't know how to use it.
@waldemar161
Should this question rather be directed at the Python implementation offering this interface?
I guess not. It just passes through what the library does.
ZstdError: cannot train dict: Src size is incorrect.
is the result of an error code returned by the zstd library. But it is unclear why it errors and what I should do about it.
The libzstd API is quite different from the Python prototype of your example code.
As a consequence, the equivalence between the two is not trivial, let alone the return value.
It must exist; it's just that someone has to read and understand the Python wrapper code to clarify the relation between the two.
But even without that, there are a few striking points in your question.
what is the sense of splitting a binary into chunks? How should I split, for example, a tarball?
That's not a correct thing to do.
If you are planning on compressing a tarball, then don't use dictionary training.
The outcome is most likely going to be negative.
A dictionary is only useful in the context of a _large quantity_ of small messages.
In which case, there is no chunking: each message is a sample.
A dictionary will save a few KB per message. That's a small gain... except when messages themselves are only a few KB large.
But for a single big tarball, this is insignificant, and likely a smaller benefit than the cost of the dictionary itself.
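To make the contrast concrete, here is a small stdlib sketch using zlib preset dictionaries (`zdict`) as a stand-in for zstd dictionaries. The records and the hand-crafted dictionary are hypothetical; with zstd, the dictionary would be built automatically from training samples:

```python
import zlib

# A "large quantity of small messages": records sharing a common skeleton
# (hypothetical data, for illustration only).
records = [
    b'{"user_id": %d, "status": "active", "country": "DE"}' % i
    for i in range(1000)
]

# A preset dictionary is just bytes containing substrings the messages are
# expected to share; zstd derives such content automatically from samples.
shared_dict = b'{"user_id": , "status": "active", "country": "DE"}'

def compressed_size(msg: bytes, zdict: bytes = b"") -> int:
    # Compress one message, optionally with a shared preset dictionary.
    c = zlib.compressobj(9, zdict=zdict) if zdict else zlib.compressobj(9)
    return len(c.compress(msg) + c.flush())

plain = sum(compressed_size(r) for r in records)
with_dict = sum(compressed_size(r, shared_dict) for r in records)
assert with_dict < plain  # every small message benefits from the shared dict
```

The same comparison run on one large tarball would show little or no benefit: a big input already carries enough internal redundancy for the compressor to exploit on its own.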
why do I have to have at least 7 chunks
I don't remember if it's 7 or more. But the previous definition should give a hint:
dictionaries are only useful to compress a lot of small messages.
If only a few samples are provided at the training stage, it's a strong hint that this is not a valid use case for dictionary training.
I use zstd to compress records in an SQLite database. Currently I compress per-record with a shared dict, and it halves the size of the db compared to compression without a shared dict. The devil is what to do when I don't have enough records now but expect them to appear in the future. Taking into account that dict training can fail complicates the code.
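The setup described above can be sketched with stdlib stand-ins: sqlite3, plus a zlib preset dictionary in place of a trained zstd dictionary. The table layout and record format are hypothetical:

```python
import sqlite3
import zlib

# Hypothetical shared dictionary: bytes resembling the common record skeleton.
shared_dict = b'{"kind": "note", "tags": [], "body": ""}'

def pack(record: bytes) -> bytes:
    # Compress one record with the shared dictionary.
    c = zlib.compressobj(9, zdict=shared_dict)
    return c.compress(record) + c.flush()

def unpack(blob: bytes) -> bytes:
    # Decompression must use the very same dictionary.
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, data BLOB)")
for i in range(100):
    rec = b'{"kind": "note", "tags": [], "body": "entry %d"}' % i
    db.execute("INSERT INTO records (id, data) VALUES (?, ?)", (i, pack(rec)))

row = db.execute("SELECT data FROM records WHERE id = 42").fetchone()
assert unpack(row[0]) == b'{"kind": "note", "tags": [], "body": "entry 42"}'
```

Note that the dictionary becomes part of the database's effective schema: every stored blob can only be decoded with the exact dictionary it was compressed with.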
I use zstd to compress records in an SQLite database. Currently I compress per-record with a shared dict, and it halves the size of the db
Yes, that's a good use case.
what to do when I don't have enough records now but expect them to appear in the future
If there are not enough records to train a dictionary, then it's probably not the right time to train one. It's also a sign that there is not much data to save.
If you "expect them to appear in the future", you might be able to infer their shape / characteristics, and from there produce a set of synthetic samples that "look like" these future records. Such samples are perfectly valid for training. Just note that the efficiency of the dictionary will be tied to how realistic the synthetic samples are.
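A minimal sketch of the synthetic-samples idea, again with zlib's preset dictionary standing in for a trained zstd dictionary. The record schema is invented; with zstd, the `samples` list would be fed to the dictionary trainer instead:

```python
import random
import zlib

random.seed(0)

def synthetic_record(i: int) -> bytes:
    # Shape inferred from the records we *expect* to store later
    # (hypothetical schema, for illustration).
    status = random.choice([b"active", b"blocked", b"pending"])
    return b'{"id": %d, "status": "%s", "email": "user%d@example.com"}' % (i, status, i)

# With zstd, these samples would go to dictionary training; here we simply
# use one synthetic record as a zlib preset dictionary.
samples = [synthetic_record(i) for i in range(100)]
zdict = samples[0]

# A "future" record, compressed with the dictionary built before it existed.
future = synthetic_record(12345)
c = zlib.compressobj(9, zdict=zdict)
packed = c.compress(future) + c.flush()
d = zlib.decompressobj(zdict=zdict)
assert d.decompress(packed) == future
assert len(packed) < len(future)  # the shared shape still pays off
```

The closer the synthetic schema is to the real future records, the more of each message the dictionary can absorb.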
If by "complicate" you mean you want to always invoke ZSTD_compress_usingCDict() unconditionally, even for data categories that don't have a dictionary yet, note this API comment:
https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L816
A ZSTD_CDict can be created from an empty dictBuffer
It's possible to create a ZSTD_CDict from a 0-sized buffer. In which case, the only thing it transports is the compression level. This makes it possible to always invoke ZSTD_compress_usingCDict() with a ZSTD_CDict* argument.
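In a wrapper, the same "one unconditional call path" idea can be sketched like this (a hypothetical helper built on zlib; with libzstd itself, the empty-buffer ZSTD_CDict removes even the internal branch shown here):

```python
import zlib

class Codec:
    """One call path for all data categories, dictionary or not.

    Mirrors the libzstd pattern: ZSTD_compress_usingCDict() can always be
    invoked because a ZSTD_CDict built from an empty buffer carries only
    the compression level. Here the branch is hidden inside the wrapper,
    so caller code never special-cases "no dictionary yet".
    """

    def __init__(self, level: int = 6, dictionary: bytes = b""):
        self.level = level
        self.dictionary = dictionary

    def compress(self, data: bytes) -> bytes:
        if self.dictionary:
            c = zlib.compressobj(self.level, zdict=self.dictionary)
        else:
            c = zlib.compressobj(self.level)
        return c.compress(data) + c.flush()

    def decompress(self, data: bytes) -> bytes:
        if self.dictionary:
            d = zlib.decompressobj(zdict=self.dictionary)
        else:
            d = zlib.decompressobj()
        return d.decompress(data)

# Callers use the same code path whether a dictionary exists yet or not.
no_dict = Codec()
with_dict = Codec(dictionary=b'{"status": "active"}')
msg = b'{"status": "active", "id": 7}'
assert no_dict.decompress(no_dict.compress(msg)) == msg
assert with_dict.decompress(with_dict.compress(msg)) == msg
```

Once enough records have accumulated to train a real dictionary, only the `dictionary` argument changes; none of the calling code does.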
Thanks for the info.