For example, it is unclear why
default_dict_size = 110 << 10
zstd.train_dictionary(default_dict_size, [bytes(range(256))]*6)
fails with ZstdError: cannot train dict: Src size is incorrect.
More precisely, I have no idea which size is incorrect here, or what a correct one would look like.
The error message is itself arcane; we need more error codes and more messages explaining what exactly is wrong and how to fix it.
What is the language used in the two-line example? Is that Python?
It is: the zstandard library, with some patches (a hardcoded version check removed and two C source files added) to use the latest zstd. The tests mostly pass; the failures look benign, such as changes in error messages and in the hashes of resulting streams.
How does one work with zstd-v1.4.3-win64 correctly? I have .zst files but cannot decode them; I downloaded the program, but I don't know how to use it.
@waldemar161
Should this question rather be directed at the Python implementation offering this interface?
I guess not. It just passes through what the library does.
ZstdError: cannot train dict: Src size is incorrect.
is the result of an error code returned by the zstd library. But it is unclear why it errors and what I should do about it.
The libzstd API is quite different from the Python prototype of your example code.
As a consequence, the equivalence between the two is not trivial, let alone the return value.
It must exist; it's just that someone has to read and understand the Python wrapper code to clarify the relation between the two.
But even without that, there are a few striking points in your question.
what is the sense of splitting a binary into chunks? How should I split, for example, a tarball?
That's not a correct thing to do.
If you are planning on compressing a tarball, then don't use dictionary training.
The outcome is most likely going to be negative.
A dictionary is only useful in the context of a _large quantity_ of small messages.
In which case, there is no chunking: each message is a sample.
A dictionary will save a few KB per message. That's a small gain... except when messages themselves are only a few KB large.
But for a single big tarball, this is insignificant, and likely a smaller benefit than the cost of the dictionary itself.
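To make the contrast concrete, here is a small stdlib sketch using zlib preset dictionaries (`zdict`) as a stand-in for zstd dictionaries. The records and the hand-crafted dictionary are hypothetical; with zstd, the dictionary would be built automatically from training samples:

```python
import zlib

# A "large quantity of small messages": records sharing a common skeleton
# (hypothetical data, for illustration only).
records = [
    b'{"user_id": %d, "status": "active", "country": "DE"}' % i
    for i in range(1000)
]

# A preset dictionary is just bytes containing substrings the messages are
# expected to share; zstd derives such content automatically from samples.
shared_dict = b'{"user_id": , "status": "active", "country": "DE"}'

def compressed_size(msg: bytes, zdict: bytes = b"") -> int:
    # Compress one message, optionally with a shared preset dictionary.
    c = zlib.compressobj(9, zdict=zdict) if zdict else zlib.compressobj(9)
    return len(c.compress(msg) + c.flush())

plain = sum(compressed_size(r) for r in records)
with_dict = sum(compressed_size(r, shared_dict) for r in records)
assert with_dict < plain  # every small message benefits from the shared dict
```

The same comparison run on one large tarball would show little or no benefit: a big input already carries enough internal redundancy for the compressor to exploit on its own.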
why do I have to have at least 7 chunks
I don't remember if it's 7 or more. But the previous definition should give a hint:
dictionaries are only useful to compress a lot of small messages.
If only a few samples are provided at the training stage, it's a strong hint that this is not a valid use case for dictionary training.
I use zstd to compress records in an SQLite database. Currently I compress per-record with a shared dict, and it halves the size of the db compared to compression without a shared dict. The devil is what to do when I don't have enough records now but expect them to appear in the future. Taking into account that dict training can fail complicates the code.
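The setup described above can be sketched with stdlib stand-ins: sqlite3, plus a zlib preset dictionary in place of a trained zstd dictionary. The table layout and record format are hypothetical:

```python
import sqlite3
import zlib

# Hypothetical shared dictionary: bytes resembling the common record skeleton.
shared_dict = b'{"kind": "note", "tags": [], "body": ""}'

def pack(record: bytes) -> bytes:
    # Compress one record with the shared dictionary.
    c = zlib.compressobj(9, zdict=shared_dict)
    return c.compress(record) + c.flush()

def unpack(blob: bytes) -> bytes:
    # Decompression must use the very same dictionary.
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, data BLOB)")
for i in range(100):
    rec = b'{"kind": "note", "tags": [], "body": "entry %d"}' % i
    db.execute("INSERT INTO records (id, data) VALUES (?, ?)", (i, pack(rec)))

row = db.execute("SELECT data FROM records WHERE id = 42").fetchone()
assert unpack(row[0]) == b'{"kind": "note", "tags": [], "body": "entry 42"}'
```

Note that the dictionary becomes part of the database's effective schema: every stored blob can only be decoded with the exact dictionary it was compressed with.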
I use zstd to compress records in an SQLite database. Currently I compress per-record with a shared dict, and it halves the size of the db
Yes, that's a good use case.
what to do when I don't have enough records now but expect them to appear in the future
If there are not enough records to train a dictionary, then it's probably not the right time to train one. It's also a sign that there is not much data to save.
If you "expect them to appear in the future", you might be able to infer their shape / characteristics, and from there produce a set of synthetic samples that "look like" these future records. Such samples are perfectly valid for training. Just note that the efficiency of the dictionary will be tied to how realistic the synthetic samples are.
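A minimal sketch of the synthetic-samples idea, again with zlib's preset dictionary standing in for a trained zstd dictionary. The record schema is invented; with zstd, the `samples` list would be fed to the dictionary trainer instead:

```python
import random
import zlib

random.seed(0)

def synthetic_record(i: int) -> bytes:
    # Shape inferred from the records we *expect* to store later
    # (hypothetical schema, for illustration).
    status = random.choice([b"active", b"blocked", b"pending"])
    return b'{"id": %d, "status": "%s", "email": "user%d@example.com"}' % (i, status, i)

# With zstd, these samples would go to dictionary training; here we simply
# use one synthetic record as a zlib preset dictionary.
samples = [synthetic_record(i) for i in range(100)]
zdict = samples[0]

# A "future" record, compressed with the dictionary built before it existed.
future = synthetic_record(12345)
c = zlib.compressobj(9, zdict=zdict)
packed = c.compress(future) + c.flush()
d = zlib.decompressobj(zdict=zdict)
assert d.decompress(packed) == future
assert len(packed) < len(future)  # the shared shape still pays off
```

The closer the synthetic schema is to the real future records, the more of each message the dictionary can absorb.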
If by "complicate" you mean you want to always invoke ZSTD_compress_usingCDict() unconditionally, even for data categories that don't have a dictionary yet, note this API comment:
https://github.com/facebook/zstd/blob/dev/lib/zstd.h#L816
A ZSTD_CDict can be created from an empty dictBuffer
It's possible to create a ZSTD_CDict from a 0-sized buffer. In which case, the only thing it transports is the compression level. This makes it possible to always invoke ZSTD_compress_usingCDict() with a ZSTD_CDict* argument.
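In a wrapper, the same "one unconditional call path" idea can be sketched like this (a hypothetical helper built on zlib; with libzstd itself, the empty-buffer ZSTD_CDict removes even the internal branch shown here):

```python
import zlib

class Codec:
    """One call path for all data categories, dictionary or not.

    Mirrors the libzstd pattern: ZSTD_compress_usingCDict() can always be
    invoked because a ZSTD_CDict built from an empty buffer carries only
    the compression level. Here the branch is hidden inside the wrapper,
    so caller code never special-cases "no dictionary yet".
    """

    def __init__(self, level: int = 6, dictionary: bytes = b""):
        self.level = level
        self.dictionary = dictionary

    def compress(self, data: bytes) -> bytes:
        if self.dictionary:
            c = zlib.compressobj(self.level, zdict=self.dictionary)
        else:
            c = zlib.compressobj(self.level)
        return c.compress(data) + c.flush()

    def decompress(self, data: bytes) -> bytes:
        if self.dictionary:
            d = zlib.decompressobj(zdict=self.dictionary)
        else:
            d = zlib.decompressobj()
        return d.decompress(data)

# Callers use the same code path whether a dictionary exists yet or not.
no_dict = Codec()
with_dict = Codec(dictionary=b'{"status": "active"}')
msg = b'{"status": "active", "id": 7}'
assert no_dict.decompress(no_dict.compress(msg)) == msg
assert with_dict.decompress(with_dict.compress(msg)) == msg
```

Once enough records have accumulated to train a real dictionary, only the `dictionary` argument changes; none of the calling code does.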
Thanks for the info.