I am trying to use the zstd compression API in my code with a trained dictionary. As a first step, I trained a dictionary with the zstd CLI using some files from the folder chosen for compression, then passed the resulting static dictionary to my code to compress all the files in that folder.
I noticed that at the default compression level there is no difference in compression savings. I am not sure whether I am using the APIs correctly, or whether the problem is the samples chosen for training the dictionary.
How do I choose the right samples for training dictionaries?
Code snippet that loads the static trained dictionary:
static void
load_zstd_dictionary()
{
    if (!dictionary_path) {
        return;
    }

    ffd = open(dictionary_path, O_RDONLY);
    if (ffd < 0) {
        perror("Could not open dictionary file");
        exit(1);
    }

    void* const dict_buf = malloc(dict_size);
    if (!dict_buf) {
        perror("Dictionary malloc error");
        close(ffd);
        exit(1);
    }

    cdict = ZSTD_createCDict(dict_buf, dict_size, zstd_comp_level);
    free(dict_buf);
    if (cdict == NULL) {
        close(ffd);
        perror("ZSTD_createCDict() failed");
        exit(1);
    }

    close(ffd);
    ffd = 0;
}
Snippet that compresses input buffers of size 32 KB:
if (zstd_without_dictionary) {
    max_out_len = ZSTD_compressBound(input_length);
    output_length = ZSTD_compress((out->fb_buf + off), max_out_len,
                                  in->fb_buf, input_length,
                                  /* compression level */ zstd_comp_level);
} else {
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    max_out_len = ZSTD_compressBound(input_length);
    output_length = ZSTD_compress_usingCDict(cctx, (out->fb_buf + off), max_out_len,
                                             in->fb_buf, input_length,
                                             cdict);
    ZSTD_freeCCtx(cctx);
}
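A side note on this snippet: creating and freeing a ZSTD_CCtx for every 32 KB buffer adds avoidable overhead, as zstd's documentation recommends reusing contexts across calls. A minimal sketch of context reuse, with illustrative names:

#include <zstd.h>

/* Sketch: keep one compression context alive for the whole run
 * instead of creating and freeing a ZSTD_CCtx per buffer. */
static ZSTD_CCtx* g_cctx; /* hypothetical long-lived context */

static size_t
compress_block_with_dict(void* dst, size_t dst_cap,
                         const void* src, size_t src_len,
                         const ZSTD_CDict* cdict)
{
    if (!g_cctx)
        g_cctx = ZSTD_createCCtx(); /* created once, reused afterwards */
    return ZSTD_compress_usingCDict(g_cctx, dst, dst_cap,
                                    src, src_len, cdict);
}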
Thanks in advance!
I don't see you actually reading the dictionary in from the file. The snippet you've included looks like it creates a dictionary from an empty/uninitialized buffer. Could that be your issue?
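For reference, a minimal sketch of what the load step could look like once the file contents are actually read into the buffer, using fstat to size it; error handling is abbreviated and names are illustrative:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include <zstd.h>

static ZSTD_CDict*
load_cdict(const char* path, int level)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }

    size_t dict_size = (size_t)st.st_size;
    void* dict_buf = malloc(dict_size);
    if (!dict_buf) {
        close(fd);
        return NULL;
    }

    /* The missing step: fill the buffer from the file. */
    ssize_t n = read(fd, dict_buf, dict_size);
    close(fd);
    if (n < 0 || (size_t)n != dict_size) {
        free(dict_buf);
        return NULL;
    }

    /* ZSTD_createCDict copies the content, so the buffer can be freed. */
    ZSTD_CDict* cdict = ZSTD_createCDict(dict_buf, dict_size, level);
    free(dict_buf);
    return cdict;
}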
I'm going to close this issue for now. If you want to follow up, please feel free to re-open.
I don't see you actually reading the dictionary in from the file. The snippet you've included looks like it creates a dictionary from an empty/uninitialized buffer. Could that be your issue?
Indeed, that's the issue. I corrected it and tried again, but I am still not seeing better compression savings with the dictionary. Here is how I am trying:
Collect 100 samples, each of size 32 KB, as the read IO progresses.
One sample per file, or one sample per 10 MB if the file is large.
Compress without the dictionary until the dictionary is trained.
Once the maximum number of samples is collected, train the dictionary (see the sketch below) and use it for all future compression.
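A minimal sketch of that training step, using ZDICT_trainFromBuffer from zdict.h and assuming the samples have been concatenated into one buffer with a parallel array of sizes (names are illustrative):

#include <stdlib.h>
#include <zdict.h>
#include <zstd.h>

/* samples: all collected samples laid out back to back;
 * sample_sizes[i]: length of the i-th sample; nb_samples: e.g. 100. */
static ZSTD_CDict*
train_cdict(const void* samples, const size_t* sample_sizes,
            unsigned nb_samples, int level)
{
    size_t dict_capacity = 112640; /* ~110 KB, the zstd CLI default */
    void* dict_buf = malloc(dict_capacity);
    if (!dict_buf)
        return NULL;

    size_t dict_size = ZDICT_trainFromBuffer(dict_buf, dict_capacity,
                                             samples, sample_sizes,
                                             nb_samples);
    if (ZDICT_isError(dict_size)) {
        free(dict_buf);
        return NULL;
    }

    ZSTD_CDict* cdict = ZSTD_createCDict(dict_buf, dict_size, level);
    free(dict_buf);
    return cdict;
}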
With this approach, I saw almost the same compression savings with and without the dictionary.
So I extended this approach to create new dictionaries for every 200 files, or for every 10 MB of data compressed. With that, the compression savings actually drop compared to compressing without a dictionary.
Now it's not clear to me what kind of samples and what training frequency will actually yield better savings. Or is there more to consider?
One sample per file, or one sample per 10 MB if the file is large.
Dictionaries are not useful for large files.
They only matter for small files (i.e., a few KB in size).
Do you mean large files don't provide good samples for dictionary training, or that they don't compress well even with a good dictionary?
It isn't clear to me yet. If I could treat a large file as regions of a few KB each and use those for compression, why wouldn't that be helpful? Also, I am doing block compression, where a block can range from 4 KB to 32 KB in size, so I don't really see the data as files but only as blocks of a few KB.
To conclude: if I only have files of larger sizes, say MBs or GBs, dictionary-based compression would not be useful even if I compress them as blocks of, say, 10 KB?
if I only have files of larger sizes, say MBs or GBs, dictionary-based compression would not be useful even if I compress them as blocks of, say, 10 KB?
It depends on what you call a "block".
In zstd, a "block" is an internal subdivision of a "frame", but is not independent (blocks must be decoded in sequential order).
But if, in your nomenclature, a "block" is a completely independent piece of data that can be randomly accessed and decoded independently of other blocks, then yes, it features (about) the same characteristics as a "small file", and dictionary compression is applicable.
Note though that a dictionary will help the compression ratio compared to independent blocks without a dictionary, but not to the point of providing a better ratio than compressing the full file as a single "block". It's intermediate.
However, there is still an important difference.
Dictionary compression gives its best when compressing a lot of small messages of a "similar" nature. This is important: a small message is likely to have a sort of "structure", such as a header, a few common fields, an ordering, etc. In contrast, a large file cut into small blocks will have essentially no such structure; each block just starts and stops at an arbitrary point in the flow. That means the commonalities between blocks will be pretty weak, and as a consequence, dictionaries will be less efficient.
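To make the independent-block case concrete, here is a minimal sketch that round-trips one block with a shared dictionary; every block becomes its own frame, decodable on its own as long as the decoder has the same dictionary (names are illustrative):

#include <string.h>
#include <zstd.h>

/* Round-trip one independent block (<= 32 KB) with a shared dictionary.
 * dict_buf/dict_size: the trained dictionary, available to both sides. */
static int
roundtrip_block(const void* block, size_t block_len,
                const void* dict_buf, size_t dict_size, int level)
{
    char comp[ZSTD_COMPRESSBOUND(32 * 1024)];
    char deco[32 * 1024];
    int ok = 0;

    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CDict* cdict = ZSTD_createCDict(dict_buf, dict_size, level);
    size_t csize = ZSTD_compress_usingCDict(cctx, comp, sizeof comp,
                                            block, block_len, cdict);

    if (!ZSTD_isError(csize)) {
        /* The decoder needs only this one frame plus the dictionary. */
        ZSTD_DCtx* dctx = ZSTD_createDCtx();
        ZSTD_DDict* ddict = ZSTD_createDDict(dict_buf, dict_size);
        size_t dsize = ZSTD_decompress_usingDDict(dctx, deco, sizeof deco,
                                                  comp, csize, ddict);
        ok = !ZSTD_isError(dsize) && dsize == block_len
          && memcmp(deco, block, block_len) == 0;
        ZSTD_freeDDict(ddict);
        ZSTD_freeDCtx(dctx);
    }

    ZSTD_freeCDict(cdict);
    ZSTD_freeCCtx(cctx);
    return ok;
}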