Zstd: maxdict not being respected for values above 59042

Created on 21 Mar 2019 · 10 comments · Source: facebook/zstd

I'm using v1.3.8 from msys2, so if this has been fixed recently, feel free to close.

Training with --maxdict=59024 gives me a dictionary of size 59024, but training with --maxdict=59025 gives me a dictionary of size 8275.
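For context, the training invocations would have looked roughly like this (the sample paths are placeholders; the exact commands aren't shown in the report):

```sh
# produces a dictionary of exactly 59024 bytes
zstd --train samples/* -o dict.59024 --maxdict=59024

# unexpectedly produces a dictionary of only 8275 bytes
zstd --train samples/* -o dict.59025 --maxdict=59025
```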

All 10 comments

The maximum dictionary size is just that, a maximum: if the trainer can't find enough unique content to put in the dictionary, it will end early. Was that training on the same data both times?

If you can link me to the inputs you trained on, I can tell you why the dictionary trainer only selected a dictionary of 8275 bytes.

Yes, it was on the same data. This looks like an overflow error: the k value drops to 50 with values above 59024. I'll do some more investigation later (the file is 18GB). Even if I specify really large values, the dictionary only comes out at a few KB.

I've been able to reproduce what seems like the same behavior. I'll look into it and let you know what's going on.

There is likely something we could improve here, as you say, relating to maxdict and k. We break the source up into max(1, maxdict / k) epochs. We likely want to choose fewer epochs when maxdict is large compared to the source size.
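A back-of-the-envelope sketch of that split (illustrative arithmetic only, not the actual trainer code; the corpus size and k value are assumptions):

```sh
corpus=$((72 * 1024 * 1024))    # assumed total training-set size in bytes
k=1024                          # assumed segment size
for maxdict in 112640 1048576 10485760; do
  epochs=$(( maxdict / k ))
  [ "$epochs" -lt 1 ] && epochs=1            # max(1, maxdict / k)
  echo "maxdict=$maxdict -> $epochs epochs, ~$(( corpus / epochs )) sample bytes per epoch"
done
```

The larger maxdict is relative to the training data, the thinner each epoch's slice of samples becomes, so the trainer can run out of useful content and return a dictionary far smaller than maxdict.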

However, I'd recommend training on, at an absolute minimum, 10x the size of the dictionary, and if you can, up to 100x the size of the dictionary. In general, the more data you provide, the better the dictionary will be. If your dictionary is ending up too small, a quick fix is to simply train on more data.

I'll leave this issue open until I investigate more into this edge case, and see if it can be improved.

It's an 18GB file, split with -B32M, so if my calculations are correct its training set is 72MB. That should let me use a 720KB dictionary at the "maximum" (100x) and a 7.2MB dictionary at the "minimum" (10x), unless you mean per sample?

It just occurred to me that perhaps my use case isn't actually what the training mode is meant for: I'm trying to compress a single large file, and the dictionary isn't actually providing any benefit in compressed size or speed (though it's already getting a 19.91% ratio).

Yeah, dictionaries won't help on large files. They only help when compressing small files.

I'll make sure to look into it before the next release.

I've encountered similar behaviour where increasing both the size of the dictionary and the number of training files resulted in a smaller dictionary.

On a related note: is there a recommended file size (or range of file sizes) for dictionary compression to perform well?

@shakeelrao dictionaries work best for small data, from ~50 bytes to a few KB. Once you start to get into the 100 KB - 1 MB range, dictionary effectiveness dwindles. Below ~50 bytes, on the other hand, we often can't find enough compression opportunities to offset the frame header.
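For example, a trained dictionary would then be applied per small file with the CLI (the file names here are illustrative):

```sh
# compress a small record with the dictionary, then decompress it again
zstd -D my_dictionary small_record.json -o small_record.json.zst
zstd -D my_dictionary -d small_record.json.zst -o small_record.out.json
```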

You should train your dictionary on a sample of the files you are going to be compressing. The sum of the sample file sizes should be 10x to 100x the dictionary size (which is 110 KB by default).
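Concretely, with the default 110 KB limit that means roughly 1 MB to 11 MB of samples in total; a typical invocation might look like this (the sample directory is an assumption):

```sh
# train on a representative sample set; --maxdict defaults to 112640 bytes (110 KB)
zstd --train samples/*.json -o my_dictionary

# or cap the dictionary size explicitly
zstd --train samples/*.json -o my_dictionary --maxdict=112640
```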

I've put up PR #1556 to improve the behavior of the dictionary builder on small or homogeneous training corpora. Please let me know if that doesn't fix the behavior on your inputs.

I've merged the fix into the development branch. Please reopen the issue if you find that the situation is not improved for your use case.
