Zstd: dictionary: find similarities

Created on 15 Jun 2020 · 8 comments · Source: facebook/zstd

Is your feature request related to a problem? Please describe.
Each month, I've got about 1.2M files in JSON format. Each JSON has two parts:

  • a fixed part, where the user can only select 1-6 values for each key, and
  • some free-text entries, where the user can enter whatever they want.

I've checked the content of the dictionary after training (using 60k files), and there are many free-text elements which only appear once or twice across the 1.2 million files. These are exceptions, i.e. statistical outliers, and finding and indexing outliers does not bring any real advantage.

Describe the solution you'd like
It would be great to find the similarities across all the files and use only those to create the dictionary, so it would be built from the keys and values that repeat over and over.

Describe alternatives you've considered
I'm going to write a script that creates all possible combinations of the key/value pairs and feeds this into --train.

question

Most helpful comment

if we know there are some long repeated sections in files, is there a way to force these into the dictionary?

It's always possible to generate your own dictionary content, using a dedicated heuristic better adapted to the specificities of the target files to compress, and then transform this selected content into a fully formed dictionary, using ZDICT_finalizeDictionary().

All 8 comments

I presume you are using the default dictionary builder, which is relatively fast. It works by selecting promising "segments" from the batch of samples, segments that are known to be efficient "as a whole" but irrespective of the efficiency of individual bytes, which makes it possible for a few one-time snippets to be embedded alongside more useful ones.

I believe there are advanced modes which are able to go down to byte accuracy, but this is more complex, slower, and not necessarily worth it (compression isn't hugely improved). Maybe @terrelln will be able to tell more.

@tobwen as @Cyan4973 mentioned, we select "segments" of a fixed size k from the data. We then score each segment based on the sum of the frequencies of its substrings of length d < k, typically 6-8. By default, the trainer tries multiple values of k and selects the best dictionary.

This will select segments that contain useful pieces, but there will also be "noise" left in the dictionary. This is because sometimes leaving "holes" in the dictionary is helpful. We have special offsets called "repeat offsets": when two matches occur at the same offset, the second one is much cheaper. So if your data looks like "KEY1":RAND5,"KEY2", where RAND5 is 5 random bytes, it is better to have "KEY1":XXXXX,"KEY2" in the dictionary than just "KEY1":,"KEY2", where XXXXX can be anything.

So building a dictionary isn't quite as simple as selecting the best content. You also have to think about how this content is arranged in the dictionary, and the "holes" you leave.
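As a toy illustration of the segment-scoring step described above (this is not the actual COVER/fastcover implementation; the hash table is simplified and collisions plus the deduplication of dmers within a segment are ignored):

```c
/* Toy sketch: score each k-byte candidate segment by summing how often its
 * length-d substrings ("dmers") appear across all samples. Illustrative only. */
#include <stdint.h>
#include <stddef.h>

#define TABLE_LOG  20
#define TABLE_SIZE (1u << TABLE_LOG)

static uint32_t freq[TABLE_SIZE];

/* FNV-1a-style hash of a d-byte window (illustrative, collisions ignored) */
static uint32_t hash_dmer(const uint8_t *p, size_t d) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < d; i++) { h ^= p[i]; h *= 16777619u; }
    return h & (TABLE_SIZE - 1);
}

/* pass 1: count dmer frequencies over the concatenated sample buffer */
static void count_dmers(const uint8_t *samples, size_t totalSize, size_t d) {
    for (size_t i = 0; i + d <= totalSize; i++)
        freq[hash_dmer(samples + i, d)]++;
}

/* pass 2: score one candidate segment of size k; the trainer keeps the
 * highest-scoring segments as dictionary content */
static uint64_t score_segment(const uint8_t *seg, size_t k, size_t d) {
    uint64_t score = 0;
    for (size_t i = 0; i + d <= k; i++)
        score += freq[hash_dmer(seg + i, d)];
    return score;
}
```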

Okay, I had thought the dictionary was really a lookup table of the most common values. Most of my 1.2 million files consist of about 80% repeating, common data; the rest is effectively "random noise" (free text).

Just for fun (and research), I'll finish creating my synthetic file. I'll try my fixed segments both in order and in randomized order. I'll report results when I'm done.

Can anyone give me a hint whether it's possible (or useful) to run the training on all 1.2M files? _zcat_ always complained, so I had to run it on only 60k files.

@tobwen you can use -r to train on a directory. However, it is very unlikely that training on 1.2m files will be better than training on 60k files. But it will certainly be slower :)

You can also try non-default parameters like --train-cover=steps=256 -T0. That might create a slightly better dictionary.

On a similar note, if we know there are some long repeated sections in files, is there a way to force these into the dictionary? The repeated sections are 10-50 KB in size, so much larger than anything a traditional training setup would be able to find effectively.

These repeated sections make up ~70% of the total file size, so being able to de-duplicate this data into a single file / dictionary is important.

edit: To clarify: a large number of small files, with ~70% of the size of each individual file made up of 10 or so identical byte sequences. Using normal training only gives a 50% reduction in size. Hacking together something to manually de-duplicate those 10 or so sequences in the data and then training on the remaining portion reduces the total size to ~20% of the original.

if we know there are some long repeated sections in files, is there a way to force these into the dictionary?

It's always possible to generate your own dictionary content, using a dedicated heuristic better adapted to the specificities of the target files to compress, and then transform this selected content into a fully formed dictionary, using ZDICT_finalizeDictionary().
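To make that suggestion concrete, here is a minimal sketch of turning hand-picked raw content (e.g. the known repeated sequences, concatenated) into a finished dictionary. The buffer names and the level-19 setting are illustrative placeholders, not taken from this thread; on older zstd releases ZDICT_finalizeDictionary() is still in the experimental section and needs ZDICT_STATIC_LINKING_ONLY:

```c
/* Minimal sketch: finalize hand-selected dictionary content.
 * All buffer names below are hypothetical placeholders. */
#define ZDICT_STATIC_LINKING_ONLY  /* needed on older zstd versions */
#include <zdict.h>
#include <stdio.h>
#include <string.h>

size_t build_dict(void *dictBuffer, size_t dictCapacity,
                  const void *rawContent, size_t rawContentSize,      /* your selected dictionary content */
                  const void *samplesBuffer, const size_t *sampleSizes,
                  unsigned nbSamples)                                  /* representative files, concatenated back-to-back */
{
    ZDICT_params_t params;
    memset(&params, 0, sizeof(params));
    params.compressionLevel = 19;  /* tune entropy tables for the level you plan to compress with (illustrative) */

    size_t const dictSize = ZDICT_finalizeDictionary(
            dictBuffer, dictCapacity,
            rawContent, rawContentSize,
            samplesBuffer, sampleSizes, nbSamples,
            params);
    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "dictionary build failed: %s\n", ZDICT_getErrorName(dictSize));
        return 0;
    }
    /* dictBuffer now holds a dictionary usable with ZSTD_compress_usingDict()
     * or ZSTD_CCtx_loadDictionary() */
    return dictSize;
}
```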

I attempted to do something like that (through a wrapper library) and ended up getting worse compression than the default training, so I was obviously doing something wrong. I need to dig into the zstd docs a little more so I'm not blindly trying to get it to work, and make sure to experiment with the finalizeDictionary approach.

I concatenated all the sequences I knew were shared between the files, passed that to ZDICT_finalizeDictionary() as the dictionary content, and used a portion of the files as the sample input. The resulting dictionary was then used to compress all the files and reduced their total size to ~22%. If I do a second pass with a "default trained" dictionary, it reduces further to 19%, which is the best I got using my custom approach. I assume I could combine the results from the default training with my custom content into a single dictionary to get the 19% in a single pass.

Thanks for the help, this will result in great compression without the overhead of maintaining the custom setup I was experimenting with.

