Zstd: Compare cost of dictionary FSE table against new tables

Created on 7 Apr 2018 · 5 comments · Source: facebook/zstd

On my test set, zstd

  • compresses considerably worse after --train, and
  • segfaults when using the trained dictionary from compression level 14 upwards (gone after updating from 1.3.3 to 1.3.4)
-b#  untrained  trained
1   3.423  2.986
2   3.379  2.807
3   3.387  2.811
4   3.425  2.913
5   3.692  2.960
6   4.154  3.286
7   4.154  3.773
8   4.154  3.773
9   4.153  3.771
10  3.995  3.771
11  5.414  3.771
12  5.414  3.770
13  5.414  3.778
14  5.414  3.904
15  5.414  3.894
16  5.414  3.894
17  5.414  3.894
18  5.414  3.894
19  5.414  3.894
20  5.481  4.108
21  5.481  4.108
22  5.481  4.108

Test set (the primes below ~10^7)

$ mkdir -pv primes
$ for n in {0..1000};do primesieve $((10000*n)) $((10000*n+10000)) -p > primes/$n.dat;done

cross-check

$ for n in {1..22}; do zstd -b$n -r -T2 primes 2>&1 | grep -oP '\(\K[^\)]+' | head -1 | xargs echo $n; done
1 3.423
2 3.379
[...]

Training

$ zstd --train -r primes
Trying 5 different sets of parameters                                          
k=1998                                                                         
d=8
steps=4
Save dictionary of size 112640 into file dictionary

Evaluation

$ for n in {1..22}; do zstd -b$n -r -T2 primes -D dictionary 2>&1 | grep -oP '\(\K[^\)]+' | head -1 | xargs echo $n; done
1 2.986
2 2.807
[...]

```bash
$ zstd --version
* zstd command line interface 64-bits v1.3.4, by Yann Collet *
$ uname -a
Linux hostname 4.15.10-1-ARCH #1 SMP PREEMPT Thu Mar 15 12:24:34 UTC 2018 x86_64 GNU/Linux
```

Labels: enhancement, question

All 5 comments

  • The segfault is almost certainly the same as issue #1004, which was indeed fixed in version 1.3.4.
  • The loss is coming from the entropy stage. The dictionary's built-in tables aren't as well adapted to the source as the ones built without a dictionary. Since this source is so regular, the entropy stage gives a large boost to compression ratio, and using slightly wrong statistics really hurts.
  • At compression level 6, we start analyzing the cost of using the dictionary's Huffman table compared to writing a new Huffman table, and choose the better option. However, we don't yet do this for FSE tables, which would improve the compression ratio for dictionaries significantly in this case.
  • You can also specify the compression level during training (zstd --train -1) to get tables that are more adapted to the compression level used.

I'll leave this issue open to track the FSE table cost estimation feature.

I don't fully understand what you wrote but I really appreciate your answer! Maybe I'll read up on this when I have time.
Anyway, there is one thing I am confused about: why is there such a big boost in the entropy stage? As far as I know, there is no known prime _generator_ yet, and hence no big scheme behind the primes that lowers the entropy of the system -- just elaborate sieve _methods_. (Note: actually a really interesting question: does the existence of exact and finitely terminating methods imply a general scheme?)

I'll have a look at the compression setting while training when I find the time. I did not know about that possible combination! Thank you!

First, the files only contain 11 symbols (the digits 0-9 and a newline), so the Huffman tables should reflect their probabilities exactly to get the best compression.
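As a rough illustration of how little entropy such an alphabet carries, here is a small Python sketch; the sample string is a hypothetical excerpt in the same format as the primes files, not anything zstd actually models:

```python
import math
from collections import Counter

# Hypothetical excerpt in the same digits-plus-newline format as the test files.
sample = "1000003\n1000033\n1000037\n1000039\n1000081\n"

counts = Counter(sample)
total = len(sample)

# Zeroth-order Shannon entropy in bits per symbol: a lower bound on what an
# entropy coder with exact per-symbol statistics can achieve for this model.
entropy = -sum(n / total * math.log2(n / total) for n in counts.values())
print(len(counts), "symbols,", round(entropy, 2), "bits/symbol")
```

With so few distinct symbols, the entropy is barely over 2 bits per symbol, so even small deviations from the true statistics cost a noticeable fraction of the output size.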

More importantly, the matches the LZ portion emits will be very regular, since the files are very regular and sorted. For example, look at an excerpt from 100.dat:

1000003
1000033
1000037
1000039
1000081
1000099
1000117
1000121
1000133
1000151

There is nearly always a match of length exactly 6, exactly 8 bytes in the past ("\n10000", and later "\n10001"). 2.dat, on the other hand, will see matches of exactly length 4, exactly 6 bytes in the past. By using separate statistics for each file, we will get much better compression for each match.
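This regularity is easy to check by hand; in the Python sketch below, match_len is a hypothetical helper (not zstd code) measuring the LZ match available at each line boundary of a 7-digit-primes excerpt:

```python
# Hypothetical excerpt of sorted 7-digit primes, one per line (8 bytes each).
data = "1000003\n1000033\n1000037\n1000039\n1000081\n"

def match_len(buf, pos, dist):
    """Length of the match at `pos` copying from `dist` bytes earlier."""
    n = 0
    while pos + n < len(buf) and buf[pos + n] == buf[pos + n - dist]:
        n += 1
    return n

# Every line is 8 bytes, so at each interior newline there is a match at
# distance 8 covering at least the shared "\n10000" prefix.
lengths = [match_len(data, pos, 8) for pos in range(15, len(data) - 1, 8)]
print(lengths)
```

Every measured match sits at the same distance and within a byte or two of the same length, which is exactly the kind of peaked distribution per-file statistics capture and pooled dictionary statistics blur.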

The dictionary combines the statistics for all the files, which does much worse than having distinct statistics, because of the extremely regular nature of the data.

Eureka! Yes, that was my mental blackout. I thought about it from a mathematical/philosophical point of view but the numbers still need to be _implemented_ and then, yes, there is a huge overlap.
E.g. in your example -- without any statistics -- the transformation to 1000003, 30, 4, 2, 42, 18, ... would already dramatically reduce the size. I think this is what you had in mind.
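That gap transformation can be sketched in a few lines of Python (plain delta coding for illustration, not anything zstd actually does):

```python
# The first few primes above 10^6, as in the excerpt from 100.dat.
primes = [1000003, 1000033, 1000037, 1000039, 1000081, 1000099]

# Delta coding: keep the first value, then store successive gaps.
gaps = [primes[0]] + [b - a for a, b in zip(primes, primes[1:])]
print(gaps)  # [1000003, 30, 4, 2, 42, 18]

# The original list is recovered by a running sum, so no information is lost.
restored = [gaps[0]]
for g in gaps[1:]:
    restored.append(restored[-1] + g)
assert restored == primes
```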

Thank you! :-)


Edit
P.s.: Or in other words: the element of interest here is the entropy of the implementation, which depends on the system it is implemented in. In the case of binary electronics, one ends up with Shannon entropy.

For levels >= 6 we now compare the cost of using the dictionary tables versus building new tables or using the default tables.
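The idea behind that comparison can be sketched as a toy model in Python. The strings, the header size, and the probability floor below are all made up for illustration; zstd's real estimator works on FSE/Huffman table representations, not raw character counts:

```python
import math
from collections import Counter

def model(text):
    """Empirical symbol probabilities of `text`."""
    return {s: n / len(text) for s, n in Counter(text).items()}

def coded_bits(text, probs, floor=1e-12):
    """Bits to code `text` under `probs`; unmodelled symbols hit the floor."""
    return sum(-math.log2(probs.get(s, floor)) for s in text)

# Pooled "dictionary" statistics over differently shaped files, versus the
# statistics of the one block being compressed (all strings hypothetical).
dictionary_text = "1000003\n20011\n30013\n"
block = "1000033\n1000037\n1000039\n"

NEW_TABLE_HEADER_BITS = 200  # assumed cost of describing a new table

reuse_cost = coded_bits(block, model(dictionary_text))
build_cost = NEW_TABLE_HEADER_BITS + coded_bits(block, model(block))

# Pick whichever is cheaper: for large, regular inputs the header overhead
# amortizes and the better-fitting new table wins.
choice = "reuse dictionary table" if reuse_cost <= build_cost else "build new table"
```

The block's own statistics always code the block in fewer bits than the pooled ones; the decision hinges on whether that saving outweighs the cost of writing the new table into the frame.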

