On my test set, training a dictionary lowers the compression ratio at every level (`-b#` = benchmark at compression level #):

| `-b#` | untrained | trained (`--train`) |
|------:|----------:|--------------------:|
| 1 | 3.423 | 2.986 |
| 2 | 3.379 | 2.807 |
| 3 | 3.387 | 2.811 |
| 4 | 3.425 | 2.913 |
| 5 | 3.692 | 2.960 |
| 6 | 4.154 | 3.286 |
| 7 | 4.154 | 3.773 |
| 8 | 4.154 | 3.773 |
| 9 | 4.153 | 3.771 |
| 10 | 3.995 | 3.771 |
| 11 | 5.414 | 3.771 |
| 12 | 5.414 | 3.770 |
| 13 | 5.414 | 3.778 |
| 14 | 5.414 | 3.904 |
| 15 | 5.414 | 3.894 |
| 16 | 5.414 | 3.894 |
| 17 | 5.414 | 3.894 |
| 18 | 5.414 | 3.894 |
| 19 | 5.414 | 3.894 |
| 20 | 5.481 | 4.108 |
| 21 | 5.481 | 4.108 |
| 22 | 5.481 | 4.108 |
Test set (the primes up to ~10^7):

```bash
$ mkdir -pv primes
$ for n in {0..1000}; do primesieve $((10000*n)) $((10000*n+10000)) -p > primes/$n.dat; done
```
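A quick sanity check of the generated data (optional; assumes standard coreutils):

```bash
# Number of generated files, and total number of primes written across them.
$ ls primes | wc -l
$ cat primes/*.dat | wc -l
```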
Cross-check:

```bash
$ for n in {1..22}; do zstd -b$n -r -T2 primes 2>&1 | grep -oP '\(\K[^\)]+' | head -1 | xargs echo $n; done
1 3.423
2 3.379
[...]
```
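As an aside, the per-level loop can be replaced by zstd's built-in level-range benchmark; a sketch, assuming the `-e` (end level) flag is available in your build:

```bash
# Benchmark all levels from 1 to 22 in a single invocation.
$ zstd -b1 -e22 -r -T2 primes
```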
Training:

```bash
$ zstd --train -r primes
Trying 5 different sets of parameters
k=1998
d=8
steps=4
Save dictionary of size 112640 into file dictionary
```
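To reproduce the parameters the trainer settled on without re-running the optimizer, the cover trainer can be invoked directly; a sketch, assuming your build supports `--train-cover` and `--maxdict` (the size cap matches the dictionary above):

```bash
# Train with the cover algorithm using k=1998, d=8 and cap the dictionary at 112640 bytes.
$ zstd --train-cover=k=1998,d=8 -r primes -o dictionary --maxdict=112640
```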
Evaluation:

```bash
$ for n in {1..22}; do zstd -b$n -r -T2 primes -D dictionary 2>&1 | grep -oP '\(\K[^\)]+' | head -1 | xargs echo $n; done
1 2.986
2 2.807
[...]
```
```bash
$ zstd --version
* zstd command line interface 64-bits v1.3.4, by Yann Collet *
$ uname -a
Linux hostname 4.15.10-1-ARCH #1 SMP PREEMPT Thu Mar 15 12:24:34 UTC 2018 x86_64 GNU/Linux
```
At level 6 and above, we start analyzing the cost of using the dictionary's Huffman table compared to writing a new Huffman table, and choose the better option. However, we don't yet do this for the FSE tables, which would improve the compression ratio for dictionaries significantly in this case. You can also try `zstd --train -1` to get tables that are more adapted to the compression level used. I'll leave this issue open to track the FSE table cost estimation feature.
I don't fully understand what you wrote, but I really appreciate your answer! Maybe I'll read into this when I have time.
Anyway, there is one thing I am confused about: why is there such a big boost within the entropy stage? As far as I know, there is no known prime _generator_ yet, and hence no big scheme behind the primes -- one that would lower the entropy of the system -- but just elaborate sieve _methods_. (Note: actually a really interesting question: does the existence of exact and finitely terminating methods imply a general scheme?)
I'll have a look at the compression-level setting during training when I find the time. I did not know about that possible combination! Thank you!
First, the files contain only 11 symbols (the digits 0-9 and newline), so the Huffman tables should reflect their probabilities exactly to get the best compression.
More importantly, the matches the LZ portion emits will be very regular, since the files are very regular and sorted. For example, look at an excerpt from 100.dat:

```
1000003
1000033
1000037
1000039
1000081
1000099
1000117
1000121
1000133
1000151
```
There is nearly always a match of length exactly 6, found exactly 8 bytes in the past ("\n10000" followed by "\n10001"). 2.dat, on the other hand, will see matches of length exactly 4, exactly 6 bytes in the past. By using separate statistics for each file, we get much better compression for each match.
The dictionary combines the statistics for all the files, which does much worse than having distinct statistics, because of the extremely regular nature of the data.
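Both points are easy to verify on the generated data; a quick sketch, assuming the primes directory from above and standard coreutils:

```bash
# List the distinct characters in 100.dat: only the digits 0-9 appear (newline is the line separator).
$ grep -o . primes/100.dat | sort -u

# Every prime in 100.dat has 7 digits, so each record is 8 bytes including the newline,
# which is why the typical match is 6 bytes long at an offset of 8.
$ awk '{ print length($0) + 1 }' primes/100.dat | sort -n | uniq -c
```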
Eureka! Yes, that was my mental blackout. I thought about it from a mathematical/philosophical point of view, but the numbers still need to be _implemented_, and then, yes, there is a huge overlap.
E.g. in your example -- even without any statistics -- transforming the list into its successive differences (1000003, 30, 4, 2, 42, 18, ...) would already dramatically reduce the size. I think this is what you had in mind.
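A minimal sketch of that delta transformation, assuming the primes/100.dat file generated above:

```bash
# Print the first prime, then the gap to each following prime.
$ awk 'NR == 1 { print $1 } NR > 1 { print $1 - prev } { prev = $1 }' primes/100.dat | head

# Compare compressed sizes of the raw file and the delta-encoded stream.
$ zstd -19 -c primes/100.dat | wc -c
$ awk 'NR == 1 { print $1 } NR > 1 { print $1 - prev } { prev = $1 }' primes/100.dat | zstd -19 -c | wc -c
```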
Thank you! :-)
Edit
P.s.: Or in other words: the element of interest here is the entropy of the implementation, which depends on the system it is implemented in. In the case of binary electronics, one ends up with Shannon entropy.
For levels >= 6 we now compare the cost of using the dictionary tables versus building new tables or using the default tables.
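With that change in place, the regression at the higher levels can be re-checked directly; a sketch of the evaluation, assuming the same primes directory and dictionary as above and a build with the `-e` end-level flag:

```bash
# Re-run the benchmark across levels 6-22 with the trained dictionary.
$ zstd -b6 -e22 -r -T2 primes -D dictionary
```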