Division by zero exception instead of ZSTD_error_dstSize_tooSmall in COVER_buildDictionary at https://github.com/facebook/zstd/blob/dev/lib/dictBuilder/cover.c#L611 when a small maxdict parameter value is provided
zstd --train sample.json --maxdict=1024
sample.json
['a': 'constant_field', 'b': '1615890720', 'c': 1068041704, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1987910979', 'c': 1136274312, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '458354839', 'c': 752791499, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1345300048', 'c': 1808022441, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '317792882', 'c': 1971021450, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1083291535', 'c': 365688543, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1505090195', 'c': 683584065, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1896415458', 'c': 941930511, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '597682527', 'c': 1729893465, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '416102126', 'c': 660877617, 'd': 'sometimes constant field']
output
ZDICT_optimizeTrainFromBuffer_cover:
kSteps 4
kStepSize 487
kIterations 5
Trying 5 different sets of parameters
#d 8
#k 50
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 50
COVER_buildDictionary: epochs 20
statistics ...
HUF_writeCTable error
Failed to finalize dictionary
#k 537
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 537
COVER_buildDictionary: epochs 1
statistics ...
HUF_writeCTable error
Failed to finalize dictionary
#k 1024
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 1024
COVER_buildDictionary: epochs 1
statistics ...
HUF_writeCTable error
Failed to finalize dictionary
#k 1511
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 1511
COVER_buildDictionary: epochs 0
Thanks for the report, it will be fixed.
In the meantime, to train a dictionary of size X, you should have at least 10X data, preferably 100X. It's not a strict rule, but generally the more data you have, the better the dictionary will be, and after 100X, the returns are diminishing. Additionally, the dictionary builder expects each sample in its own file. You can put each ['a': 'constant_field', 'b': '1615890720', 'c': 1068041704, 'd': 'sometimes constant field'] in its own x.json file, put them in a directory and use the -r flag.
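A possible way to script that split, assuming the records always abut as `][` and that GNU sed is available for the `\n` in the replacement (the training command is shown as a comment since it needs enough real data to be useful):

```shell
# Build a toy concatenated file standing in for sample.json above.
printf "['a': 1]['a': 2]['a': 3]" > sample.json
mkdir -p samples
# Break the concatenation on "][" boundaries, one record per line,
# then write each line to its own file under samples/.
sed "s/\]\[/]\n[/g" sample.json | split -l 1 - samples/s
ls samples
# With one sample per file, training becomes:
#   zstd -r --train samples --maxdict=1024 -o dictionary
```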
Could you briefly clarify the difference between cover and legacy dictionary training?

It is a different algorithm for choosing the content of the dictionary that generally produces better dictionaries. They will both produce dictionaries in the same format.
On the command line both dictionary builders need the samples to be separated into one sample per file, so they can properly measure the effectiveness of the dictionary. In the library, the dictionary training algorithms need to know where each sample begins for the same reason.
Thanks for the fix! But here is another strange behavior change I noticed.
Legacy:
> zstd -r --train-legacy samples
Save dictionary of size 668 into file dictionary
> zstd -D dictionary sample.json -o sample.json.zst
sample.json : 17.14% ( 910 => 156 bytes, sample.json.zst)
Default:
> zstd -r --train samples
k=1998; d=8; steps=4
Save dictionary of size 112640 into file dictionary
> zstd -D dictionary sample.json -o sample.json.zst
sample.json : 28.79% ( 910 => 262 bytes, sample.json.zst)
Without dict:
> zstd sample.json -o sample.json.zst
sample.json : 24.95% ( 910 => 227 bytes, sample.json.zst)
Legacy training produces a reasonably sized dictionary, while cover training fills the dictionary to its maximum capacity. Does this mean that, on small sample sets, one now has to choose the maxdict parameter carefully to get a good compression ratio?
Samples:
samples/00:
['a': 'constant_field', 'b': '1615890720', 'c': 1068041704, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1987910979', 'c': 1136274312, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '458354839', 'c': 752791499, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1345300048', 'c': 1808022441, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '317792882', 'c': 1971021450, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1083291535', 'c': 365688543, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1505090195', 'c': 683584065, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1896415458', 'c': 941930511, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '597682527', 'c': 1729893465, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '416102126', 'c': 660877617, 'd': 'sometimes constant field']
samples/01:
['a': 'constant_field', 'b': '139144628', 'c': 1125602140, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1608708305', 'c': 1202808157, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '753323514', 'c': 46326616, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1085246858', 'c': 1021589995, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '872993270', 'c': 978358653, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '490837366', 'c': 956817177, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1308131035', 'c': 288934220, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '204141825', 'c': 1251940683, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '256480753', 'c': 1866958909, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1355425061', 'c': 1390228452, 'd': 'sometimes constant field']
samples/02:
['a': 'constant_field', 'b': '1911822216', 'c': 42252713, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1165397557', 'c': 1206155420, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '909480084', 'c': 856568224, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1280463357', 'c': 329968021, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1832937087', 'c': 419843337, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1160703904', 'c': 1023422188, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1295978151', 'c': 106845161, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1558123818', 'c': 1255454738, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '174987515', 'c': 2082486450, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '111120650', 'c': 1352227552, 'd': 'sometimes constant field']
samples/03:
['a': 'constant_field', 'b': '1849642674', 'c': 1545382100, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '454881370', 'c': 1320908289, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '412566215', 'c': 1855079360, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '2073854819', 'c': 1728840929, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '576662446', 'c': 27667678, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1922250347', 'c': 518688385, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '499784169', 'c': 1243260496, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '162539335', 'c': 59446910, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '780788206', 'c': 353698398, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1867454132', 'c': 990410042, 'd': 'sometimes constant field']
samples/04:
['a': 'constant_field', 'b': '808758844', 'c': 798359434, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1432364045', 'c': 1578304330, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1663263763', 'c': 1378055717, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '830140435', 'c': 748137689, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '191368625', 'c': 1369606979, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '672827335', 'c': 981461372, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '412977198', 'c': 400220815, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1189047758', 'c': 1758820128, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '639796174', 'c': 1687103339, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1667682167', 'c': 1821576811, 'd': 'sometimes constant field']
sample.json
['a': 'constant_field', 'b': '857019877', 'c': 1923929452, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '2033103372', 'c': 728933312, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1735149371', 'c': 1118261193, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '671215230', 'c': 896311924, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '2034237595', 'c': 997536529, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1479944834', 'c': 1257271596, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '176175054', 'c': 269330267, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1274610059', 'c': 1168413542, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1372225755', 'c': 459817170, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '609547231', 'c': 845362304, 'd': 'sometimes constant field']
Dictionary looks like this:
['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a'...
I'll look into it and see why exactly that is happening. It has all the "good" content at the end of the dictionary as it should, but it should stop adding segments to the dictionary.
However, the general rule of thumb is if you want to train a dictionary of size X, you want to have 100X bytes of data. Otherwise, there just isn't enough data to construct a good dictionary.
I have a fix for the end condition, but there is a separate issue of overfitting when optimizing the parameters. Immediately we can add the dictionary size to the compressed size when evaluating dictionaries. We probably also want to support having separate training/test data.
With pull #811 applied, I get:
./zstd -r --train samples
! Warning : nb of samples too low for proper processing !
! Please provide _one file per sample_.
! Do not concatenate samples together into a single file,
! as dictBuilder will be unable to find the beginning of each sample,
! resulting in poor dictionary quality.
Trying 5 different sets of parameters
k=1024
d=8
steps=4
Save dictionary of size 303 into file dictionary
and the result is
./zstd -D dictionary sample.json -c > /dev/null
sample.json : 17.01% ( 911 => 155 bytes, /*stdout*\)