Division by zero exception instead of ZSTD_error_dstSize_tooSmall in COVER_buildDictionary at https://github.com/facebook/zstd/blob/dev/lib/dictBuilder/cover.c#L611 when a small maxdict parameter value is provided
zstd --train sample.json --maxdict=1024
sample.json
['a': 'constant_field', 'b': '1615890720', 'c': 1068041704, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1987910979', 'c': 1136274312, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '458354839', 'c': 752791499, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1345300048', 'c': 1808022441, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '317792882', 'c': 1971021450, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1083291535', 'c': 365688543, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1505090195', 'c': 683584065, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1896415458', 'c': 941930511, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '597682527', 'c': 1729893465, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '416102126', 'c': 660877617, 'd': 'sometimes constant field']
output
ZDICT_optimizeTrainFromBuffer_cover:
kSteps 4
kStepSize 487
kIterations 5
Trying 5 different sets of parameters
#d 8
#k 50
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 50
COVER_buildDictionary: epochs 20
statistics ...
HUF_writeCTable error
Failed to finalize dictionary
#k 537
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 537
COVER_buildDictionary: epochs 1
statistics ...
HUF_writeCTable error
Failed to finalize dictionary
#k 1024
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 1024
COVER_buildDictionary: epochs 1
statistics ...
HUF_writeCTable error
Failed to finalize dictionary
#k 1511
COVER_buildDictionary: dictBufferCapacity 1024
COVER_buildDictionary: parameters.k 1511
COVER_buildDictionary: epochs 0
Thanks for the report, it will be fixed.
In the meantime, to train a dictionary of size X, you should have at least 10X data, preferably 100X. It's not a strict rule, but generally the more data you have, the better the dictionary will be, and after 100X, the returns are diminishing. Additionally, the dictionary builder expects each sample in its own file. You can put each ['a': 'constant_field', 'b': '1615890720', 'c': 1068041704, 'd': 'sometimes constant field'] in its own x.json file, put them in a directory and use the -r flag.
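A possible way to script that split, assuming the records always abut as `][` and that GNU sed is available for the `\n` in the replacement (the training command is shown as a comment since it needs enough real data to be useful):

```shell
# Build a toy concatenated file standing in for sample.json above.
printf "['a': 1]['a': 2]['a': 3]" > sample.json
mkdir -p samples
# Break the concatenation on "][" boundaries, one record per line,
# then write each line to its own file under samples/.
sed "s/\]\[/]\n[/g" sample.json | split -l 1 - samples/s
ls samples
# With one sample per file, training becomes:
#   zstd -r --train samples --maxdict=1024 -o dictionary
```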
Could you briefly clarify the difference between cover and legacy dictionary training?

It is a different algorithm for choosing the content of the dictionary that generally produces better dictionaries. They will both produce dictionaries in the same format.
On the command line both dictionary builders need the samples to be separated into one sample per file, so they can properly measure the effectiveness of the dictionary. In the library, the dictionary training algorithms need to know where each sample begins for the same reason.
Thanks for the fix! But here is another strange behavior change I noticed.
Legacy:
> zstd -r --train-legacy samples
Save dictionary of size 668 into file dictionary
> zstd -D dictionary sample.json -o sample.json.zst
sample.json : 17.14% ( 910 => 156 bytes, sample.json.zst)
Default:
> zstd -r --train samples
k=1998; d=8; steps=4
Save dictionary of size 112640 into file dictionary
> zstd -D dictionary sample.json -o sample.json.zst
sample.json : 28.79% ( 910 => 262 bytes, sample.json.zst)
Without dict:
> zstd sample.json -o sample.json.zst
sample.json : 24.95% ( 910 => 227 bytes, sample.json.zst)
Legacy training produces a reasonably sized dictionary, while cover training fills the dictionary to its maximum capacity. Does this mean that, on small sample sets, one now has to choose the maxdict parameter carefully to get a good compression ratio?
Samples:
samples/00:
['a': 'constant_field', 'b': '1615890720', 'c': 1068041704, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1987910979', 'c': 1136274312, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '458354839', 'c': 752791499, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1345300048', 'c': 1808022441, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '317792882', 'c': 1971021450, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1083291535', 'c': 365688543, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1505090195', 'c': 683584065, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1896415458', 'c': 941930511, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '597682527', 'c': 1729893465, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '416102126', 'c': 660877617, 'd': 'sometimes constant field']
samples/01:
['a': 'constant_field', 'b': '139144628', 'c': 1125602140, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1608708305', 'c': 1202808157, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '753323514', 'c': 46326616, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1085246858', 'c': 1021589995, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '872993270', 'c': 978358653, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '490837366', 'c': 956817177, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1308131035', 'c': 288934220, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '204141825', 'c': 1251940683, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '256480753', 'c': 1866958909, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1355425061', 'c': 1390228452, 'd': 'sometimes constant field']
samples/02:
['a': 'constant_field', 'b': '1911822216', 'c': 42252713, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1165397557', 'c': 1206155420, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '909480084', 'c': 856568224, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1280463357', 'c': 329968021, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1832937087', 'c': 419843337, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1160703904', 'c': 1023422188, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1295978151', 'c': 106845161, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1558123818', 'c': 1255454738, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '174987515', 'c': 2082486450, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '111120650', 'c': 1352227552, 'd': 'sometimes constant field']
samples/03:
['a': 'constant_field', 'b': '1849642674', 'c': 1545382100, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '454881370', 'c': 1320908289, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '412566215', 'c': 1855079360, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '2073854819', 'c': 1728840929, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '576662446', 'c': 27667678, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1922250347', 'c': 518688385, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '499784169', 'c': 1243260496, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '162539335', 'c': 59446910, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '780788206', 'c': 353698398, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1867454132', 'c': 990410042, 'd': 'sometimes constant field']
samples/04:
['a': 'constant_field', 'b': '808758844', 'c': 798359434, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1432364045', 'c': 1578304330, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1663263763', 'c': 1378055717, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '830140435', 'c': 748137689, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '191368625', 'c': 1369606979, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '672827335', 'c': 981461372, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '412977198', 'c': 400220815, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1189047758', 'c': 1758820128, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '639796174', 'c': 1687103339, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1667682167', 'c': 1821576811, 'd': 'sometimes constant field']
sample.json
['a': 'constant_field', 'b': '857019877', 'c': 1923929452, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '2033103372', 'c': 728933312, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1735149371', 'c': 1118261193, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '671215230', 'c': 896311924, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '2034237595', 'c': 997536529, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1479944834', 'c': 1257271596, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '176175054', 'c': 269330267, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1274610059', 'c': 1168413542, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '1372225755', 'c': 459817170, 'd': 'sometimes constant field']['a': 'constant_field', 'b': '609547231', 'c': 845362304, 'd': 'sometimes constant field']
Dictionary looks like this:
['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a': '['a'...
I'll look into it and see why exactly that is happening. It has all the "good" content at the end of the dictionary as it should, but it should stop adding segments to the dictionary.
However, the general rule of thumb is if you want to train a dictionary of size X, you want to have 100X bytes of data. Otherwise, there just isn't enough data to construct a good dictionary.
I have a fix for the end condition, but there is a separate issue of overfitting when optimizing the parameters. Immediately we can add the dictionary size to the compressed size when evaluating dictionaries. We probably also want to support having separate training/test data.
With pull #811 applied, I get:
./zstd -r --train samples
! Warning : nb of samples too low for proper processing !
! Please provide _one file per sample_.
! Do not concatenate samples together into a single file,
! as dictBuilder will be unable to find the beginning of each sample,
! resulting in poor dictionary quality.
Trying 5 different sets of parameters
k=1024
d=8
steps=4
Save dictionary of size 303 into file dictionary
and the result is
./zstd -D dictionary sample.json -c > /dev/null
sample.json : 17.01% ( 911 => 155 bytes, /*stdout*\)