Zstd: significantly worse compression at mid-levels

Created on 5 Mar 2019 · 8 comments · Source: facebook/zstd

I'm using stable zstd from Python via ctypes as below.

cctx = zstd.ZSTD_createCCtx()
zstd.ZSTD_CCtx_setParameter(cctx, 100, level) # ZSTD_c_compressionLevel
zstd.ZSTD_CCtx_setParameter(cctx, 160, 1) # ZSTD_c_enableLongDistanceMatching
zstd.ZSTD_CCtx_setParameter(cctx, 101, 30) # ZSTD_c_windowLog
zstd.ZSTD_compress2(cctx, ...)
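
For reproducibility, here is a fuller ctypes sketch of the same calls. This is a sketch only: the library lookup, argtypes, and omitted error handling are my assumptions; the enum values mirror the snippet above and a recent zstd.h.

import ctypes, ctypes.util

lib = ctypes.CDLL(ctypes.util.find_library('zstd'))
lib.ZSTD_createCCtx.restype = ctypes.c_void_p
lib.ZSTD_CCtx_setParameter.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_int]
lib.ZSTD_compressBound.restype = ctypes.c_size_t
lib.ZSTD_compressBound.argtypes = [ctypes.c_size_t]
lib.ZSTD_compress2.restype = ctypes.c_size_t
lib.ZSTD_compress2.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_size_t,
                               ctypes.c_char_p, ctypes.c_size_t]
lib.ZSTD_freeCCtx.argtypes = [ctypes.c_void_p]

def compress(data, level):
    # data must be bytes; parameters mirror the snippet above.
    # Return codes of setParameter/compress2 should be checked with
    # ZSTD_isError in real use; omitted here for brevity.
    cctx = lib.ZSTD_createCCtx()
    lib.ZSTD_CCtx_setParameter(cctx, 100, level)  # ZSTD_c_compressionLevel
    lib.ZSTD_CCtx_setParameter(cctx, 160, 1)      # ZSTD_c_enableLongDistanceMatching
    lib.ZSTD_CCtx_setParameter(cctx, 101, 30)     # ZSTD_c_windowLog
    bound = lib.ZSTD_compressBound(len(data))
    dst = ctypes.create_string_buffer(bound)
    written = lib.ZSTD_compress2(cctx, dst, bound, data, len(data))
    lib.ZSTD_freeCCtx(cctx)
    return dst.raw[:written]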

With some data, I noticed worse compression at mid-levels.

>>> import hashlib
>>> import zlib
>>> import zstd
>>> 
>>> data = '\n'.join('%01000i-%s' % (x, hashlib.sha1(str(x)).hexdigest()) for x in xrange(100000))
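
(On Python 3, where xrange is gone and sha1 wants bytes, an equivalent would be:)

data = '\n'.join('%01000i-%s' % (x, hashlib.sha1(str(x).encode()).hexdigest())
                 for x in range(100000)).encode()  # 104199999 bytes, matching the CLI runs below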

With the above context parameters:

>>> for level in range(0, 20):
...   print '%2i' % level, len(zstd.compress(data, level))
... 
 0 2812376
 1 2515911
 2 2604028
 3 2812376
 4 3092223
 5 3169946
 6 3171094
 7 3122268
 8 3103687
 9 3143055
10 3158913
11 3164377
12 3164371
13 3163761
14 3165556
15 3164949
16 2614982
17 2621542
18 2600424
19 2639589

Same trend, but better values, without the custom parameters 160 and 101 (long-distance matching and window log).

 0 2737876
 1 2499981
 2 2568151
 3 2737876
 4 2772291
 5 2780765
 6 2782399
 7 2737815
 8 2721064
 9 2723085
10 2830334
11 2832962
12 2832962
13 2831237
14 2832843
15 2832835
16 2589461
17 2598022
18 2548138
19 2548727

And zlib for comparison.

>>> for level in range(1, 10):
...   print '%2i' % level, len(zlib.compress(data, level))
... 
 1 3878738
 2 3878788
 3 3878723
 4 3076001
 5 3076028
 6 3076034
 7 3076034
 8 2997640
 9 2997640

This data is an edge case, and not indicative of performance on real data. Nonetheless, I have a guess about why performance degrades; it is just a guess, and I haven't tested that it is the case. I will investigate why the optimal parser used at level 19 isn't able to beat level 1, since it should be able to.

This is an edge case where zstd's "repcodes" help immensely. Repcodes are 2 offset codes for repeat offsets: if you find a match at the same offset as one of the last 2 matches you emitted, zstd can encode that incredibly cheaply.

Levels 1 and 2 take advantage of the first repcode (rep0). They don't know to look for the second repcode (rep1), but since they aren't great at finding small matches in the hex/non-zero number data, they can use rep0 heavily.

Levels 3-15 are better at finding small matches, but don't know how to look for the second repcode (rep1), so they often can't use rep0. This is a bad bet on this data, and compression ratio suffers.
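
A quick way to see why rep0 pays off so much on this data: every generated line has the same length, so matches between adjacent lines all recur at a single fixed offset. A small sketch (Python 3) computing that stride:

# 1000 zero-padded digits + '-' + 40 hex chars, plus the '\n' separator:
# every line-to-line match sits at a constant offset of 1042 bytes.
line = '%01000i-%s' % (0, hashlib.sha1(b'0').hexdigest())
print(len(line) + 1)  # 1042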

Levels 16-19 are the "optimal parser". Since the matches are 1000 bytes, we're actually hitting an edge case of the optimal parser where it just emits the first match it finds. If we set ZSTD_c_targetLength = 2000, we'll see how the optimal parser actually performs:

> zstd testdata -19 -o /dev/null
testdata             :  2.45%   (104199999 => 2557397 bytes, /dev/null)
> zstd testdata -19 -o /dev/null --zstd=tlen=2000
testdata             :  2.24%   (104199999 => 2338067 bytes, /dev/null)
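
(Through the C API, the same experiment is one extra setParameter call, in the issue's ctypes style; ZSTD_c_targetLength is enum value 106 in a recent zstd.h, so treat the constant as an assumption for other versions:)

zstd.ZSTD_CCtx_setParameter(cctx, 106, 2000)  # ZSTD_c_targetLength (assumed enum value)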

After we increase the target length, I see that the optimal parser is using rep0 and rep1 heavily. Before setting the target length, the optimal parser isn't making great use of the repcodes. I think there is an edge case we can improve by preferring repcodes when we hit the target length.

The reason the optimal parser chooses an offset rather than a repcode when target length < 1000 is that it finds the first long match starting at one of the last bytes of the hex digest. This doesn't have a regular offset, so the repcodes can't be used. Target length is an escape hatch to improve parsing speed, and the analysis when target length is > 2000 looks right, so there isn't anything actionable to improve.

Thanks for the detailed explanation.

I should note, though, that in the actual data I use %010i rather than %01000i, which shows a different oddity at levels 18-19. I used 1000 to better demonstrate the effect at mid-levels.

>>> for level in range(0, 20):
...     print '%2i' % level, len(zstd.compress(data, level))
... 
 0 2536317
 1 2366708
 2 2423876
 3 2536317
 4 2665321
 5 2652715
 6 2641414
 7 2610699
 8 2603081
 9 2609117
10 2606528
11 2606050
12 2606033
13 2606257
14 2605663
15 2605767
16 2196969
17 2196968
18 2418041
19 2418447

That is interesting; it is a regression from v1.3.4, which did fine at level 19.

The ZSTD_btUltra strategy is what is causing the regression. If I force this shortcut to be enabled for ZSTD_btUltra, the compressed size is back to 2196418.
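
(One way to check that the strategy is the culprit, without rebuilding zstd, is to pin the strategy while keeping the rest of the level-18 parameters. A sketch in the issue's ctypes style; ZSTD_c_strategy = 107, ZSTD_btopt = 7 and ZSTD_btultra = 8 are taken from a recent zstd.h and are assumptions for other versions:)

for strategy in (7, 8):  # ZSTD_btopt vs ZSTD_btultra (assumed enum values)
    cctx = zstd.ZSTD_createCCtx()
    zstd.ZSTD_CCtx_setParameter(cctx, 100, 18)        # ZSTD_c_compressionLevel
    zstd.ZSTD_CCtx_setParameter(cctx, 107, strategy)  # ZSTD_c_strategy
    # ... compress as above and compare the two sizes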

I'm seeing a similar, though less severe, inversion with zstd 1.4.4 on some TSV files.

A level-7 compressed version of a sample file has been posted at https://www.dropbox.com/s/esvncl6retbiqh2/plink2.afreq.zst?dl=1. The first seven levels resulted in the following sizes:
1: 651815819
2: 698805640
3: 717580985
4: 710956424
5: 696648695
6: 673417560
7: 640748434
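
(Sweeps like this can be reproduced directly with zstd's built-in benchmark mode, which compresses the file at each level in the range:)

> zstd -b1 -e7 plink2.afreq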

Is this expected behavior? This is a very typical output file from my program, so does it make sense for me to change my default compression level from 3 to 1 when generating similar TSVs, or should I expect a future zstd version to not have this quirk?

This is not typical, but unfortunately it can happen.

Each compression level bets on different patterns, and the bet may be right or wrong.
This is especially the case with the first 5 levels; after that, later levels tend to be just "stronger" versions of the same bet, so they _should_ compress better (and more slowly) in most cases. Another big difference happens at levels 15-16.

Level 1 bets on 7-byte matches, while level 2 bets on 6-byte matches, so level 2 may settle early on a 6-byte match, which is presumably more frequent, missing a potentially better 7-byte one that would appear just after.

Levels 3-4 are different: they bet on 2 lengths, 5 and 8.

After that, levels 5+ are a bit different again: they bet on 5-byte matches, but they get multiple attempts, and as the level increases, they get more and more attempts.

In most cases, a file will contain a variety of matches of many lengths, typically featuring a descending curve: shorter matches are more frequent, but also less valuable.

However, some files may feature strange patterns which do not respect the "usual" distribution of lengths. As a potential scenario, maybe the tested file happens to feature a lot of 7-byte matches. This would be ideal for level 1, but less well suited to the bets of some higher levels.

Bottom line: no bet is "universal". Something has to give. We test our parameters frequently, using generic samples presumed "representative" of some "general" case, but there are always files which are _not_ "general", and for them these bets may end up being less well adapted.
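
(To probe which match-length "bet" fits a particular file, the bet can be overridden directly. A sketch in the issue's ctypes style; ZSTD_c_minMatch is enum value 105 in a recent zstd.h, with accepted values 3-7, and should be treated as an assumption for other versions:)

for min_match in range(3, 8):
    cctx = zstd.ZSTD_createCCtx()
    zstd.ZSTD_CCtx_setParameter(cctx, 100, 5)          # ZSTD_c_compressionLevel
    zstd.ZSTD_CCtx_setParameter(cctx, 105, min_match)  # ZSTD_c_minMatch (assumed enum value)
    # ... compress as above and compare sizes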

Ok, thanks for the explanation.

This came up when I was confirming that level-1 compression was a sensible default for large temporary TSV files; I was caught by surprise when it achieved a better compression ratio than level 3, and when further investigation showed this happened across a wide variety of other TSV files I generated. I don't think there is any sort of local maximum around 7-byte matches that holds across all of these files, but they certainly have a more regular structure than ordinary text.
