zstd_btultra corner case

Created on 7 Jan 2021 · 7 comments · Source: facebook/zstd

zstd_btultra performs nearly 2x worse than zstd_opt on an edge-case file.

> seq 100000 999999 > seqs
> zstd seqs -b16e19i0            
16#seqs              :   6300000 ->    321618 (19.59),  15.0 MB/s , 817.0 MB/s 
17#seqs              :   6300000 ->    323943 (19.45),  10.6 MB/s , 831.3 MB/s 
18#seqs              :   6300000 ->    618382 (10.19),  8.60 MB/s , 778.1 MB/s 
19#seqs              :   6300000 ->    618096 (10.19),  7.31 MB/s , 779.3 MB/s

All 7 comments

One difference I can think of is using fractional-bit evaluation (btultra) instead of whole-bit approximation (btopt).
It could end up triggering a different optimization gradient, which ends up being incorrect (?)

> One difference I can think of is using fractional bit evaluation

That was my thought too, since the regression comes from btultra emitting 2x more literals. But I disabled fractional bits and didn't see a difference.

I'm hunting down every difference between btopt and btultra to find out what is causing it.

These are the minimal changes that fix the issue. Making one change without the other doesn't do anything; you need both.

diff --git a/lib/compress/zstd_opt.c b/lib/compress/zstd_opt.c
index e55c459d..995a8251 100644
--- a/lib/compress/zstd_opt.c
+++ b/lib/compress/zstd_opt.c
@@ -24,10 +24,10 @@
 *  Price functions for optimal parser
 ***************************************/

-#if 0    /* approximation at bit level */
+#if 1    /* approximation at bit level */
 #  define BITCOST_ACCURACY 0
 #  define BITCOST_MULTIPLIER (1 << BITCOST_ACCURACY)
-#  define WEIGHT(stat)  ((void)opt, ZSTD_bitWeight(stat))
+#  define WEIGHT(stat,opt)  ((void)opt, ZSTD_bitWeight(stat))
 #elif 0  /* fractional bit accuracy */
 #  define BITCOST_ACCURACY 8
 #  define BITCOST_MULTIPLIER (1 << BITCOST_ACCURACY)
@@ -1076,7 +1076,7 @@ ZSTD_compressBlock_opt_generic(ZSTD_matchState_t* ms,

             if (cur == last_pos) break;

-            if ( (optLevel==0) /*static_test*/
+            if ( (1||optLevel==0) /*static_test*/
               && (opt[cur+1].price <= opt[cur].price + (BITCOST_MULTIPLIER/2)) ) {
                 DEBUGLOG(7, "move to next rPos:%u : price is <=", cur+1);
                 continue;  /* skip unpromising positions; about ~+6% speed, -0.01 ratio */

With this diff it is:

16#seqs              :   6300000 ->    321644 (19.59),  15.6 MB/s , 865.0 MB/s 
17#seqs              :   6300000 ->    323967 (19.45),  11.6 MB/s , 885.4 MB/s 
18#seqs              :   6300000 ->    183556 (34.32), 10.33 MB/s , 889.6 MB/s 
19#seqs              :   6300000 ->    182603 (34.50),  8.37 MB/s , 891.8 MB/s 

So there seems to be some sort of pricing issue. My guess is literal pricing, since the alphabet size is 10.

I think the root cause is actually a case of not being able to evaluate repcode choice.

At some point, zstd_btultra decides to take an offset instead of emitting 2 literals + a repcode. I have to figure out why it made that choice.

That offset messes up the repcode history (rep0 = 2 lits + ml 5, rep1 = 1 lit + ml 6). But from that point on, "repcode 0" is priced so favorably that repcode0 + 2 lits wins over repcode1 + 1 lit.

The optimal parser doesn't know that using repcode1 puts it back into repcode0.

Running my multi-arrival branch, which stores the 8 best arrivals for each position, checks non-repcodes for the best arrival, and checks repcodes for every arrival. It is very slow, but does much better in this corner case.

> ./zstd -b16e22 ~/datasets/seqs
16#seqs              :   6300000 ->    227770 (27.66),  3.25 MB/s , 781.7 MB/s 
17#seqs              :   6300000 ->    233354 (27.00),  3.03 MB/s , 781.2 MB/s 
18#seqs              :   6300000 ->    149662 (42.09),  1.86 MB/s , 789.2 MB/s 
19#seqs              :   6300000 ->    149389 (42.17),  1.85 MB/s , 788.8 MB/s

I was testing this use case with a smaller sequence set, from 10000 to 99999,
and was expecting roughly the same outcome, with just a size difference.

It doesn't happen: the ranking of btopt and btultra is reversed (compared to sequences 100000 to 999999)

| Levels | Ratio | Note |
| --- | --- | --- |
| 1-2 | x2.3 | does not find the match |
| 3-7 | x22 | good! (greedy and lazy) |
| 8-15 | x10 | worse :( (lazy2 and btlazy2) |
| 16-17 | x7 | even worse... (btopt) |
| 18-22 | x19 | better, but not "best" (btultra) |

It shows that, depending on gradient descent, itself dependent on initialization and, well, almost random luck, one or the other gets stuck in a "good" or "bad" state with regards to repcode selection, and then just stays there.

None of them is excellent though, and your multi-arrival experiment shows that it could be much better.

> It shows that, depending on gradient descent, itself dependent on initialization and, well, almost random luck, one or the other gets stuck in a "good" or "bad" state with regards to repcode selection, and then just stays there.

Yeah, it seems to be just getting lucky or not lucky.

I don't remember why I stopped working on the multi-arrival branch. Probably just got busy & it was still very slow. I'll try to pick it up at some point and work on speeding it up (or reserve it as its own strategy above zstd_btultra & use it at level 19+).

