In RocksDB, we are experimenting with per-SST-file dictionaries. A single DB instance can contain tens of thousands of SST files, in which case we'll have tens of thousands of dictionaries. I've been testing decompression CPU for this scenario and am seeing a regression compared to not using a dictionary.
Experiment setup: Two databases with ~30K files composed of 4KB data blocks. One DB's files were each compressed with a ZSTD-generated dictionary ("dict_db"), and the other's files were compressed without a dictionary ("no_dict_db"). I ran a single-threaded random-read benchmark, so consecutive decompressions almost always use different dictionaries.
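For context, the benchmark's decompression path boils down to zstd's one-shot dictionary API. A minimal sketch (names are illustrative, not RocksDB's actual code):

```c
#include <zstd.h>

/* Each 4KB block is decompressed with the DDict of the SST file it belongs
 * to, so consecutive calls almost always use a different DDict. */
static size_t read_block(ZSTD_DCtx* dctx,
                         void* dst, size_t dstCapacity,
                         const void* src, size_t srcSize,
                         const ZSTD_DDict* fileDDict) {
    return ZSTD_decompress_usingDDict(dctx, dst, dstCapacity,
                                      src, srcSize, fileDDict);
}
```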
The function ZSTD_decompressSequences_bmi2 slowed down noticeably. Here are the top few functions according to `perf -e cycles`.

no_dict_db:

```
+   19.59%  db_bench  db_bench  [.] HUF_decompress4X1_usingDTable_internal_bmi2
+    8.56%  db_bench  db_bench  [.] ZSTD_decompressSequences_bmi2.isra.7
+    8.20%  db_bench  db_bench  [.] HUF_decompress4X_hufOnly_wksp_bmi2
+    6.25%  db_bench  db_bench  [.] rocksdb::IndexBlockIter::Seek
```

dict_db:

```
+   16.87%  db_bench  db_bench  [.] ZSTD_decompressSequences_bmi2.isra.7
+    9.90%  db_bench  db_bench  [.] HUF_decompress4X1_usingDTable_internal_bmi2
+    8.21%  db_bench  db_bench  [.] HUF_decompress4X2_usingDTable_internal_bmi2
+    3.96%  db_bench  db_bench  [.] HUF_decompress4X_hufOnly_wksp_bmi2
```
Looking at the assembly annotation, it is clear that much more time is spent reading the entropy tables in "dict_db"'s benchmark:
```
       │      U32 const mlBits = seqState->stateML.table[seqState->stateML.state].nbAdditionalBits;
 30.49 │      lea    (%rax,%rdx,8),%rdi
       │      U32 const ofBits = seqState->stateOffb.table[seqState->stateOffb.state].nbAdditionalBits;
  0.08 │      mov    -0x80(%rbp),%rdx
       │      U32 const mlBits = seqState->stateML.table[seqState->stateML.state].nbAdditionalBits;
  0.05 │      movzbl 0x2(%rdi),%eax
       │      U32 const mlBase = seqState->stateML.table[seqState->stateML.state].baseValue;
 18.97 │      mov    0x4(%rdi),%ebx
       │      U32 const ofBits = seqState->stateOffb.table[seqState->stateOffb.state].nbAdditionalBits;
```
I guess it is due to a high rate of CPU cache misses. I tried `__builtin_prefetch`'ing the whole ZSTD_DDict before using it. It did help a bit, but did not completely close the gap compared to no dictionary.
Do you mind taking a look to see if anything is fixable in the library, or if we should do something differently in our application (besides using fewer dictionaries :P)?
It's a hard topic, and your initial analysis looks spot on, @ajkr.
The current design of dictionary decompression is more optimized for scenarios where a single dictionary is used to decode a large number of (small) objects.
Indeed, switching dictionaries all the time makes the decoder use tables scattered at different memory addresses, dramatically increasing cache misses.
I also believe your suggested workaround addresses the core issue.
Instead of discovering that a requested table entry is not in cache at the moment it's requested, thus paying a full memory access delay on (almost) every read operation, prefetching can make the CPU load the whole table into cache once, making subsequent read operations faster.
As an easy first step, we could look at your prefetch optimization, just to make sure it is as effective as it can be. I believe we could introduce a similar optimization directly within the library.
Another thing which is not yet clear to me is how much performance is recovered this way.
Going further will be more involved.
We will need to build a test tool which is able to mimic the behavior of RocksDB, scattering a lot of decompression tables into memory, in order to reproduce the issue, in a way which can be accurately measured. This will provide a benchmark to optimize against.
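For reference, a skeleton of what such a tool could look like (hypothetical and heavily simplified; assumes the blocks were compressed beforehand, each with its own dictionary):

```c
#include <stdlib.h>
#include <zstd.h>

typedef struct {
    const void* comp;     /* compressed block */
    size_t compSize;
    ZSTD_DDict* ddict;    /* this block's own dictionary */
} Block;

/* One DDict per block scatters the decoder tables across the heap;
 * decompressing blocks in random order then defeats the CPU cache. */
static void bench_cold_dicts(ZSTD_DCtx* dctx,
                             const Block* blocks, size_t nbBlocks,
                             void* dst, size_t dstCapacity) {
    size_t i;
    for (i = 0; i < nbBlocks; i++) {
        size_t const j = (size_t)rand() % nbBlocks;  /* random block => random DDict */
        ZSTD_decompress_usingDDict(dctx, dst, dstCapacity,
                                   blocks[j].comp, blocks[j].compSize,
                                   blocks[j].ddict);
    }
}
```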
I don't foresee a ton of possibilities here.
Prefetching, as you did, is likely our best bet.
Another, more heavyweight, approach is to locally copy the tables into a single memory segment.
I expect it to perform similarly to or slower than prefetching, though it will require tests to be sure; we could be surprised.
We could also consider a special mode creating smaller dictionary tables, so that the amount to copy / prefetch is smaller, hence accelerating this initial stage. The downside is that it would be a special mode, requiring a dedicated parameter to trigger it during dictionary creation. It would also have a negative impact on decoding speed, so it's a trade-off, more valuable when the amount of data to decode is small.
Thanks Yann! Here is the code I'm using for prefetching: https://github.com/ajkr/rocksdb/commit/d090fbb5e5531b944b60d22124664f338f427e53
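In essence, it walks the whole DDict before decompressing with it; a simplified sketch of the idea (see the commit for the actual code; the 64-byte cache line size is an assumption):

```c
#include <zstd.h>

static void prefetch_ddict(const ZSTD_DDict* ddict) {
    const char* const base = (const char*)ddict;
    size_t const len = ZSTD_sizeof_DDict(ddict);  /* public API */
    size_t pos;
    for (pos = 0; pos < len; pos += 64) {
        __builtin_prefetch(base + pos, 0 /* read */, 3 /* high temporal locality */);
    }
}
```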
To show how much prefetching helps, below are the single-threaded random-read benchmark numbers I have from RocksDB:
It is pretty close to closing the gap, although I still wonder whether dictionary should be able to at least slightly outperform no-dictionary, since the entropy tables are pre-built? Maybe this requires the special small-table mode you mentioned? It'd be helpful for us, at least.
A test tool that mimics this behavior directly in ZSTD would be nice. I was also thinking about repro'ing it directly with ZSTD (i.e., outside of RocksDB) but didn't get around to it yet...
EDIT: Fixed the numbers and updated link to code.
There is a mistake in the above numbers. I'll rerun and then edit them.
Done. Unfortunately, after fixing the mistake, I couldn't find a way for this prefetching strategy to improve the benchmark result. Maybe too small a portion of the ZSTD_DDict is actually accessed during decompression.
That's interesting, and indeed, it completely changes the outcome for the prefetching strategy.
Thanks for the pointer and figures @ajkr !
I guess I have to build this test tool to observe the behavior in depth.
> Maybe too small a portion of the ZSTD_DDict is actually accessed during decompression.
It may depend on how the ZSTD_DDict is created.
Do you use ZSTD_createDDict(), which copies the content into the ZSTD_DDict, hence everything should be present in it?
Or do you use ZSTD_createDDict_byReference(), which only creates the tables, and references the dictionary content where it stands, hence somewhere else in memory?
This changes the size of the ZSTD_DDict by a potentially large amount.
It would be interesting to know the value of `size_t ddict_len`.
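For reference, a minimal sketch comparing the two creation paths (`ZSTD_createDDict_byReference()` lives behind `ZSTD_STATIC_LINKING_ONLY`):

```c
#define ZSTD_STATIC_LINKING_ONLY   /* exposes ZSTD_createDDict_byReference() */
#include <zstd.h>
#include <stdio.h>

static void compare_ddict_sizes(const void* dictBuffer, size_t dictSize) {
    /* copies the dictionary content into the DDict: */
    ZSTD_DDict* const byCopy = ZSTD_createDDict(dictBuffer, dictSize);
    /* builds only the tables; content stays at dictBuffer, which must outlive byRef: */
    ZSTD_DDict* const byRef  = ZSTD_createDDict_byReference(dictBuffer, dictSize);

    printf("by copy      : %zu bytes\n", ZSTD_sizeof_DDict(byCopy));
    printf("by reference : %zu bytes\n", ZSTD_sizeof_DDict(byRef));

    ZSTD_freeDDict(byCopy);
    ZSTD_freeDDict(byRef);
}
```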
We were using ZSTD_createDDict(), and ddict_len was 32864.
I tried ZSTD_createDDict_byReference() as well. It changed ddict_len to 28768 (a 4096-byte reduction, which matches the 4 KB dictionary content exactly) and changed the benchmark throughput to 215.7 MB/s.
@ajkr , as I'm developing a test tool to imitate and benchmark conditions similar to RocksDB's with regard to a large nb of dictionaries, I'm looking for additional information, in particular on typical block and dictionary sizes.
Great. They are both 4KB in our use case.
8KB blocks are also being considered now.
Interestingly, I'm currently testing 4 KB blocks with 8 KB dictionaries ...
OK, I'm starting to get some results.
The current test program is still limited, but it's a base to observe the mentioned behavior.
As a sample test, I've used silesia/dickens, a 9 MB file of English text.
The downside is that this file benefits only slightly from small dictionaries,
so conclusions might differ with more "database-oriented" samples,
which tend to be more structured, hence benefit more from small dictionaries.
Anyway, here are some results on my laptop, using compression level 3 :
```
all File,  no dict : R=2.759, dSpeed=460 MB/s   /* <-- upper limit */
4K_blocks, no dict : R=2.004, dSpeed=300 MB/s   /* <-- reference */
4K_blocks,  4Kdict : R=2.185, dSpeed=300 MB/s   /* <-- single dictionary */
4K_blocks,  8Kdict : R=2.252, dSpeed=275 MB/s   /* <-- single dictionary */
4K_blocks, 12Kdict : R=2.292, dSpeed=270 MB/s   /* <-- single dictionary */
4K_blocks,110Kdict : R=2.574, dSpeed=320 MB/s   /* <-- single dictionary */
```
These are reference results, using a single dictionary, hence remaining mostly within "hot" memory.
As one can see, decompression speed doesn't change much. This is partly because this file benefits only a little from the dictionary, and because it contains a lot of short matches. It's possible that more "typical" DB data would behave differently here, containing fewer sequences, resulting in better speed.
Anyway, just for the exercise, I then tried the new test tool, which generates one separate dictionary _per block_, and randomizes pointer accesses. For this sample, it generates 2489 dictionaries. This maximizes the probability of cache misses.
Here is the impact :
```
4K_blocks,  4Kdict : R=2.185, dSpeed=205 MB/s   /* vs 300 : -31%   Mem.usage :  78 MB */
4K_blocks,  8Kdict : R=2.252, dSpeed=175 MB/s   /* vs 275 : -36%   Mem.usage :  87 MB */
4K_blocks, 12Kdict : R=2.292, dSpeed=165 MB/s   /* vs 270 : -38%   Mem.usage :  97 MB */
4K_blocks,110Kdict : R=2.574, dSpeed=118 MB/s   /* vs 320 : -63%   Mem.usage : 335 MB */
```
The tests show a serious impact on decompression speed, which grows with the amount of memory used.
I haven't tested 8K blocks yet, but could modify the tool to do so.
Perhaps more importantly, I will probably need some representative samples, so as not to misrepresent the performance impact. This will matter more in the next stage, when trying to design mitigation techniques.
Some more results,
using 8 KB blocks this time.
Source file is still silesia/dickens, cut into fixed-size blocks.
Here are the results when using a _single dictionary_, hence preserving hot memory :
```
all File,  no dict : R=2.759, dSpeed=460 MB/s   /* <-- upper limit */
8K_blocks, no dict : R=2.114, dSpeed=320 MB/s   /* vs 300 at 4KB */
8K_blocks,  4Kdict : R=2.229, dSpeed=275 MB/s   /* vs 300 at 4KB */
8K_blocks,  8Kdict : R=2.290, dSpeed=270 MB/s   /* vs 275 at 4KB */
8K_blocks, 12Kdict : R=2.321, dSpeed=265 MB/s   /* vs 270 at 4KB */
8K_blocks,110Kdict : R=2.611, dSpeed=330 MB/s   /* vs 320 at 4KB */
```
Compression ratio is improved a little compared to 4K blocks, with no_dictionary benefiting the most.
Decompression speed is roughly the same as 4K, with a small improvement for no_dictionary.
Now let's see the impact when using multiple dictionaries.
As blocks get larger, the nb of blocks is reduced, to __1245__.
The nb of dictionaries generated (one per block) is therefore reduced in the same proportion :
```
8K_blocks,  4Kdict : R=2.229, dSpeed=250 MB/s   /* -21% vs nodict; +22% vs 4K; Mem.usage :  39 MB */
8K_blocks,  8Kdict : R=2.290, dSpeed=235 MB/s   /* -26% vs nodict; +34% vs 4K; Mem.usage :  43 MB */
8K_blocks, 12Kdict : R=2.321, dSpeed=220 MB/s   /* -31% vs nodict; +33% vs 4K; Mem.usage :  48 MB */
8K_blocks,110Kdict : R=2.611, dSpeed=155 MB/s   /* -51% vs nodict; +31% vs 4K; Mem.usage : 167 MB */
```
As one can see, the decompression speed performance is much better than for the 4K_blocks version.
But let's not jump to conclusions too fast : it remains to be seen whether the speed improvement comes from the reduced dictionary memory budget, due to the reduced nb of dictionaries.
I will have to change the test program to control this value more directly.
I modified the test program so that block size, compression level and nb of dictionaries can be controlled directly on the command line.
I then reran the 4K-block test, making it generate the same nb of dictionaries as the 8K-block one.
It did not change the result : speed remained exactly the same as in the previous test, which used twice as many dictionaries.
Therefore, it seems that 8K blocks with lots of dictionaries decompress 20-30% faster than 4K blocks under the same scenario.
_Note_ : this experiment does not prefetch dictionary content, only the entropy tables.
In this test and all following ones, compression level used is default __level 3__.
4K blocks, 2489 dictionaries. Speed is compared vs no_prefetch xp :
```
4K_blocks,  4Kdict : R=2.185, dSpeed=235 MB/s   /* vs 205 : +14% */
4K_blocks,  8Kdict : R=2.252, dSpeed=205 MB/s   /* vs 175 : +17% */
4K_blocks, 12Kdict : R=2.292, dSpeed=195 MB/s   /* vs 165 : +18% */
4K_blocks,110Kdict : R=2.574, dSpeed=128 MB/s   /* vs 118 : + 8% */
```
Analysis : the outcome is favorable, by a clear margin.
This seems different from the previous result published by @ajkr.
It could be because, by performing the prefetch from within the DDict instead of externally, the operation can be made more accurate, targeting precisely the entropy tables only.
Also, note that the "large dictionary" case benefits a bit less. This is probably because, as the content is larger, more matches copy data from the dictionary content. But the dictionary content has not been prefetched, resulting in additional cache-miss delays.
Prefetching the dictionary content on top of the entropy tables is an interesting follow up investigation.
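A sketch of what "targeting precisely the entropy tables" could look like from inside the library (a hedged sketch : `entropyTables` / `entropySize` stand in for internal fields of the DDict, and 64-byte cache lines are assumed):

```c
/* Prefetch an arbitrary memory range, one cache line at a time. */
static void prefetch_range(const void* p, size_t size) {
    const char* const base = (const char*)p;
    size_t pos;
    for (pos = 0; pos < size; pos += 64)
        __builtin_prefetch(base + pos, 0, 3);
}

/* conceptually, at the start of a decompression session :
 *     prefetch_range(entropyTables, entropySize);
 * i.e. only the tables, not the dictionary content. */
```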
Second experiment : 8 KB blocks, 1245 dictionaries. Speed is compared vs no_prefetch xp :
```
8K_blocks,  4Kdict : R=2.229, dSpeed=238 MB/s   /* vs 250 : - 4% */
8K_blocks,  8Kdict : R=2.290, dSpeed=228 MB/s   /* vs 235 : - 3% */
8K_blocks, 12Kdict : R=2.321, dSpeed=216 MB/s   /* vs 220 : - 2% */
8K_blocks,110Kdict : R=2.611, dSpeed=162 MB/s   /* vs 155 : + 4% */
```
The results are more mixed in this case, being generally _negative_.
It shows that prefetching is not a magic win.
I suspect the difference is that, with 8 KB blocks, entropy tables are more often recalculated per block. As a consequence, the ones in the dictionary are used less often, and prefetching them is just an added load, with no benefit.
This suggests it would be better to prefetch tables only if we know they are actually going to be used.
This is definitely something I will experiment with. It's a bit more complex to set up, but remains easily accessible for an experiment (the generalization is a bit trickier, but that's for another day).
This experiment adds prefetching of the dictionary content on top of the entropy tables (the follow-up mentioned above).
4K blocks, 2489 dictionaries. Speed is compared vs prefetch_entropy_only :
```
4K_blocks,  4Kdict : R=2.185, dSpeed=242 MB/s   /* vs 235 : + 3% */
4K_blocks,  8Kdict : R=2.252, dSpeed=218 MB/s   /* vs 205 : + 6% */
4K_blocks, 12Kdict : R=2.292, dSpeed=212 MB/s   /* vs 195 : + 8% */
4K_blocks,110Kdict : R=2.574, dSpeed=167 MB/s   /* vs 128 : +30% */
```
This strategy is always a win, increasing with dictionary content size.
It even becomes pretty significant for large content sizes, a result I was quite surprised to measure, since larger content should presumably also increase the prefetching load.
There must be a limit somewhere : I can't imagine a 1 MB dictionary prefetch being useful.
I should also consider the impact of prefetch load on small input data (like 1 KB).
This is of no concern for RocksDB, but could impact a generalization formula.
Same test for 8 KB blocks, 1245 dictionaries. Speed is compared vs prefetch_entropy_only :
```
8K_blocks,  4Kdict : R=2.229, dSpeed=240 MB/s   /* vs 238 : + 1% */
8K_blocks,  8Kdict : R=2.290, dSpeed=235 MB/s   /* vs 228 : + 3% */
8K_blocks, 12Kdict : R=2.321, dSpeed=232 MB/s   /* vs 216 : + 7% */
8K_blocks,110Kdict : R=2.611, dSpeed=218 MB/s   /* vs 162 : +34% */
```
The conclusion is the same : prefetching dictionary content is good for performance, if only by a little. Gains increase with content size.
Note : this test wasn't available to @ajkr, because the dictionary content is located in another memory region, and that information is opaque from outside the DDict.
Note : the following use case (1 KB blocks) is not relevant for RocksDB. The purpose is to investigate a generalized prefetch strategy.
First, for reference, results using a single "hot" dictionary.
This represents the speed upper limit.
```
1K_blocks, no_dict : R=1.713, dSpeed=202 MB/s   /* <-- speed reference without dictionary */
1K_blocks,  4Kdict : R=2.035, dSpeed=258 MB/s   /* <-- single dictionary */
1K_blocks,  8Kdict : R=2.105, dSpeed=250 MB/s   /* <-- single dictionary */
1K_blocks, 12Kdict : R=2.140, dSpeed=243 MB/s   /* <-- single dictionary */
1K_blocks,110Kdict : R=2.432, dSpeed=220 MB/s   /* <-- single dictionary */
```
A first comment : note that the dictionary is a lot more helpful here. Not only does the compression ratio increase substantially, decompression speed is improved too. As the dictionary content size increases, the compression ratio goes up substantially, and decompression speed slows down a bit, presumably because more and more content is fetched from the dictionary content, messing up a few branch predictions.
Now, let's analyze what happens with "cold" dictionaries :
1K blocks, 9954 dictionaries. Speed is compared vs single "hot" dictionary :
```
1K_blocks,  4Kdict : R=2.035, dSpeed= 92 MB/s   /* vs 258 : -64% . Mem.usage :  312 MB */
1K_blocks,  8Kdict : R=2.105, dSpeed= 87 MB/s   /* vs 250 : -65% . Mem.usage :  350 MB */
1K_blocks, 12Kdict : R=2.140, dSpeed= 86 MB/s   /* vs 243 : -65% . Mem.usage :  390 MB */
1K_blocks,110Kdict : R=2.432, dSpeed= 73 MB/s   /* vs 220 : -66% . Mem.usage : 1342 MB */
```
The cost of always accessing "cold" data is pretty huge, as could be guessed.
Let's now try to add prefetching for entropy tables :
1K blocks, 9954 dictionaries. Speed is compared vs no_prefetch xp :
```
1K_blocks,  4Kdict : R=2.035, dSpeed=139 MB/s   /* vs 92 : +51% . Mem.usage :  312 MB */
1K_blocks,  8Kdict : R=2.105, dSpeed=124 MB/s   /* vs 87 : +42% . Mem.usage :  350 MB */
1K_blocks, 12Kdict : R=2.140, dSpeed=114 MB/s   /* vs 86 : +32% . Mem.usage :  390 MB */
1K_blocks,110Kdict : R=2.432, dSpeed= 82 MB/s   /* vs 73 : +12% . Mem.usage : 1342 MB */
```
As expected, prefetching entropy tables helps a lot. Dictionary tables are used really often when blocks are small, making a large difference.
Gains shrink as the dictionary content size grows, presumably because that content is not cached, so the miss rate increases.
Let's prefetch dictionary content then :
1K blocks, 9954 dictionaries. Speed is compared vs prefetch_entropy_only :
```
1K_blocks,  4Kdict : R=2.035, dSpeed=150 MB/s   /* vs 139 : + 7% . Mem.usage :  312 MB */
1K_blocks,  8Kdict : R=2.105, dSpeed=145 MB/s   /* vs 124 : +17% . Mem.usage :  350 MB */
1K_blocks, 12Kdict : R=2.140, dSpeed=137 MB/s   /* vs 114 : +20% . Mem.usage :  390 MB */
1K_blocks,110Kdict : R=2.432, dSpeed= 72 MB/s   /* vs  82 : -12% . Mem.usage : 1342 MB */
```
Here, we find our limit.
Prefetching dictionary content starts out beneficial.
But at some point, the amount to prefetch becomes too large compared to the amount of data to decompress, and its cost outweighs its benefit.
As a consequence, it's not a trivial "just prefetch everything" decision.
The decision should factor in the amount of data to decompress.
Unfortunately, this amount is not available at the place where I currently do the prefetching, i.e., at the beginning of a new decompression session; it's not yet known there.
To know the size to decompress, one must first decode the frame header. Even then, this information is not guaranteed to be present, so we could fall back on some kind of conservative default.
The main issue is that this construction requires much larger code changes.
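For illustration, the shape such a heuristic could take (`ZSTD_getFrameContentSize()` and its sentinel values are real API; the 8 KB fallback and the 8x factor are made-up placeholders):

```c
#include <zstd.h>

static size_t dict_prefetch_amount(const void* src, size_t srcSize,
                                   size_t dictContentSize) {
    unsigned long long const contentSize = ZSTD_getFrameContentSize(src, srcSize);
    if (contentSize == ZSTD_CONTENTSIZE_UNKNOWN
     || contentSize == ZSTD_CONTENTSIZE_ERROR) {
        /* size unknown : fall back on a conservative default */
        size_t const fallback = 8 * 1024;
        return dictContentSize < fallback ? dictContentSize : fallback;
    }
    /* otherwise, prefetch an amount proportional to the data to decompress,
     * capped at the dictionary content size */
    {   unsigned long long const target = contentSize * 8;
        return target < dictContentSize ? (size_t)target : dictContentSize;
    }
}
```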
_Follow up :_
I tried different prefetching sizes for dictionary content, with the insight that the last part of the dictionary is the most important one.
With 1 KB blocks, I got the best result prefetching only the last 32 KB of dictionary content :
```
1K_blocks,110Kdict : R=2.432, dSpeed= 87 MB/s   /* vs 72 : +20% vs prefetch all */
```
Unfortunately, the same limit doesn't work well with larger block sizes :
```
4K_blocks,110Kdict : R=2.574, dSpeed=145 MB/s   /* vs 167 : -13% vs prefetch all */
8K_blocks,110Kdict : R=2.611, dSpeed=180 MB/s   /* vs 218 : -17% vs prefetch all */
```
So 32 KB is not a "universal" amount. The problem is that matches are still spread over the full dictionary content length, even if the end part is "more probable".
Which means, as suspected, a "good amount" of dictionary content to prefetch depends on the amount of data to decompress.
A more complete solution would be to pre-analyze the full decoding operation, and prefetch everything that's actually requested. This is a larger change to the decoding pipeline. We'll get there eventually, but in the short term, we need a simpler heuristic that can ship in the next release.
_Complementary_ :
For generality, I've been testing the "small huffman table" proposal below with 1 KB blocks.
While the effect is not dramatic, it's nonetheless consistently in positive territory.
So it's a viable strategy for small (1 KB) blocks.
Building on the conclusions from experiment 1, the idea is to only prefetch a table from the dictionary if we know it's actually going to be used.
This strategy pushes the prefetch decision much later in the pipeline, after the literals and sequences headers inside the block have been decoded.
We expect this strategy to be more favorable for 8K blocks, as unconditionally prefetching entropy tables resulted in a small but measurable decompression speed loss there.
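Conceptually, the gate looks like this (a sketch : the enum mirrors zstd's internal `symbolEncodingType_e`, where `set_repeat` means the table from the dictionary, or from a previous block, is reused; names and placement are illustrative, not the actual patch):

```c
typedef enum { set_basic, set_rle, set_compressed, set_repeat } symbolEncodingType_e;

/* Only prefetch the dictionary's Huffman table when the block's literals
 * header says the table will actually be reused. 64-byte lines assumed. */
static void maybe_prefetch_table(symbolEncodingType_e litType,
                                 const void* dictHufTable, size_t tableSize) {
    if (litType != set_repeat) return;   /* table not used : skip the load */
    {   const char* const base = (const char*)dictHufTable;
        size_t pos;
        for (pos = 0; pos < tableSize; pos += 64)
            __builtin_prefetch(base + pos, 0, 3);
    }
}
```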
4K blocks, 2489 dictionaries. Speed is compared vs prefetch_unconditional :
```
4K_blocks,  4Kdict : R=2.185, dSpeed=243 MB/s   /* vs 242 : + 0% */
4K_blocks,  8Kdict : R=2.252, dSpeed=223 MB/s   /* vs 218 : + 2% */
4K_blocks, 12Kdict : R=2.292, dSpeed=217 MB/s   /* vs 212 : + 2% */
4K_blocks,110Kdict : R=2.574, dSpeed=171 MB/s   /* vs 167 : + 2% */
```
Performance is pretty stable. At least, it's not detrimental.
Same test for 8 KB blocks, 1245 dictionaries. Speed is compared vs prefetch_unconditional :
```
8K_blocks,  4Kdict : R=2.229, dSpeed=255 MB/s   /* vs 240 : + 6% */
8K_blocks,  8Kdict : R=2.290, dSpeed=245 MB/s   /* vs 235 : + 4% */
8K_blocks, 12Kdict : R=2.321, dSpeed=238 MB/s   /* vs 232 : + 2% */
8K_blocks,110Kdict : R=2.611, dSpeed=225 MB/s   /* vs 218 : + 3% */
```
It's indeed a bit better. Nothing earth-shattering, just a little speed boost.
More importantly, compared to _not prefetching entropy tables_, this strategy is now a net gain, so we can enable it without second thoughts.
For illustration, here is the same result again, but compared with _entropy tables not prefetched_ (note: dictionary content is prefetched in both cases) :
```
8K_blocks,  4Kdict : R=2.229, dSpeed=255 MB/s   /* vs 251 : + 2% */
8K_blocks,  8Kdict : R=2.290, dSpeed=245 MB/s   /* vs 240 : + 2% */
8K_blocks, 12Kdict : R=2.321, dSpeed=238 MB/s   /* vs 231 : + 3% */
8K_blocks,110Kdict : R=2.611, dSpeed=225 MB/s   /* vs 198 : +13% */
```
Indeed, conditional prefetching of entropy tables is always a win.
Zstandard can encode literals using Huffman prefix codes.
On the decoding side, it can select between a regular decoder, which outputs 1 symbol per table lookup, and an "accelerated" version, able to decode up to 2 symbols per lookup.
Speed-wise, the 2-symbol variant is an important boost.
However, it comes at a cost : tables are larger (16 KB), and need more time to build.
Tests showed that the 2-symbol variant is the better choice in most circumstances.
Since it needs to amortize its longer build time, the outcome primarily depends on the nb of symbols to decode (and their compressibility). But generally, a few KB is enough.
Well, a few KB is a lot for blocks of 4 KB.
For dictionaries though, the 2-symbol variant is always selected, because the only reason to select the 1-symbol variant is its shorter build time, and with dictionaries the tables are already built, so build time costs nothing.
However, in the case of "cold" dictionaries, it could be preferable to prefetch 4 KB (1-symbol variant) rather than 16 KB (2-symbol variant) : less prefetching to do, but slower literal decoding. There is likely a trade-off point somewhere.
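Back-of-the-envelope arithmetic behind those figures (entry sizes and max tableLog per zstd's Huffman decoder; treat the exact layout as illustrative):

```c
#include <stdio.h>

int main(void) {
    size_t const x1 = ((size_t)1 << 11) * 2;  /* 1-symbol: 2-byte entries, tableLog 11 ->  4 KB */
    size_t const x2 = ((size_t)1 << 12) * 4;  /* 2-symbol: 4-byte entries, tableLog 12 -> 16 KB */
    printf("X1: %zu B, X2: %zu B, delta: %zu B\n", x1, x2, x2 - x1);
    return 0;   /* the delta is the 12 KB mentioned below */
}
```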
Logically, we expect this strategy to be more beneficial to 4 KB blocks compared to 8 KB ones. So let's see.
4K blocks, 2489 dictionaries, "small" X1 huffman table. Speed is compared vs xp4 (prefetch_conditional) :
```
4K_blocks,  4Kdict : R=2.185, dSpeed=246 MB/s   /* vs 243 : + 1% */
4K_blocks,  8Kdict : R=2.252, dSpeed=207 MB/s   /* vs 223 : - 7% */
4K_blocks, 12Kdict : R=2.292, dSpeed=204 MB/s   /* vs 217 : - 6% */
4K_blocks,110Kdict : R=2.574, dSpeed=170 MB/s   /* vs 171 : - 1% */
```
Okay, this is disappointing.
Differences are small, and we never expected large gains from reducing the prefetch size by a mere 12 KB, but still, this stands in negative territory.
The 8K blocks situation shouldn't be any better, since it should contain more literals to decode :
```
8K_blocks,  4Kdict : R=2.229, dSpeed=257 MB/s   /* vs 255 : + 1% */
8K_blocks,  8Kdict : R=2.290, dSpeed=245 MB/s   /* vs 245 : + 0% */
8K_blocks, 12Kdict : R=2.321, dSpeed=241 MB/s   /* vs 238 : + 1% */
8K_blocks,110Kdict : R=2.611, dSpeed=219 MB/s   /* vs 225 : - 3% */
```
Well, this is not so bad. I'm actually quite surprised.
There is a side effect which could explain this outcome :
whenever the nb of literals to compress becomes "large", the encoder creates dedicated huffman statistics, in order to compress better. In that case, a new huffman table is built for the block, and nothing needs to be prefetched from the dictionary (since we use "conditional prefetching" in this experiment).
Therefore, the strategy of prefetching less data is triggered less often, hence less impactful.
As a conclusion, it seems that, from a decoding speed perspective, this strategy does not pay well enough for its complexity. It's not worth implementing.
On the other hand, one of its side effects could become interesting.
With smaller huffman tables, it would be possible to shrink DDict objects.
This is not possible right now, because the DDict size is defined statically, through a structure.
But it could be implemented differently, in which case a DDict could become 12 KB smaller.
12 KB is not much by itself, but multiplied across thousands of dictionaries resident in memory, it can make a sizable difference.
Mitigation patch merged into `dev`.