After a match, you index start+2 and end-2 as far as I can tell. Even if the minimum match length is longer, some 4-byte matches do come through, and when the match length is 4, start+2 and end-2 are the same position, so the same data is hashed twice.
In this code: https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_double_fast.c#L253
You could change this:

```c
/* Fill Table */
hashLong[ZSTD_hashPtr(base+current+2, hBitsL, 8)] =
hashSmall[ZSTD_hashPtr(base+current+2, hBitsS, mls)] = current+2;  /* here because current+2 could be > iend-8 */
hashLong[ZSTD_hashPtr(ip-2, hBitsL, 8)] =
hashSmall[ZSTD_hashPtr(ip-2, hBitsS, mls)] = (U32)(ip-2-base);
```
to something like (you probably want to separate the `(mLength==4)` term out into a variable; note the parentheses, since `==` binds tighter than you want here):

```c
/* Fill Table */
hashLong[ZSTD_hashPtr(base+current+2, hBitsL, 8)] =
hashSmall[ZSTD_hashPtr(base+current+2, hBitsS, mls)] = current+2;  /* here because current+2 could be > iend-8 */
hashLong[ZSTD_hashPtr(ip-2+(mLength==4), hBitsL, 8)] =
hashSmall[ZSTD_hashPtr(ip-2+(mLength==4), hBitsS, mls)] = (U32)(ip-2+(mLength==4)-base);
```
This will index end-1 instead of end-2 when mLength is 4.
On most inputs this gives an improvement with my similar implementation (with a fixed mls of 6):
```
silesia:    67674684   -> 67666622   = -8062 bytes
10gb.tar:   4638110010 -> 4637938132 = -171878 bytes
gob-stream: 208674003  -> 208669457  = -4546 bytes
```
But there is also a worse result:
```
enwik9:     317822183  -> 317838745  = +16562 bytes
```
I suspect it is more common to gain than lose compression, but you probably have a better test set than I do. I also suspect smaller inputs would gain even more, since they tend to have shorter match lengths.
I can't measure any speed difference.
Thanks for the hint !
Indeed, the way tables are complemented is a pure heuristic, determined by tests, so there's a good chance it can be improved.
What you suggest makes sense, as it improves the filling rate, which matters more for small files than large ones (for large files, the table will be saturated anyway).
I was slightly worried that such modification would affect speed, but apparently it does not, so it's definitely something worth trying out.
> [it] would affect speed,
You probably have better tools to measure than I have. It is of course a dynamic branch with some variance, so it is not free. In my (Go) implementation it seems that it is resolved while the first hash is being computed. Since Go forces a bounds check, that latency will probably be different in C.
ip-2+(mLength==4) is something the compiler probably makes branch-less.
for example, in some pseudo-code where >>> is an unsigned right shift and these are 32-bit ints:

```
tmp = mLength - 4                     // zero iff mLength == 4
offset = ((tmp >>> 1) - tmp) >>> 31   // zero if mLength == 4, 1 otherwise
ip - (1 + offset)
```
Actually it could be even simpler. Oh yeah, and we want the reverse:

```
offset = (uint32(4 - int32(mLength)) >> 31) ^ 1 // 1 if mLength <= 4, otherwise 0
```

(sorry for the Go-style cast - can't remember the C version)
Still not sure that will be faster than offset = mLength==4; 1 subtraction, 1 shift, 1 XOR (all constant, though)...
I've been testing this variation of complementary insertion for double_fast, with a special case for mLength==4.
Good news is, it doesn't impact speed. It remains about the same. So we can concentrate purely on efficiency.
Bad news is, the outcome is extremely small, barely measurable (beyond the 4th digit of the compression ratio), and as likely to be positive as negative.
I was hopeful that it would prove rather advantageous on small files, but it does not, at least not in a remarkable way.
Example :
| name | original size | Δ bytes with (mLength==4) change |
| --- | --- | --- |
| bib | 111261 | +15 |
| book1 | 768771 | -29 |
| book2 | 610856 | +35 |
| geo | 102400 | +0 |
| news | 377109 | +2 |
| obj1 | 21504 | +1 |
| obj2 | 246814 | -1 |
So, based on these results, the proposed change does not seem to produce a clear positive outcome.
However, this exercise gave me the opportunity to revisit this choice.
I don't remember how it was settled; it was likely copy/pasted from the fast strategy.
Anyway, after more testing, I believe that ip-1 is actually better than ip-2.
It does solve the mLength==4 issue mentioned, but more importantly, it is _generally_ better, and while still small, in a more measurable way (3rd digit of the compression ratio).
_edit_ : and one more tested change: I tried inserting both ip-2 and ip-1, and it does improve the compression ratio, in a more systematically positive way. Speed is affected, though very lightly (~1%). This could become our new baseline for double_fast (levels 3 and 4).
Great news! I am always amazed how changes to hash table insertion affect compression in unpredictable ways. I have definitely moved away from "inserting more is always better" - especially on small hash tables. At least I have learned to always test my assumptions.
I will see if the Go implementation reacts in a similar way. Thanks for taking the time to look into this.
For your information, following this investigation, we just updated the double_fast heuristic in a subtle way, in https://github.com/facebook/zstd/pull/1681 .
As can be guessed, it also has a subtle impact, but at least this impact seems generally positive.
@Cyan4973 Thanks for the update!
I did my own tests, and got some size regressions, so inspired by your changes I ended up doing post-match indexing like this:
- start+1 (long)
- start+2 (short)
- end-2 (long)
- end-1 (short)

For my test set that resulted in:
| File | Throughput | % Smaller (of compressed) |
|-------------------|-----------|-----------|
| silesia.tar | 1.02x | 0.18% |
| consensus.db.10gb | 1.03x | 0.01% |
| enwik9 | 1.01x | 0.04% |
| gob-stream | 1.01x | 0.24% |
| 10gb.tar | 0.99x | -0.04% |
| adresser.json | 1.00x | 1.41% |
Thanks for feedback, @klauspost !
Interesting, I also tested this combination, but it felt no better than #1681.
Maybe it was a question of using a different corpus ...
It might be because I have slightly different settings: I don't have the same flexibility and must use fixed table sizes, so I tried the different combinations to find the overall best. That is likely why there are differences.
The short hash table is 15 bits, the long table 17 bits. The short hash is 5 bytes.