I wrote pyzstd, a zstd module for Python. It has a "rich memory mode" in which the output buffer size is provided by the ZSTD_compressBound() function.
The documentation says this should be faster, but I observed that when the code is compiled with MSVC2019, "rich memory mode" is actually slower.
This is a test program; it tests two output buffer sizes:
- ZSTD_compressBound(srcSize) - 1
- ZSTD_compressBound(srcSize)
On Windows 10 64-bit, MSVC2019:
read 545863680 bytes.
ZSTD_compressBound() value: 547995960
ZSTD_compressBound() - 1
output size: 286842128, time: 9270 ms
ZSTD_compressBound()
output size: 286842126, time: 14669 ms
With an output buffer of exactly ZSTD_compressBound() bytes, compression is a lot slower.
FYI, compressing the same file using the pyzstd module (compiled with MSVC2019):
- ZSTD_compressBound() - 1 consumes 4.9 seconds
- ZSTD_compressBound() consumes 6.9 seconds
While compiled with GCC 9.3.0, the ZSTD_compressBound() output buffer size is faster, as expected:
On Windows 10 64-bit, Cygwin64 (gcc 9.3.0):
ZSTD_compressBound() - 1
output size: 286842128, time: 4530 ms
ZSTD_compressBound()
output size: 286842126, time: 4376 ms
On Windows 10 WSL2, gcc 9.3.0-17ubuntu1~20.04:
ZSTD_compressBound() - 1
output size: 286842128, time: 4578125 ms (apparently microseconds)
ZSTD_compressBound()
output size: 286842126, time: 4390625 ms (apparently microseconds)
With #define DEBUGLEVEL 5, I compared the output of the two builds (MSVC2019 / Cygwin64); they are exactly the same except for memory addresses.
So maybe it's not a problem in the zstd code itself.
compress\zstd_compress.c: ZSTD_compressStream2, endOp=2
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3)
compress\zstd_compress.c: ZSTD_compressStream2 : transparent init stage
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3)
compress\zstd_compress.c: ZSTD_resetCStream_internal
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3)
compress\zstd_compress.c: ZSTD_compressBegin_internal: wlog=21
compress\zstd_compress.c: ZSTD_resetCCtx_internal: pledgedSrcSize=545863680, wlog=21
compress\zstd_compress.c: chainSize: 65536 - hSize: 131072 - h3Size: 0
compress\zstd_compress.c: Need 3567KB workspace, including 768KB for match state, and 2304KB for buffers
compress\zstd_compress.c: windowSize: 2097152 - blockSize: 131072
compress\zstd_compress.c: Resize workspaceSize from 0KB to 3567KB
...
With #define DEBUGLEVEL 5, I added a microsecond timestamp to DEBUGLOG.
It seems the time between these two log points differs significantly:
- compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic
- compress/zstd_compress.c : ZSTD_compressSequences_internal (nbSeq=dddd)
Some examples:
msvc gcc
1679 909 (microseconds)
1358 800
1290 791
1490 826
1560 880
1461 847
...
With #define DEBUGLEVEL 6, there are a lot of Cpos lines between the above two:
1 flib/compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic
4 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048576 : 2 literals, match 6 bytes at offCode 71959
4 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048584 : 2 literals, match 5 bytes at offCode 88375
3 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048591 : 0 literals, match 7 bytes at offCode 223824
9 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048598 : 24 literals, match 14 bytes at offCode 89233
2 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048636 : 0 literals, match 8 bytes at offCode 112
...(Cpos has more than ten thousand lines)
3 flib/compress/zstd_compress.c : ZSTD_compressSequences_internal (nbSeq=11973)
Sorry, I made a mistake; this is not solved yet.
It seems the time between these two has significant difference:
If you are testing compression speed at the default compression level, this is expected:
ZSTD_compressBlock_doubleFast_generic is where the match finding happens
ZSTD_compressSequences_internal is where the encoding happens
In terms of relative importance, ZSTD_compressBlock_doubleFast_generic is expected to use significantly more time than ZSTD_compressSequences_internal (likely ~3x).
When analyzing the performance differences between gcc and msvc, these are indeed prime targets: they likely concentrate > 90% of CPU time.
If #define DEBUGLEVEL 6, there are a lot of Cpos line
These are the matches found by ZSTD_compressBlock_doubleFast_generic.
I would expect this list to be identical between gcc and msvc,
otherwise, it would lead to different compressed results.
With #define DEBUGLEVEL 5, I compared the output of the two builds (MSVC2019 / Cygwin64); they are exactly the same except for memory addresses.
This is expected.
Differences in memory addresses shouldn't lead to any performance difference.
So far, the guess is that the speed difference probably comes primarily from the matchfinder.
Something like a bad constant propagation, which would not allow the compiler to properly optimize a specific instance, thus leaving more branches than necessary.
Just a guess though, we don't use MSVC, so we are lacking first hand debugging capabilities.
I compared the assembly code generated by MSVC and GCC; the ZSTD_hashPtr function was not inlined by MSVC.
This patch fixes the problem:
lib/compress/zstd_compress_internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/compress/zstd_compress_internal.h b/lib/compress/zstd_compress_internal.h
index db73f6c..52eea30 100644
--- a/lib/compress/zstd_compress_internal.h
+++ b/lib/compress/zstd_compress_internal.h
@@ -626,7 +626,7 @@ static const U64 prime8bytes = 0xCF1BBCDCB7A56463ULL;
static size_t ZSTD_hash8(U64 u, U32 h) { return (size_t)(((u) * prime8bytes) >> (64-h)) ; }
static size_t ZSTD_hash8Ptr(const void* p, U32 h) { return ZSTD_hash8(MEM_readLE64(p), h); }
-MEM_STATIC size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
+FORCE_INLINE_TEMPLATE size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
{
switch(mls)
{
Compressing the file mentioned above, with an output buffer size of ZSTD_compressBound():
before after
test code Win10 14.6s 11.9s (32-bit build)
test code WSL2 4.39s ---- (64-bit build)
pyzstd Win10 6.9s 4.86s (64-bit build)
pyzstd WSL2 4.6s ---- (64-bit build)
I'm not going to submit a PR, since there are still some MEM_STATIC functions in zstd_compress_internal.h; please see what else needs to be changed.
FYI, assembly code of lib/compress/zstd_double_fast.c generated by MSVC:
https://github.com/animalize/misc/blob/master/zstd_double_fast.asm
I went ahead and resurrected an old file system with Windows 10 installed on a home computer.
It contains both a version of Visual Studio and MinGW64.
Not exactly "up to date", but not horribly out-of-date either :
this is Visual Studio 2017, vs gcc v9.1.0.
First thing : compare raw performance of these compilers, using zstd internal benchmark.
Gosh, the performance difference sure is huge.
To give a single data point, at level 3 (default), VS2017 compression speed is 140 MB/s, while gcc9.1 is 200 MB/s. That's way too large.
Decompression speed difference is even worse : VS2017 offers ~620 MB/s, while gcc9.1 reaches ... 1100 MB/s. Almost 2x faster.
So there is a lot of room to improve on MSVC.
I then tested your suggested change for ZSTD_hashPtr().
It works great, and produces nice benefits for MSVC.
Level 3 is the one offering the best benefit, moving up from 140 to 180 MB/s. Still slower than gcc (200 MB/s), but a lot closer, and firmly within the "acceptable" range.
Level 1 gets some benefit too, but it's less impressive: from 280 to 310 MB/s, compared to gcc's 380 MB/s.
Starting at level 5, there may be some minor benefit, but it's difficult to distinguish from the noise. And there sure is a lot of noise on this Windows platform.
I also checked that performance change on gcc is basically neutral.
Overall, this seems like a positive change, at worst neutral, so I believe it could be integrated into zstd "as is", while further improvements could be provided later on.
there are still some MEM_STATIC functions in file zstd_compress_internal.h
Things are not as simple as just transferring all MEM_STATIC into FORCE_INLINE_TEMPLATE.
In many cases, such a change doesn't improve performance, or even reduce it.
In all cases, it negatively impacts binary size (or at best is neutral), with corresponding impact on instruction cache.
In theory, we are supposed to "trust" the compiler to make the correct choice regarding inlining.
They are supposed to know the cost and benefit of this operation, which varies depending on other choices made by the compiler (inlining might be a good choice for one compiler, and a bad one for another compiler).
In practice, we know this is not true, and compilers can be especially bad at inlining large functions which are drastically simplified thanks to constant propagation, even though it's a major performance boost.
Small functions like ZSTD_hashPtr(), though, are supposedly easier to evaluate.
But then, it's MSVC, maybe it requires more manual guiding.
Also, regarding the issue at hand (the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end)):
I cannot reproduce the reported issue.
On my test system (using MSVC2017), using ZSTD_compressBound() correctly leads to slightly improved performance, corresponding to ZSTD_compress2() on the same data with same parameters, while ZSTD_compressBound()-1 leads to slightly slower performance (~-5%) due to the need to employ intermediate buffers to guarantee compression success.
Tried multiple times, multiple sources and buffer sizes, this is consistent with expectation.
However, it doesn't match the first report in this issue.
I can't yet explain this difference in observation. Could it be specific to MSVC2019? Or to the way the workload is measured?
I uploaded the assembly code of zstd v1.4.5 generated by MSVC2019; hope it helps you check the other MEM_STATIC functions:
https://github.com/animalize/zstd_asm
regarding the issue at hand, aka the difference in performance between
dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end)
I cannot reproduce the reported issue.
Could it be because you tested after applying the patch?
Before applying the patch, I also got a ~5% difference when compiled with GCC.
Cygwin64 (gcc 9.3.0):
ZSTD_compressBound() - 1
output size: 286842128, time: 4530 ms
ZSTD_compressBound()
output size: 286842126, time: 4376 ms
On Windows 10 WSL2, Ubuntu 9.3.0-17ubuntu1~20.04:
ZSTD_compressBound() - 1
output size: 286842128, time: 4578125 ms (seems microsecond)
ZSTD_compressBound()
output size: 286842126, time: 4390625 ms (seems microsecond)
Moreover, I found the Win10 build is still slower than the WSL2 build (4.85s vs 4.58s), so maybe other spots can be improved as well.
before after
pyzstd Win10 6.9s 4.85s
pyzstd WSL2 4.58s ----
(best of 5 results)
Could it be because you tested after applying the patch?
I tried both with and without the patch, and compiled with MSVC2017.
The speed difference direction remains the same, aka dstCapacity == ZSTD_compressBound() is faster than dstCapacity == ZSTD_compressBound() -1.
The delta is ~4% with MEM_STATIC, and reduced to ~3% with FORCE_INLINE_TEMPLATE, so still roughly equivalent.
Could you have a look at the assembly code of the ZSTD_compressBlock_doubleFast_generic function generated by MSVC2017? With MSVC2019, the ZSTD_hashPtr() function was not inlined:
https://github.com/animalize/zstd_asm/blob/master/v1.4.5_msvc2019/zstd_double_fast.asm#L557
With MEM_STATIC, it does call ZSTD_hashPtr.
There's no question regarding the FORCE_INLINE_ATTR impact : it does inline ZSTD_hashPtr.
My observation is solely focused on the impact of dstCapacity, and it's roughly the same whether FORCE_INLINE_ATTR is present or not.
Does it execute a different code path?
What I tested was a 500 MB input file with all parameters at their defaults (unset); compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic was the low-performance code being executed.
If I add FORCE_INLINE_ATTR to the ZSTD_getLowestPrefixIndex() function, it speeds up a little:
(edit: sorry, this was a mistake.)
4.83s -> 4.76s
(best of 5 results)
ZSTD_getLowestPrefixIndex() function:
/**
* Returns the lowest allowed match index in the prefix.
*/
MEM_STATIC FORCE_INLINE_ATTR
U32 ZSTD_getLowestPrefixIndex(const ZSTD_matchState_t* ms, U32 current, unsigned windowLog)
{
U32 const maxDistance = 1U << windowLog;
U32 const lowestValid = ms->window.dictLimit;
U32 const withinWindow = (current - lowestValid > maxDistance) ? current - maxDistance : lowestValid;
U32 const isDictionary = (ms->loadedDictEnd != 0);
U32 const matchLowest = isDictionary ? lowestValid : withinWindow;
return matchLowest;
}
I don't know if it will slow things down for other compilers.
Maybe adding an MSVC_FORCE_INLINE_ATTR macro would reduce the uncertainty?
It looks to me that it should be possible to simplify this function.
That being said, it's only invoked once per block;
therefore, it should not be in a position to measurably impact performance.
It's faster almost every time, so it's probably not just noise.
(edit, sorry, this was a mistake, adding the attr to ZSTD_getLowestPrefixIndex() has no effect.)
There are dozens of MEM_STATIC functions; maybe FORCE_INLINE_ATTR can be added to some of them, in the hope that the MSVC build's performance gets closer to the GCC build's.
(edit: I gradually added FORCE_INLINE_ATTR to MEM_STATIC functions; the performance did not change significantly.)
Only adding FORCE_INLINE_ATTR to the ZSTD_hashPtr() function
(output buffer size is ZSTD_compressBound()):
before after
test code 32bit 14.6s 11.9s
test code 64bit 6.9s 4.84s
test code WSL2 4.39s ----
(best of ~5 results)
Six MEM_readLE16() calls were not inlined in huf_decompress.c; this is presumably not expected:
https://github.com/animalize/zstd_asm/blob/master/v1.4.5_msvc2019/huf_decompress.asm#L6499
MEM_STATIC U16 MEM_readLE16(const void* memPtr)
{
if (MEM_isLittleEndian())
return MEM_read16(memPtr);
else {
const BYTE* p = (const BYTE*)memPtr;
return (U16)(p[0] + (p[1]<<8));
}
}
The MEM_readLE16() function was compiled to the following assembly code; MSVC is not very smart in this case.
; File E:\dev\pyzstd\lib\common\mem.h
_TEXT SEGMENT
memPtr$ = 8
MEM_readLE16 PROC
; 320 : if (MEM_isLittleEndian())
; 321 : return MEM_read16(memPtr);
movzx eax, WORD PTR [rcx]
; 322 : else {
; 323 : const BYTE* p = (const BYTE*)memPtr;
; 324 : return (U16)(p[0] + (p[1]<<8));
; 325 : }
; 326 : }
ret 0
MEM_readLE16 ENDP
_TEXT ENDS
Not sure if this helps with your quest to close the performance gap, but Visual Studio 2019 has a new optimization option that might be useful...
See /Ob3, which turns on aggressive inlining...
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion?view=vs-2019
Thanks for this information.
It works: just turning on this option, with no code changes, goes from 6.97s to 4.77s.
It looks to me that inlining ZSTD_hashPtr() produces the bulk of the benefits,
so it's the one change we should make to the source code,
and that's basically the case now, with the merging of @animalize 's PR.
This /Ob3 compilation flag is a pretty good discovery,
thanks for suggesting it @jpapp05.
Using it, instead of the more classical /Ob2, could be done by updating the cmake and visual scripts.
However, I'm concerned by the impact on older versions of Visual Studio.
If they don't support /Ob3 directly, but it becomes equivalent to /Ob2 on such older versions, that's fine.
But if compilation fails outright, that's more concerning,
and would require a visual studio version selector.
It looks to me that inlining ZSTD_hashPtr() produces the bulk of the benefits,
so it's the one change we should make to the source code
In some ZSTD_hashPtr() calls, the third argument can't be known at compile time, so forcing inlining may have a slightly negative effect in those cases:
https://github.com/facebook/zstd/blob/a880ca239b447968493dd2fed3850e766d6305cc/lib/compress/zstd_fast.c#L30-L41
Remember that zstd v1.4.6 will be released in the near future; maybe this can be carefully improved after that.
Some ZSTD_hashPtr() definitions don't use the inline keyword; please have a look.
(edit: in /contrib/linux-kernel/lib/zstd/compress.c, the missing inline keyword seems OK.)
Also, regarding the issue at hand, aka the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end): I cannot reproduce the reported issue.
Maybe it's due to MSVC2017; when using MSVC2017 I also get this result:
MSVC2017 MSVC2019
ZSTD_compressBound() 6.4 sec 6.9 sec
ZSTD_compressBound()-1 7.0 sec 5.0 sec
(zstd v1.4.5, no code changed.)
When I have time, I will write a script to find un-inlined functions in MSVC builds; maybe the performance can be improved a little more.
I will write a script to find uninlined functions in MSVC
I'm doing this now; I hope it can be completed in two or three weeks. A good pastime. :)