Platform:
Windows 10
Visual Studio 2010
Intel Compiler 2013
Result:
Some streams blow-up on decompress
Cause:
Incorrect results leading to a blow-up are produced by bitstream.h when doing an O2-optimized compile:
MEM_STATIC unsigned BIT_highbit32 (U32 val)
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r=0;
        _BitScanReverse ( &r, val );
        return (unsigned) r;
        ...
}
_BitScanReverse returns success (!=0) or failure (0). The Microsoft definition is:
__success(return!=0)
BOOLEAN
_BitScanReverse (
    __out DWORD *Index,
    __in  DWORD Mask
    );
The zstd code assumed that 'r' will be left at zero when the call fails. However, I don't think the intrinsic is defined that way. The 'r' value is likely undefined on failure. Certainly, the Intel Compiler is making this assumption during its optimization at level O2 (O1 or below don't have the bug).
This could be an Intel Compiler bug, but it is more likely a bug in the zstd code assuming that 'r' has a valid value on failure.
This code stops giving the wrong value when changed to:
MEM_STATIC unsigned BIT_highbit32 (U32 val)
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r;
        return (_BitScanReverse ( &r, val )) ? (unsigned)r : 0;
        ...
}
The following Microsoft intrinsics are used in zstd:
_BitScanForward
_BitScanReverse
_BitScanForward64
_BitScanReverse64
All of these calls will potentially be affected by this bug, and all cases should be fixed.
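For illustration, the same defensive pattern can be applied to the forward-scan case as well (a sketch only; checked_bsf is a made-up helper name, not an actual zstd function):

#include <intrin.h>

/* Hypothetical helper, illustration only: only trust 'r' when the
 * intrinsic reports success, since its value is not guaranteed otherwise. */
static unsigned checked_bsf(unsigned long mask)
{
    unsigned long r;
    return _BitScanForward(&r, mask) ? (unsigned)r : 0;
}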
Does the proposed new code impact performance?
I believe it is unlikely to affect performance. These methods are called when building tables, not during compression or decompression.
There is an extra if-statement, and one less assignment, though depending on how the intrinsic is defined, the optimized result may be the same.
I doubt I will be able to measure a difference, but I will run my own tests on a Microsoft build (I can't compare Intel, because it blows up without the fix) and let you know.
I have done tests on compression and decompression using 1MB, 100KB and 10KB file sizes, at level=1, both with and without context re-use.
I ran with statistics, 100 times, discarding the top and bottom 10% of runs. I did this a number of times just to be sure of the results. The build was x64 at /O2.
I see no effective difference in timings. If there is, it is definitely less than 0.5% (which is around the standard deviation of my quieter test runs). Compression could be slower, but if so, it is less than 0.3%. I see decompression as identical.
Here are my fixes that I suggest should be put into zstd:
bitstream.h
MEM_STATIC unsigned BIT_highbit32 (U32 val)
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r;
        return (_BitScanReverse ( &r, val )) ? (unsigned)r : 0;
        ...
zstd_internal.h
MEM_STATIC U32 ZSTD_highbit32(U32 val)   /* compress, dictBuilder, decodeCorpus */
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r;
        return (_BitScanReverse ( &r, val )) ? (unsigned)r : 0;
        ...
and there are 4 sections changed in
zstd_compress_internal.h
static unsigned ZSTD_NbCommonBytes (size_t val)
{
    if (MEM_isLittleEndian()) {
        if (MEM_64bits()) {
# if defined(_MSC_VER) && defined(_WIN64)
            unsigned long r;
            return (_BitScanForward64( &r, (U64)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 4)
            return (__builtin_ctzll((U64)val) >> 3);
# else
            static const int DeBruijnBytePos[64] = { 0, 0, 0, 0, 0, 1, 1, 2,
                                                     0, 3, 1, 3, 1, 4, 2, 7,
                                                     0, 2, 3, 6, 1, 5, 3, 5,
                                                     1, 3, 4, 4, 2, 5, 6, 7,
                                                     7, 0, 1, 2, 3, 3, 4, 6,
                                                     2, 6, 5, 5, 3, 4, 5, 6,
                                                     7, 1, 2, 4, 6, 4, 4, 5,
                                                     7, 2, 6, 5, 7, 6, 7, 7 };
            return DeBruijnBytePos[((U64)((val & -(long long)val) * 0x0218A392CDABBD3FULL)) >> 58];
# endif
        } else { /* 32 bits */
# if defined(_MSC_VER)
            unsigned long r;
            return (_BitScanForward( &r, (U32)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 3)
            return (__builtin_ctz((U32)val) >> 3);
# else
            static const int DeBruijnBytePos[32] = { 0, 0, 3, 0, 3, 1, 3, 0,
                                                     3, 2, 2, 1, 3, 2, 0, 1,
                                                     3, 3, 1, 2, 2, 2, 2, 0,
                                                     3, 1, 2, 0, 1, 0, 1, 1 };
            return DeBruijnBytePos[((U32)((val & -(S32)val) * 0x077CB531U)) >> 27];
# endif
        }
    } else {  /* Big Endian CPU */
        if (MEM_64bits()) {
# if defined(_MSC_VER) && defined(_WIN64)
            unsigned long r;
            return (_BitScanReverse64( &r, (unsigned long)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 4)
            return (__builtin_clzll(val) >> 3);
# else
            unsigned r;
            const unsigned n32 = sizeof(size_t)*4;   /* calculate this way due to compiler complaining in 32-bits mode */
            if (!(val>>n32)) { r=4; } else { r=0; val>>=n32; }
            if (!(val>>16)) { r+=2; val>>=8; } else { val>>=24; }
            r += (!val);
            return r;
# endif
        } else { /* 32 bits */
# if defined(_MSC_VER)
            unsigned long r;
            return (_BitScanReverse( &r, (unsigned long)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 3)
            return (__builtin_clz((U32)val) >> 3);
# else
            unsigned r;
            if (!(val>>16)) { r=2; val>>=8; } else { r=0; val>>=24; }
            r += (!val);
            return r;
# endif
        }
    }
}
By the way, the same bug with _BitScanReverse has been seen by others, and is explained in this post:
Isn't all the "Big Endian CPU" stuff wrong in the existing source?
https://github.com/facebook/zstd/blob/957d59c72105fdb074cdca7b00f5b885e820a015/lib/compress/zstd_compress_internal.h#L485
__builtin_clz returns a count of leading zero bits while _BitScanReverse returns the bit index (clz ^ 31).
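(A minimal sketch of that relationship, assuming a non-zero 32-bit input:)

/* For non-zero v, the index of the highest set bit is 31 minus the
 * leading-zero count, i.e. __builtin_clz(v) ^ 31 for 32-bit values. */
unsigned highbit_via_clz(unsigned v)      /* gcc/clang */
{
    return (unsigned)__builtin_clz(v) ^ 31;
}
/* On MSVC, _BitScanReverse(&r, v) stores that same index directly in r. */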
Switching _BitScanForward with _BitScanReverse based on endianness seems logically incorrect... are we working with raw memory here? (I haven't looked through the code...)
@aqrit I think that is a separate question from this issue.
EDIT: while you can't just change _BitScanForward to _BitScanReverse when you swap the byte endianness, because these are bitwise operations, the ZSTD_NbCommonBytes() method is finding the position of the first non-zero byte, so it should work. It reads a byte stream (into a long or long-long) and returns the offset of the first non-zero byte, else zero, though it can't distinguish between something in byte 0 and nothing in any byte. I assume it is tested and correct on little-endian at least.
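To make that concrete, here is a small worked example (a sketch with made-up values, assuming gcc builtins; as I understand it, the value scanned is essentially the XOR of the two words being compared):

/* Illustration with made-up values: bytes 0 and 1 match, byte 2 differs (0x5A). */
void first_diff_example(void)
{
    unsigned long long diff_le = 0x00000000005A0000ULL; /* little-endian load: byte 0 in the low bits  */
    unsigned long long diff_be = 0x00005A0000000000ULL; /* big-endian load:    byte 0 in the high bits */

    unsigned first_le = (unsigned)(__builtin_ctzll(diff_le) >> 3);   /* trailing zeros -> 2 */
    unsigned first_be = (unsigned)(__builtin_clzll(diff_be) >> 3);   /* leading zeros  -> 2 */

    (void)first_le; (void)first_be;
}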
Is it even possible to get a Microsoft big-endian machine? If not, the code is a bit untestable.
It would be nice if the bug in this case could be fixed, though. Not looking at these return codes is wrong, even if it accidentally works on some compilers.
There are shortcuts for this problem: using bsr to find a byte index.
One can guarantee the value of zero is never passed by setting the low bit.
However, it looks like that whole section of code should be deleted.
#if HAVE_FAST_CLZ && defined(__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4))
# define HIGHEST_BYTE_INDEX(v) \
v = (((uint32_t)__builtin_clz(v | 1) ^ 31) >> 3)
#else
# define HIGHEST_BYTE_INDEX(v) \
v >>= 8; \
v = ((((v + 0xFFFF00) | 0xFFFFFF) + v) | (v + 0xFF0000)) >> 24
#endif
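(A quick check of the macro with a made-up value, for illustration, assuming <stdint.h> and the HIGHEST_BYTE_INDEX macro above:)

uint32_t v = 0x00012300;   /* highest non-zero byte is byte 2 */
HIGHEST_BYTE_INDEX(v);     /* either branch leaves v == 2 */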
Is there a way to test / experiment with the Intel compiler blow-up effect ? Preferably in some kind of open-source test environment such as TravisCI ?
This would make it possible to continuously ensure that the source code compiles correctly with this compiler.
The Intel compiler is a licensed product. I'm not sure how you can test it. I also had problems with a header compiling on Linux with the Intel compiler (but no code problems).
However, this bug is valid no matter what the compiler (though it only shows up with Intel at the moment).
You can use it for free
The gcc equivalent intrinsics all suffer from the same undefined behavior.
This is understood, I hope?
@bimbashrestha
@aqrit
you are correct!
Though, some systems may patch it to not be undefined? Not all documentation mentions the undefined behaviour, though the official gcc docs seem to.
I will look into fixing that also, for my code. I was experiencing an issue on CentOS 8, though this might not be the cause.
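For example, one way to guard the gcc builtins (a hypothetical wrapper, illustration only; __builtin_ctz is documented as undefined for a zero argument):

static unsigned safe_ctz32(unsigned v)
{
    /* __builtin_ctz(0) is undefined, so handle zero explicitly. */
    return v ? (unsigned)__builtin_ctz(v) : 0;
}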
ZSTD_highbit32(U32 val) and BIT_highbit32(U32 val) both assert(val != 0). So is this an ICC bug, ZSTD bug, or invalid bug?
If it is an ICC bug then this should be noted in the source code and the conditional should be guarded by a more specific #ifdef such that MSVC/Clang/etc. are not affected by the new slower code.
If it is a ZSTD bug then the call sites that passed zero should be examined for correctness.
If zero is indeed valid then the asserts should be deleted and the GCC and IAR intrinsics need to be fixed.
If the bug is invalid then the patch should be rolled back and audioMirror's issue needs to be further examined.
ZSTD_NbCommonBytes(size_t val) is a helper function for ZSTD_count(...). It seems impossible to pass a value of zero to ZSTD_NbCommonBytes from the current call sites. Either an assert(val != 0) should be added, or the return value -- if passed zero -- needs to be determined. Should it return sizeof(size_t), sizeof(size_t) - 1, or 0? Currently, it may return garbage for a value of zero.
However, in my humble opinion, ZSTD_NbCommonBytes should be deleted and
ZSTD_count() should be simplified:
MEM_STATIC size_t ZSTD_count(const BYTE* pIn, const BYTE* pMatch, const BYTE* const pInLimit)
{
    size_t i = 0;
    while ((&pIn[i] < pInLimit) && (pIn[i] == pMatch[i])) i++;
    return i;
}
I'm worried that this issue was not understood based on the patch that was landed.
Could one of the project's contributors acknowledge that this comment has been viewed? @Cyan4973 et al.
It is true that there is an assert() to check that zero is not sent, during debug runs. I can only tell you that zero absolutely did get sent on some streams, under the Intel compiler (optimized builds only). When zero was sent, the old code blew up, and with the fix, it ran fine.
Was this a compiler bug, an un-initialized variable, a hole in the algorithm, or a corruption of the tables? I'm not sure. I just know it happened.
IMO, protecting the _BitScanForwardXX calls and the equivalent Linux methods (which weren't covered in this issue) is the right thing to do, provided that it doesn't hit performance (and they and I have proved that it doesn't). There is no point in having a "blow me up now" path on certain calls. But, yes, there may be another underlying bug lingering somewhere.
Indeed, ZSTD_NbCommonBytes() should never be invoked with a value of 0. That's why it's asserted : it should never happen.
If there is a way to generate this situation, we would love to get a reproduction test and analyze it.
It does not look easy to produce : the number of invocations is small (only 2), and they are both protected by a test checking the value.
However, in my humble opinion (...)
ZSTD_count() should be simplified:
The compression speed would be seriously impacted if the code was using a naive byte-by-byte comparison method instead.
For reference, comparing ZSTD_count() with gcc-9 on an Intel i9-9900K with a byte-by-byte implementation we get these compression speeds:
| Level | NbCommonBytes | Byte-by-Byte |
|---|---|---|
| 1 | 350 MB/s | 314 MB/s (-10%) |
| 2 | 261 MB/s | 241 MB/s (-8%) |
| 3 | 187 MB/s | 176 MB/s (-6%) |
protecting (bsf...) provided that it doesn't hit performance (and they and I have proved that it doesn't)
The benchmark in the PR didn't test the change that was introduced: neither GCC nor Clang uses the path with the MSVC intrinsic, though it may still be performance-neutral as claimed.
ZSTD_count() with a byte-by-byte implementation
This function should see gains from auto-vectorization (if enabled).
ZSTD_count() is just memcmp() that returns an offset, the current function could be beaten on all platforms. Note: the current function already relies on fast unaligned memory access.
ZSTD_count() is just memcmp() that returns an offset
I agree. Unfortunately, that's enough of a difference to make it _not_ a memcmp() invocation.
So it has to be programmed.
This function should see gains from auto-vectorization (if enabled).
I hear that family of arguments from time to time.
I get it that a sufficiently advanced compiler could figure out the logic and provide an optimized variant of its own. But will it ?
Unfortunately, in many cases, it is wishful thinking.
the current function could be beaten on all platforms
Please measure it first. Time and time again, this has proven wrong.
And @terrelln just proved it again merely one message above.
There's also a stark difference between software developed and presumed to run in a controlled environment, where the impact of the compiler can be accurately determined,
versus a widely distributed open-source project, intended to be integrated into a myriad of different environments. In which case, it's just not possible to expect _all_ development configurations to rely on an impressive set of clever tricks that one best-of-a-kind compiler may achieve locally.
So yes, we have code-optimized variants of simple functions, and experience shows it helps.
That being said, whenever a compiler can do a better job, we are keen to let it express itself. We can always use #ifdef macro selectors. So if one can prove that the byte-by-byte variant is actually faster on a detectable set of compiler, version and environment, we may simply create an #ifdef section for it, provided that the gains are significant enough (+0.5% will definitely not cut it; an additional piece of code incurs a maintenance cost, it must be matched by corresponding benefits).
The benchmark in the PR didn't test the change that was introduced: neither GCC nor Clang uses the path with the MSVC intrinsic,
The PR #2031 states that the benchmark is run on a "windows vm, visual 2019".
This would test the MSVC intrinsic code path.
Additional details can be provided by @bimbashrestha.
_edit_ : indeed, the column titles do not seem to specify a valid MSVC configuration, so I guess additional feedback from @bimbashrestha would be welcomed.
Looks like I misunderstood the original issue. The error was reproduced on my Windows VM, but the benchmarks were not done on a Windows VM.
I mistakenly thought the code path was triggered by gcc and clang on non-Windows too (hence the benchmark). Sorry for the confusion everyone.
So the benchmarks on that pr are meaningless
Could this be run again on a Windows VM using MSVC compiler ?
The assumption is that the code change will not impact speed in any meaningful way.
Possibly a better outcome : it may even happen that the assembly generated by the compiler ends up being rigorously identical, since an input of 0 (zero) is not possible, something that the compiler might be able to understand. This could be checked with md5sum for example (assuming no trace of the source code is bundled into the binary).
This function should see gains from auto-vectorization (if enabled).
The benchmark I ran was compiled with gcc -O3 which has auto-vectorization enabled, but is still much worse than the hand-written version.
We could always code performance-critical parts with compiler intrinsics.
We could always code performance-critical parts with compiler intrinsics.
That is what ZSTD_count() is doing. It is using __builtin_clz() to efficiently count the number of matching bytes.
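In essence, the idea is word-at-a-time matching (a simplified sketch, not zstd's actual implementation; it assumes a little-endian machine and uses memcpy for unaligned-safe loads):

#include <stddef.h>
#include <string.h>

/* Count matching leading bytes of a[] and b[], illustration only. */
static size_t count_common_bytes(const unsigned char* a, const unsigned char* b, size_t n)
{
    size_t i = 0;
    while (i + 8 <= n) {
        unsigned long long wa, wb;
        memcpy(&wa, a + i, 8);
        memcpy(&wb, b + i, 8);
        unsigned long long diff = wa ^ wb;
        if (diff)   /* first mismatching byte is inside this word */
            return i + (size_t)(__builtin_ctzll(diff) >> 3);   /* little-endian */
        i += 8;
    }
    while (i < n && a[i] == b[i]) i++;   /* byte-by-byte tail */
    return i;
}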
FWIW,
ZSTD_count w/avx2 intrinsics is worth about 1.75% on the low compression levels.
https://gist.github.com/aqrit/f83a5004110cf5caee4839566cfa5fb8