Platform:
Windows 10
Visual Studio 2010
Intel Compiler 2013
Result:
Some streams blow-up on decompress
Cause:
Incorrect results leading to a blow-up are produced by bitstream.h when doing an O2-optimized compile:
MEM_STATIC unsigned BIT_highbit32 (U32 val)
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r=0;
        _BitScanReverse ( &r, val );
        return (unsigned) r;
        ...
}
_BitScanReverse returns success (!=0) or failure (0). The Microsoft definition is:
__success(return!=0)
BOOLEAN
_BitScanReverse (
    __out DWORD *Index,
    __in  DWORD Mask
    );
The zstd code assumed that 'r' will be left at zero when the call fails. However, I don't think the intrinsic is defined that way. The 'r' value is likely undefined on failure. Certainly, the Intel Compiler is making this assumption during its optimization at level O2 (O1 or below don't have the bug).
This could be an Intel Compiler bug, but it is more likely a bug in the zstd code assuming that 'r' has a valid value on failure.
This code stops giving the wrong value when changed to:
MEM_STATIC unsigned BIT_highbit32 (U32 val)
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r;
        return (_BitScanReverse ( &r, val )) ? (unsigned)r : 0;
        ...
}
The following Microsoft intrinsics are used in zstd:
_BitScanForward
_BitScanReverse
_BitScanForward64
_BitScanReverse64
All of these calls will potentially be affected by this bug, and all cases should be fixed.
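For illustration, the same defensive pattern can be applied to the forward-scan case as well (a sketch only; checked_bsf is a made-up helper name, not an actual zstd function):

#include <intrin.h>

/* Hypothetical helper, illustration only: only trust 'r' when the
 * intrinsic reports success, since its value is not guaranteed otherwise. */
static unsigned checked_bsf(unsigned long mask)
{
    unsigned long r;
    return _BitScanForward(&r, mask) ? (unsigned)r : 0;
}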
Does the proposed new code impact performance?
I believe it is unlikely to affect performance. These methods are called when building tables, not during compression or decompression.
There is an extra if-statement, and one less assignment, though depending on how the intrinsic is defined, the optimized result may be the same.
I doubt I will be able to measure a difference, but I will run my own tests on a Microsoft build (I can't compare Intel, because it blows up without the fix) and let you know.
I have done tests on compression and decompression using 1MB, 100KB and 10KB file sizes, at level=1, both with and without context re-use.
I ran with statistics, 100 times, discarding the top and bottom 10% of runs. I did this a number of times just to be sure of the results. The build was x64 at /O2.
I see no effective difference in timings. If there is, it is definitely less than 0.5% (which is around the standard deviation of my quieter test runs). Compression could be slower, but if so, it is less than 0.3%. I see decompression as identical.
Here are my fixes that I suggest should be put into zstd:
bitstream.h
MEM_STATIC unsigned BIT_highbit32 (U32 val)
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r;
        return (_BitScanReverse ( &r, val )) ? (unsigned)r : 0;
        ...
zstd_internal.h
MEM_STATIC U32 ZSTD_highbit32(U32 val)   /* compress, dictBuilder, decodeCorpus */
{
    assert(val != 0);
    {
# if defined(_MSC_VER)   /* Visual */
        unsigned long r;
        return (_BitScanReverse ( &r, val )) ? (unsigned)r : 0;
        ...
and there are 4 sections changed in
zstd_compress_internal.h
static unsigned ZSTD_NbCommonBytes (size_t val)
{
    if (MEM_isLittleEndian()) {
        if (MEM_64bits()) {
# if defined(_MSC_VER) && defined(_WIN64)
            unsigned long r;
            return (_BitScanForward64( &r, (U64)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 4)
            return (__builtin_ctzll((U64)val) >> 3);
# else
            static const int DeBruijnBytePos[64] = { 0, 0, 0, 0, 0, 1, 1, 2,
                                                     0, 3, 1, 3, 1, 4, 2, 7,
                                                     0, 2, 3, 6, 1, 5, 3, 5,
                                                     1, 3, 4, 4, 2, 5, 6, 7,
                                                     7, 0, 1, 2, 3, 3, 4, 6,
                                                     2, 6, 5, 5, 3, 4, 5, 6,
                                                     7, 1, 2, 4, 6, 4, 4, 5,
                                                     7, 2, 6, 5, 7, 6, 7, 7 };
            return DeBruijnBytePos[((U64)((val & -(long long)val) * 0x0218A392CDABBD3FULL)) >> 58];
# endif
        } else { /* 32 bits */
# if defined(_MSC_VER)
            unsigned long r;
            return (_BitScanForward( &r, (U32)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 3)
            return (__builtin_ctz((U32)val) >> 3);
# else
            static const int DeBruijnBytePos[32] = { 0, 0, 3, 0, 3, 1, 3, 0,
                                                     3, 2, 2, 1, 3, 2, 0, 1,
                                                     3, 3, 1, 2, 2, 2, 2, 0,
                                                     3, 1, 2, 0, 1, 0, 1, 1 };
            return DeBruijnBytePos[((U32)((val & -(S32)val) * 0x077CB531U)) >> 27];
# endif
        }
    } else {  /* Big Endian CPU */
        if (MEM_64bits()) {
# if defined(_MSC_VER) && defined(_WIN64)
            unsigned long r;
            return (_BitScanReverse64( &r, (unsigned long)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 4)
            return (__builtin_clzll(val) >> 3);
# else
            unsigned r;
            const unsigned n32 = sizeof(size_t)*4;   /* calculate this way due to compiler complaining in 32-bits mode */
            if (!(val>>n32)) { r=4; } else { r=0; val>>=n32; }
            if (!(val>>16)) { r+=2; val>>=8; } else { val>>=24; }
            r += (!val);
            return r;
# endif
        } else { /* 32 bits */
# if defined(_MSC_VER)
            unsigned long r;
            return (_BitScanReverse( &r, (unsigned long)val )) ? (unsigned)(r>>3) : 0;
# elif defined(__GNUC__) && (__GNUC__ >= 3)
            return (__builtin_clz((U32)val) >> 3);
# else
            unsigned r;
            if (!(val>>16)) { r=2; val>>=8; } else { r=0; val>>=24; }
            r += (!val);
            return r;
# endif
        }
    }
}
By the way, the same bug with _BitScanReverse has been seen by others, and is explained in this post:
Isn't all the "Big Endian CPU" stuff wrong in the existing source?
https://github.com/facebook/zstd/blob/957d59c72105fdb074cdca7b00f5b885e820a015/lib/compress/zstd_compress_internal.h#L485
__builtin_clz returns a count of leading zero bits while _BitScanReverse returns the bit index (clz ^ 31).
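(A minimal sketch of that relationship, assuming a non-zero 32-bit input:)

/* For non-zero v, the index of the highest set bit is 31 minus the
 * leading-zero count, i.e. __builtin_clz(v) ^ 31 for 32-bit values. */
unsigned highbit_via_clz(unsigned v)      /* gcc/clang */
{
    return (unsigned)__builtin_clz(v) ^ 31;
}
/* On MSVC, _BitScanReverse(&r, v) stores that same index directly in r. */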
Switching _BitScanForward with _BitScanReverse based on endianness seems logically incorrect... are we working with raw memory here? (I haven't looked through the code...)
@aqrit I think that is a separate question from this issue.
EDIT: while you can't just change _BitScanForward to _BitScanReverse when you swap the byte endianness, because these are bitwise operations, the ZSTD_NbCommonBytes() method is finding the position of the first non-zero byte, so it should work. It reads a byte stream (into a long or long-long) and returns the offset of the first non-zero byte, else zero, though it can't distinguish between something in byte 0 and nothing in any byte. I assume it is tested and correct on little-endian at least.
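To make that concrete, here is a small worked example (a sketch with made-up values, assuming gcc builtins; as I understand it, the value scanned is essentially the XOR of the two words being compared):

/* Illustration with made-up values: bytes 0 and 1 match, byte 2 differs (0x5A). */
void first_diff_example(void)
{
    unsigned long long diff_le = 0x00000000005A0000ULL; /* little-endian load: byte 0 in the low bits  */
    unsigned long long diff_be = 0x00005A0000000000ULL; /* big-endian load:    byte 0 in the high bits */

    unsigned first_le = (unsigned)(__builtin_ctzll(diff_le) >> 3);   /* trailing zeros -> 2 */
    unsigned first_be = (unsigned)(__builtin_clzll(diff_be) >> 3);   /* leading zeros  -> 2 */

    (void)first_le; (void)first_be;
}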
Is it even possible to get a Microsoft big-endian machine? If not, the code is a bit untestable.
It would be nice if the bug in this case could be fixed, though. Not looking at these return codes is wrong, even if it accidentally works on some compilers.
There are shortcuts for this problem: using bsr to find a byte index.
One can guarantee the value of zero is never passed by setting the low bit.
However, it looks like that whole section of code should be deleted.
#if HAVE_FAST_CLZ && defined(__GNUC__) && (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4))
# define HIGHEST_BYTE_INDEX(v) \
v = (((uint32_t)__builtin_clz(v | 1) ^ 31) >> 3)
#else
# define HIGHEST_BYTE_INDEX(v) \
v >>= 8; \
v = ((((v + 0xFFFF00) | 0xFFFFFF) + v) | (v + 0xFF0000)) >> 24
#endif
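(A quick check of the macro with a made-up value, for illustration, assuming <stdint.h> and the HIGHEST_BYTE_INDEX macro above:)

uint32_t v = 0x00012300;   /* highest non-zero byte is byte 2 */
HIGHEST_BYTE_INDEX(v);     /* either branch leaves v == 2 */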
Is there a way to test / experiment with the Intel compiler blow-up effect ? Preferably in some kind of open-source test environment such as TravisCI ?
This would make it possible to continuously ensure that the source code compiles correctly with this compiler.
The Intel compiler is a licensed product. I'm not sure how you can test it. I also had problems with a header compiling on Linux with the Intel compiler (but no code problems).
However, this bug is valid no matter what the compiler (though it only shows up with Intel at the moment).
You can use it for free
The gcc equivalent intrinsics all suffer from the same undefined behavior.
This is understood, I hope?
@bimbashrestha
@aqrit
you are correct!
Though, some systems may patch it to not be undefined? Not all documentation mentions the undefined behaviour, though the official gcc docs seem to.
I will look into fixing that also, for my code. I was experiencing an issue on CentOS 8, though this might not be the cause.
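For example, one way to guard the gcc builtins (a hypothetical wrapper, illustration only; __builtin_ctz is documented as undefined for a zero argument):

static unsigned safe_ctz32(unsigned v)
{
    /* __builtin_ctz(0) is undefined, so handle zero explicitly. */
    return v ? (unsigned)__builtin_ctz(v) : 0;
}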
ZSTD_highbit32(U32 val) and BIT_highbit32(U32 val) both assert(val != 0). So is this an ICC bug, ZSTD bug, or invalid bug?
If it is an ICC bug then this should be noted in the source code and the conditional should be guarded by a more specific #ifdef such that MSVC/Clang/etc. are not affected by the new slower code.
If it is a ZSTD bug then the call sites that passed zero should be examined for correctness.
If zero is indeed valid then the asserts should be deleted and the GCC and IAR intrinsics need to be fixed.
If the bug is invalid then the patch should be rolled back and audioMirror's issue needs to be further examined.
ZSTD_NbCommonBytes(size_t val) is a helper function for ZSTD_count(...). It seems impossible to pass a value of zero to ZSTD_NbCommonBytes from the current call sites. Either an assert(val != 0) should be added, or the return value -- if passed zero -- needs to be determined. Should it return sizeof(size_t), sizeof(size_t) - 1, or 0? Currently, it may return garbage for a value of zero.
However, in my humble opinion, ZSTD_NbCommonBytes should be deleted and
ZSTD_count() should be simplified:
MEM_STATIC size_t ZSTD_count(const BYTE* pIn, const BYTE* pMatch, const BYTE* const pInLimit)
{
    size_t i = 0;
    while ((&pIn[i] < pInLimit) && (pIn[i] == pMatch[i])) i++;
    return i;
}
I'm worried that this issue was not understood based on the patch that was landed.
Could one of the project's contributors acknowledge that this comment has been viewed? @Cyan4973 et al.
It is true that there is an assert() to check that zero is not sent, during debug runs. I can only tell you that zero absolutely did get sent on some streams, under the Intel compiler (optimized builds only). When zero was sent, the old code blew up, and with the fix, it ran fine.
Was this a compiler bug, an un-initialized variable, a hole in the algorithm, or a corruption of the tables? I'm not sure. I just know it happened.
IMO, protecting the _BitScanForwardXX calls and the equivalent Linux methods (which weren't covered in this issue) is the right thing to do, provided that it doesn't hit performance (and they and I have proved that it doesn't). There is no point in having a "blow me up now" path on certain calls. But, yes, there may be another underlying bug lingering somewhere.
Indeed, ZSTD_NbCommonBytes() should never be invoked with a value of 0. That's why it's asserted : it should never happen.
If there is a way to generate this situation, we would love to get a reproduction test and analyze it.
It does not look easy to produce : the number of invocations is small (only 2), and they are both protected by a test checking the value.
However, in my humble opinion (...)
ZSTD_count() should be simplified:
The compression speed would be seriously impacted if the code was using a naive byte-by-byte comparison method instead.
For reference, comparing ZSTD_count() with gcc-9 on an Intel i9-9900K with a byte-by-byte implementation we get these compression speeds:
| Level | NbCommonBytes | Byte-by-Byte |
|---|---|---|
| 1 | 350 MB/s | 314 MB/s (-10%) |
| 2 | 261 MB/s | 241 MB/s (-8%) |
| 3 | 187 MB/s | 176 MB/s (-6%) |
protecting (bsf...) provided that it doesn't hit performance (and they and I have proved that it doesn't)
The benchmark in the PR didn't test the change that was introduced: neither GCC nor Clang uses the path with the MSVC intrinsic, though it may still be performance-neutral as claimed.
ZSTD_count() with a byte-by-byte implementation
This function should see gains from auto-vectorization (if enabled).
ZSTD_count() is just memcmp() that returns an offset, the current function could be beaten on all platforms. Note: the current function already relies on fast unaligned memory access.
ZSTD_count() is just memcmp() that returns an offset
I agree. Unfortunately, that's enough of a difference to make it _not_ a memcmp() invocation.
So it has to be programmed.
This function should see gains from auto-vectorization (if enabled).
I hear that family of arguments from time to time.
I get it that a sufficiently advanced compiler could figure out the logic and provide an optimized variant of its own. But will it ?
Unfortunately, in many cases, it is wishful thinking.
the current function could be beaten on all platforms
Please measure it first. Time and time again, this has proven wrong.
And @terrelln just proved it again merely one message above.
There's also a stark difference between software developed and presumed to run in a controlled environment, where the impact of the compiler can be accurately determined,
versus a widely distributed open-source project, intended to be integrated into a myriad of different environments. In which case, it's just not possible to expect _all_ development configurations to rely on an impressive set of clever tricks that one best-of-a-kind compiler may achieve locally.
So yes, we have code-optimized variants of simple functions, and experience shows it helps.
That being said, whenever a compiler can do a better job, we are keen to let it express itself. We can always use #ifdef macro selectors. So if one can prove that the byte-by-byte variant is actually faster on a detectable set of compiler, version and environment, we may simply create an #ifdef section for it, provided that the gains are significant enough (+0.5% will definitely not cut it; an additional piece of code incurs a maintenance cost, it must be matched by corresponding benefits).
The benchmark in the PR didn't test the change that was introduced: neither GCC nor Clang uses the path with the MSVC intrinsic,
The PR #2031 states that the benchmark is run on a "windows vm, visual 2019".
This would test the MSVC intrinsic code path.
Additional details can be provided by @bimbashrestha.
_edit_ : indeed, the column titles do not seem to specify a valid MSVC configuration, so I guess additional feedback from @bimbashrestha would be welcomed.
Looks like I misunderstood the original issue. The error was reproduced on my Windows VM, but the benchmarks were not done on a Windows VM.
I mistakenly thought the code path was triggered by gcc and clang on non-Windows too (hence the benchmark). Sorry for the confusion everyone.
So the benchmarks on that pr are meaningless
Could this be run again on a Windows VM using MSVC compiler ?
The assumption is that the code change will not impact speed in any meaningful way.
Possibly a better outcome : it may even happen that the assembly generated by the compiler ends up being rigorously identical, since an input of 0 (zero) is not possible, something that the compiler might be able to understand. This could be checked with md5sum for example (assuming no trace of the source code is bundled into the binary).
This function should see gains from auto-vectorization (if enabled).
The benchmark I ran was compiled with gcc -O3 which has auto-vectorization enabled, but is still much worse than the hand-written version.
We could always code performance-critical parts with compiler intrinsics.
We could always code performance-critical parts with compiler intrinsics.
That is what ZSTD_count() is doing. It is using __builtin_clz() to efficiently count the number of matching bytes.
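In essence, the idea is word-at-a-time matching (a simplified sketch, not zstd's actual implementation; it assumes a little-endian machine and uses memcpy for unaligned-safe loads):

#include <stddef.h>
#include <string.h>

/* Count matching leading bytes of a[] and b[], illustration only. */
static size_t count_common_bytes(const unsigned char* a, const unsigned char* b, size_t n)
{
    size_t i = 0;
    while (i + 8 <= n) {
        unsigned long long wa, wb;
        memcpy(&wa, a + i, 8);
        memcpy(&wb, b + i, 8);
        unsigned long long diff = wa ^ wb;
        if (diff)   /* first mismatching byte is inside this word */
            return i + (size_t)(__builtin_ctzll(diff) >> 3);   /* little-endian */
        i += 8;
    }
    while (i < n && a[i] == b[i]) i++;   /* byte-by-byte tail */
    return i;
}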
FWIW,
ZSTD_count w/avx2 intrinsics is worth about 1.75% on the low compression levels.
https://gist.github.com/aqrit/f83a5004110cf5caee4839566cfa5fb8