Stl: <bit>: Concern about lzcnt used as bsr correctness

Created on 29 Jul 2020 · 18Comments · Source: microsoft/STL

@statementreply asks:

I’m concerned about the correctness of our x86/x64 implementation of countl_zero, on CPUs without LZCNT support.

// We use lzcnt (actually bsr if lzcnt is not supported)

bsr(x) == integer_width - 1 - lzcnt(x) when x != 0, so the fallback won’t work.
I currently don’t have access to a computer with pre-AVX2 CPU. Could someone help testing this?

bug fixed

Source

AlexGuteniev

Most helpful comment

Like this:

template <class _Ty>
_NODISCARD int _Unchecked_x86_x64_countl_zero(const _Ty _Val) noexcept {
    constexpr int _Digits  = numeric_limits<_Ty>::digits;

    // We use lzcnt (actually bsr if lzcnt is not supported) now that we know
    // we're not zero. We can do this because lzcnt and bsr share the same instruction
    // encoding.
    if constexpr (_Digits <= 16) {
        return static_cast<int>(__lzcnt16(_Val) - (16 - _Digits));
    } else if constexpr (_Digits == 32) {
        return static_cast<int>(__lzcnt(_Val));
    } else {
#ifdef _M_IX86
        static_assert(_Digits <= 32, "Should have handled this in _Checked_x86_x64_countl_zero");
#else // ^^^ _M_IX86 / !_M_IX86 vvv
        return static_cast<int>(__lzcnt64(_Val));
#endif // _M_IX86
    }
    // note: we don't need to call a fallback here because
    // all supported x86 processors at least have bsr/bsf
}

template <class _Ty>
_NODISCARD int _Checked_x86_x64_countl_zero(const _Ty _Val) noexcept {
    if constexpr (sizeof(_Ty) > sizeof(void*)) {
        const unsigned int _High = _Val >> 32;
        const auto _Low          = static_cast<unsigned int>(_Val);
        if (_High == 0) {
            return 32 + _Checked_x86_x64_countl_zero(_Low);
        } else {
            return _Checked_x86_x64_countl_zero(_High);
        }
    } else {
        int _Result            = _Unchecked_x86_x64_countl_zero(_Val);
#ifndef __AVX2__
        const bool _Have_lzcnt = __have_abm;
        // lzcnt (when it doesn't fall back to bsr) is defined correctly for zero
        // bsr has undefined output for zero
        if (!_Have_lzcnt) {
            if (_Val == 0) {
                _Result = 0;
            } else {
                // bsr counts from least significant bit, lzcnt counts from most significant bit, apply correction
                constexpr int _Digits = numeric_limits<_Ty>::digits;
                _Result               = _Digits - 1 - _Result;
            }
        }
#endif // __AVX2__
        return _Result;
    }
}

AlexGuteniev on 29 Jul 2020

👍2

All 18 comments

I believe this is indeed a bug; we’ll need pre-Haswell hardware to validate a fix.

StephanTLavavej on 29 Jul 2020

I have AMD FX8300, which doesn't support AVX2, only first AVX. If you write test I can compile and run it...

fsb4000 on 29 Jul 2020

The program that demonstrates the problem is as follows:

#include <intrin.h>
#include <iostream>

int main()
{
    volatile unsigned i = 0x0FFF'FFFC;

    unsigned long bsf;
    _BitScanForward(&bsf, i);
    unsigned tzcnt = __builtin_ctz(i);

    unsigned long bsr;
    _BitScanReverse(&bsr, i);
    unsigned lzcnt = __lzcnt(i);

    std::cout << "BSF:" << bsf << " TZCNT:" << tzcnt << " BSR:" << bsr << " LZCNT:" << lzcnt << "\n";
}

Expected output with LZCNT/TZCNT support:

BSF:2 TZCNT:2 BSR:27 LZCNT:4

Expected output without support:

BSF:2 TZCNT:2 BSR:27 LZCNT:27

But it would be more useful to validate fix, not bug.
For this, need first the PR itself, and the second run it on old CPU

AlexGuteniev on 29 Jul 2020

I see as a good fix as applying the correction after lzcnt call.
But for this fix to work properly need to expose ABM CPUID bit, and don't rely on __ISA_AVAILABLE_AVX2, since unneccessary correction is also a bug.

As a not so good fix, we can just use _BitScanReverse and correct it always.

AlexGuteniev on 29 Jul 2020

I think the best fix will be to give up on always using the same instruction - it's too awkward to both check for 0 before and possibly adjust after. I think we should simply switch on availability and use either lzcnt or bsr.

CaseyCarter on 29 Jul 2020

I think the best fix will be to give up on always using the same instruction - it's too awkward to both check for 0 before and possibly adjust after. I think we should simply switch on availability and use either lzcnt or bsr.

It can be done as adjust after, and handle 0 case as separate branch of adjustment.

AlexGuteniev on 29 Jul 2020

Like this:

template <class _Ty>
_NODISCARD int _Unchecked_x86_x64_countl_zero(const _Ty _Val) noexcept {
    constexpr int _Digits  = numeric_limits<_Ty>::digits;

    // We use lzcnt (actually bsr if lzcnt is not supported) now that we know
    // we're not zero. We can do this because lzcnt and bsr share the same instruction
    // encoding.
    if constexpr (_Digits <= 16) {
        return static_cast<int>(__lzcnt16(_Val) - (16 - _Digits));
    } else if constexpr (_Digits == 32) {
        return static_cast<int>(__lzcnt(_Val));
    } else {
#ifdef _M_IX86
        static_assert(_Digits <= 32, "Should have handled this in _Checked_x86_x64_countl_zero");
#else // ^^^ _M_IX86 / !_M_IX86 vvv
        return static_cast<int>(__lzcnt64(_Val));
#endif // _M_IX86
    }
    // note: we don't need to call a fallback here because
    // all supported x86 processors at least have bsr/bsf
}

template <class _Ty>
_NODISCARD int _Checked_x86_x64_countl_zero(const _Ty _Val) noexcept {
    if constexpr (sizeof(_Ty) > sizeof(void*)) {
        const unsigned int _High = _Val >> 32;
        const auto _Low          = static_cast<unsigned int>(_Val);
        if (_High == 0) {
            return 32 + _Checked_x86_x64_countl_zero(_Low);
        } else {
            return _Checked_x86_x64_countl_zero(_High);
        }
    } else {
        int _Result            = _Unchecked_x86_x64_countl_zero(_Val);
#ifndef __AVX2__
        const bool _Have_lzcnt = __have_abm;
        // lzcnt (when it doesn't fall back to bsr) is defined correctly for zero
        // bsr has undefined output for zero
        if (!_Have_lzcnt) {
            if (_Val == 0) {
                _Result = 0;
            } else {
                // bsr counts from least significant bit, lzcnt counts from most significant bit, apply correction
                constexpr int _Digits = numeric_limits<_Ty>::digits;
                _Result               = _Digits - 1 - _Result;
            }
        }
#endif // __AVX2__
        return _Result;
    }
}

AlexGuteniev on 29 Jul 2020

👍2

Actually, I think I can finalize this to a PR, I know how to do it without magic static or magic extern...

AlexGuteniev on 29 Jul 2020

Good point. I think we also need to revert the "fixup" for 8-bit arguments in return static_cast<int>(__lzcnt16(_Val) - (16 - _Digits)); before performing the post-adjustment.

CaseyCarter on 29 Jul 2020

Expected output with LZCNT/TZCNT support:
BSF:2 TZCNT:2 BSR:27 LZCNT:4
Expected output without support:
BSF:2 TZCNT:2 BSR:27 LZCNT:27
But it would be more useful to validate fix, not bug.
For this, need first the PR itself, and the second run it on old CPU

Sure, will do. For now, I just do some screenshots to confirm the bug: https://imgur.com/a/zsiBqym

fsb4000 on 29 Jul 2020

❤1

I just do some screenshots to confirm the bug: https://imgur.com/a/zsiBqym

Then it is not old enough, it supports ABM.

AlexGuteniev on 29 Jul 2020

Then it is not old enough, it supports ABM.

Oh, ok.

Sorry, I can't help...

I have another PC with AMD Phenom II 925, but it is also support ABM.

https://en.wikipedia.org/wiki/List_of_AMD_Phenom_microprocessors#%22Deneb%22_(C2/C3,_45_nm,_Quad-core)

All models support: MMX, SSE, SSE2, SSE3, SSE4a, ABM, Enhanced 3DNow!, NX bit, AMD64, Cool'n'Quiet, AMD-V

fsb4000 on 29 Jul 2020

I have a Celeron D, it should be missing ABM. It takes time to try it though, not today.

AlexGuteniev on 29 Jul 2020

👍1

I have a laptop with i7-2xxxQM, but not at hand. It could take a few days for me to (physically) try it.

statementreply on 30 Jul 2020

But for this fix to work properly need to expose ABM CPUID bit, and don't rely on __ISA_AVAILABLE_AVX2, since unneccessary correction is also a bug.

It sounds somewhat risky to me. We have to prove that no CPU actually implements LZCNT (or does anything other than BSR) without ABM bit set.

statementreply on 30 Jul 2020

We have to prove that no CPU actually implements LZCNT (or does anything other than BSR) without ABM bit set.

Wikipedia says yes. But it is not full proof, sure. In my pr I did detection by trying.

AlexGuteniev on 30 Jul 2020

🚀1

So, I've implemented LZCNT / BSR dispatch based on AVX2 detection, since the detection is too tricky.

@fsb4000 . you can actually convey useful tests on pre-AVX2 machines you have. Just run P0553R4_bit_rotating_and_counting_functions with my PR. (I did it myself too by just temporary changing code)

AlexGuteniev on 30 Jul 2020

#include <stdio.h>
#include <bit>

using namespace std;

int main()
{
    printf("has_single_bit(2)=%d\n", has_single_bit(2u));
    printf("has_single_bit(3)=%d\n", has_single_bit(3u));
    printf("bit_ceil(34)=%d\n", bit_ceil(34u));
    printf("bit_floor(34)=%d\n", bit_floor(34u));
    printf("bit_width(34)=%d\n", bit_width(34u));
    printf("rotl(34,2)=%d\n", rotl(34u,2));
    printf("rotr(34,2)=%d\n", rotr(34u,2));
    printf("countl_zero(34)=%d\n", countl_zero(34u));
    printf("countl_one(34)=%d\n", countl_one(34u));
    printf("countr_zero(34)=%d\n", countr_zero(34u));
    printf("countr_one(34)=%d\n", countr_one(34u));
    printf("countr_one(35)=%d\n", countr_one(35u));
    printf("popcount(34)=%d\n", popcount(34u));
}

cl /nologo /EHsc /W4 /WX /MDd /std:c++latest main.cpp

main.exe
has_single_bit(2)=1
has_single_bit(3)=0
bit_ceil(34)=64
bit_floor(34)=32
bit_width(34)=6
rotl(34,2)=136
rotr(34,2)=-2147483640
countl_zero(34)=26
countl_one(34)=0
countr_zero(34)=1
countr_one(34)=0
countr_one(35)=2
popcount(34)=2

fsb4000 on 30 Jul 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Can more math functions be made constexpr?

Andersama · 16Comments

<algorithm>: Consider using *_Prev_iter(it) instead of *(it - 1)

StephanTLavavej · 10Comments

vNext: Make the abi-breaking-changes branch available on GitHub

jonwil · 24Comments

<fstream>: Breaks with modules?

gmcode · 9Comments

STL: Use normal for-loops when possible

StephanTLavavej · 12Comments