@statementreply asks:
I鈥檓 concerned about the correctness of our x86/x64 implementation of countl_zero, on CPUs without LZCNT support.
// We use lzcnt (actually bsr if lzcnt is not supported)
bsr(x) == integer_width - 1 - lzcnt(x) when x != 0, so the fallback won鈥檛 work.
I currently don鈥檛 have access to a computer with pre-AVX2 CPU. Could someone help testing this?
I believe this is indeed a bug; we鈥檒l need pre-Haswell hardware to validate a fix.
I have AMD FX8300, which doesn't support AVX2, only first AVX. If you write test I can compile and run it...
The program that demonstrates the problem is as follows:
#include <intrin.h>
#include <iostream>
int main()
{
volatile unsigned i = 0x0FFF'FFFC;
unsigned long bsf;
_BitScanForward(&bsf, i);
unsigned tzcnt = __builtin_ctz(i);
unsigned long bsr;
_BitScanReverse(&bsr, i);
unsigned lzcnt = __lzcnt(i);
std::cout << "BSF:" << bsf << " TZCNT:" << tzcnt << " BSR:" << bsr << " LZCNT:" << lzcnt << "\n";
}
Expected output with LZCNT/TZCNT support:
BSF:2 TZCNT:2 BSR:27 LZCNT:4
Expected output without support:
BSF:2 TZCNT:2 BSR:27 LZCNT:27
But it would be more useful to validate fix, not bug.
For this, need first the PR itself, and the second run it on old CPU
I see as a good fix as applying the correction after lzcnt call.
But for this fix to work properly need to expose ABM CPUID bit, and don't rely on __ISA_AVAILABLE_AVX2, since unneccessary correction is also a bug.
As a not so good fix, we can just use _BitScanReverse and correct it always.
I think the best fix will be to give up on always using the same instruction - it's too awkward to both check for 0 before and possibly adjust after. I think we should simply switch on availability and use either lzcnt or bsr.
I think the best fix will be to give up on always using the same instruction - it's too awkward to both check for
0before and possibly adjust after. I think we should simply switch on availability and use eitherlzcntorbsr.
It can be done as adjust after, and handle 0 case as separate branch of adjustment.
Like this:
template <class _Ty>
_NODISCARD int _Unchecked_x86_x64_countl_zero(const _Ty _Val) noexcept {
constexpr int _Digits = numeric_limits<_Ty>::digits;
// We use lzcnt (actually bsr if lzcnt is not supported) now that we know
// we're not zero. We can do this because lzcnt and bsr share the same instruction
// encoding.
if constexpr (_Digits <= 16) {
return static_cast<int>(__lzcnt16(_Val) - (16 - _Digits));
} else if constexpr (_Digits == 32) {
return static_cast<int>(__lzcnt(_Val));
} else {
#ifdef _M_IX86
static_assert(_Digits <= 32, "Should have handled this in _Checked_x86_x64_countl_zero");
#else // ^^^ _M_IX86 / !_M_IX86 vvv
return static_cast<int>(__lzcnt64(_Val));
#endif // _M_IX86
}
// note: we don't need to call a fallback here because
// all supported x86 processors at least have bsr/bsf
}
template <class _Ty>
_NODISCARD int _Checked_x86_x64_countl_zero(const _Ty _Val) noexcept {
if constexpr (sizeof(_Ty) > sizeof(void*)) {
const unsigned int _High = _Val >> 32;
const auto _Low = static_cast<unsigned int>(_Val);
if (_High == 0) {
return 32 + _Checked_x86_x64_countl_zero(_Low);
} else {
return _Checked_x86_x64_countl_zero(_High);
}
} else {
int _Result = _Unchecked_x86_x64_countl_zero(_Val);
#ifndef __AVX2__
const bool _Have_lzcnt = __have_abm;
// lzcnt (when it doesn't fall back to bsr) is defined correctly for zero
// bsr has undefined output for zero
if (!_Have_lzcnt) {
if (_Val == 0) {
_Result = 0;
} else {
// bsr counts from least significant bit, lzcnt counts from most significant bit, apply correction
constexpr int _Digits = numeric_limits<_Ty>::digits;
_Result = _Digits - 1 - _Result;
}
}
#endif // __AVX2__
return _Result;
}
}
Actually, I think I can finalize this to a PR, I know how to do it without magic static or magic extern...
Good point. I think we also need to revert the "fixup" for 8-bit arguments in return static_cast<int>(__lzcnt16(_Val) - (16 - _Digits)); before performing the post-adjustment.
Expected output with LZCNT/TZCNT support:
BSF:2 TZCNT:2 BSR:27 LZCNT:4Expected output without support:
BSF:2 TZCNT:2 BSR:27 LZCNT:27But it would be more useful to validate fix, not bug.
For this, need first the PR itself, and the second run it on old CPU
Sure, will do. For now, I just do some screenshots to confirm the bug: https://imgur.com/a/zsiBqym
I just do some screenshots to confirm the bug: https://imgur.com/a/zsiBqym
Then it is not old enough, it supports ABM.
Then it is not old enough, it supports ABM.
Oh, ok.
Sorry, I can't help...
I have another PC with AMD Phenom II 925, but it is also support ABM.
https://en.wikipedia.org/wiki/List_of_AMD_Phenom_microprocessors#%22Deneb%22_(C2/C3,_45_nm,_Quad-core)
All models support: MMX, SSE, SSE2, SSE3, SSE4a, ABM, Enhanced 3DNow!, NX bit, AMD64, Cool'n'Quiet, AMD-V
I have a Celeron D, it should be missing ABM. It takes time to try it though, not today.
I have a laptop with i7-2xxxQM, but not at hand. It could take a few days for me to (physically) try it.
But for this fix to work properly need to expose ABM CPUID bit, and don't rely on
__ISA_AVAILABLE_AVX2, since unneccessary correction is also a bug.
It sounds somewhat risky to me. We have to prove that no CPU actually implements LZCNT (or does anything other than BSR) without ABM bit set.
We have to prove that no CPU actually implements LZCNT (or does anything other than BSR) without ABM bit set.
Wikipedia says yes. But it is not full proof, sure. In my pr I did detection by trying.
So, I've implemented LZCNT / BSR dispatch based on AVX2 detection, since the detection is too tricky.
@fsb4000 . you can actually convey useful tests on pre-AVX2 machines you have. Just run P0553R4_bit_rotating_and_counting_functions with my PR. (I did it myself too by just temporary changing code)
#include <stdio.h>
#include <bit>
using namespace std;
int main()
{
printf("has_single_bit(2)=%d\n", has_single_bit(2u));
printf("has_single_bit(3)=%d\n", has_single_bit(3u));
printf("bit_ceil(34)=%d\n", bit_ceil(34u));
printf("bit_floor(34)=%d\n", bit_floor(34u));
printf("bit_width(34)=%d\n", bit_width(34u));
printf("rotl(34,2)=%d\n", rotl(34u,2));
printf("rotr(34,2)=%d\n", rotr(34u,2));
printf("countl_zero(34)=%d\n", countl_zero(34u));
printf("countl_one(34)=%d\n", countl_one(34u));
printf("countr_zero(34)=%d\n", countr_zero(34u));
printf("countr_one(34)=%d\n", countr_one(34u));
printf("countr_one(35)=%d\n", countr_one(35u));
printf("popcount(34)=%d\n", popcount(34u));
}
cl /nologo /EHsc /W4 /WX /MDd /std:c++latest main.cpp
main.exe
has_single_bit(2)=1
has_single_bit(3)=0
bit_ceil(34)=64
bit_floor(34)=32
bit_width(34)=6
rotl(34,2)=136
rotr(34,2)=-2147483640
countl_zero(34)=26
countl_one(34)=0
countr_zero(34)=1
countr_one(34)=0
countr_one(35)=2
popcount(34)=2
Most helpful comment
Like this: