cmake -DBoost_USE_STATIC_LIBS=ON -DBoost_USE_MULTITHREADED=ON -DBoost_USE_STATIC_RUNTIME=OFF -DCMAKE_BUILD_TYPE=Release -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_FIBERS_ENABLE=OFF -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLE=OFF -DALPAKA_ACC_CPU_BT_OMP4_ENABLE=OFF -DALPAKA_ACC_GPU_CUDA_ENABLE=ON -DALPAKA_ACC_GPU_HIP_ENABLE=OFF -DALPAKA_DEBUG=0 -DALPAKA_CUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -Dalpaka_BUILD_EXAMPLES=ON -DBUILD_TESTING=ON ..
build/test/unit/math/math
using seed: 1337
testing:
3 - accelerators !
17 - unary math operators
6 - binary math operators
testing with two data types
total 2 * accelerators * (unary + binary) * capacity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
math is a Catch v2.11.0 host application.
Run with -? for options
-------------------------------------------------------------------------------
mathOps
-------------------------------------------------------------------------------
/alpaka/test/unit/math/src/math.cpp:173
...............................................................................
/alpaka/test/unit/math/src/math.cpp:144: FAILED:
REQUIRE( results(i) == Approx(std_result) )
with expansion:
-inf == Approx( inf )
with messages:
Operator: OpExp
Type: d
The args buffer:
capacity: 1000
0: [ 0, ]
1: [ 1.797693134862316e+308, ]
2: [ -1.797693134862316e+308, ]
3: [ -866.0227473557505, ]
4: [ 215.2263814000266, ]
5: [ -748.4321206839105, ]
6: [ 642.1264531942572, ]
7: [ -684.4018728416871, ]
8: [ 459.689454949301, ]
9: [ -222.9117748544045, ]
10: [ 209.1584889667682, ]
11: [ -292.4749641267939, ]
12: [ 368.071562313879, ]
13: [ -215.1164257009424, ]
14: [ 993.5934513103955, ]
15: [ -503.7756326294896, ]
16: [ 289.652593683621, ]
17: [ -491.8247722903218, ]
18: [ 58.77905907878117, ]
19: [ -460.0548273044462, ]
20: [ 849.1603358834052, ]
21: [ -320.8197171238272, ]
22: [ 457.8027599177277, ]
23: [ -669.7326682649812, ]
24: [ 236.7513732381438, ]
25: [ -181.0255020122907, ]
26: [ 853.430905473642, ]
27: [ -482.0493555522171, ]
28: [ 727.3698115061305, ]
# ...
Tested on fwk394 with CUDA 10.1, Clang 10.0, CMake 3.16.5 and Boost 1.73.0 via Spack. It also fails with the Alpaka-CI Docker image (the image is not publicly available at the moment; I am working on it).
From the test result I would say that alpaka::math::exp returns -inf in the case where std::exp returns inf. As we do not see the input value, it is hard to tell what went wrong.
We are using std::numeric_limits<double>::max() as the argument, and std::exp(std::numeric_limits<double>::max()) == inf.
Could this be different for clang CUDA?
We should check the specs for this function and also check that we have no cast to float somewhere.
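The host-side expectation can be checked in isolation. The following is a minimal sketch, not part of the alpaka test suite; the helper names expAtMax and expAtLowest are hypothetical. It evaluates std::exp at the largest and lowest finite value of a type, for double as well as float, assuming an IEEE 754 conforming host math library without fast math.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: std::exp at the largest finite value of T.
// The true result overflows, so a conforming host library is expected to return +inf.
template<typename T>
T expAtMax()
{
    return std::exp(std::numeric_limits<T>::max());
}

// Hypothetical helper: std::exp at the lowest finite value of T.
// The true result underflows, so the expected value is 0, never -inf.
template<typename T>
T expAtLowest()
{
    return std::exp(std::numeric_limits<T>::lowest());
}
```

If both expAtMax<double>() and expAtMax<float>() are positive infinity on the host, working in float instead of double would not by itself explain a sign flip to -inf.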
The container is online. You can simply reproduce the problem with the following script:
docker run --runtime=nvidia -it registry.gitlab.com/hzdr/crp/alpaka-group-container/alpaka-ci:cuda10.1Clang
export CUDA_VISIBLE_DEVICES="1"
git clone https://github.com/alpaka-group/alpaka.git
mkdir alpaka/build && cd alpaka/build
cmake -DBOOST_ROOT=/opt/boost/1.73.0/ -DBOOST_LIBRARYDIR="/opt/boost/1.73.0/lib" -DBoost_USE_STATIC_LIBS=ON -DBoost_USE_MULTITHREADED=ON -DBoost_USE_STATIC_RUNTIME=OFF -DCMAKE_BUILD_TYPE=Release -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_FIBERS_ENABLE=OFF -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLE=OFF -DALPAKA_ACC_CPU_BT_OMP4_ENABLE=OFF -DALPAKA_ACC_GPU_CUDA_ENABLE=ON -DALPAKA_ACC_GPU_HIP_ENABLE=OFF -DALPAKA_DEBUG=0 -DALPAKA_CUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++-10 -Dalpaka_BUILD_EXAMPLES=ON -DBUILD_TESTING=ON ..
make -j14
test/unit/math/math
@SimeonEhrig is it still not fixed (I mean, I didn't fix it, but maybe some change in the meantime did)? My guess is no one knows, but maybe you do :)
Nope, the test is still failing. I updated the container (registry.gitlab.com/hzdr/crp/alpaka-group-container/alpaka-ci-cuda101-clang:1.3) and the cmake configure command:
cmake -DBOOST_ROOT=/opt/boost/1.75.0/ -DBOOST_LIBRARYDIR="/opt/boost/1.75.0/lib" -DBoost_USE_STATIC_LIBS=ON -DBoost_USE_MULTITHREADED=ON -DBoost_USE_STATIC_RUNTIME=OFF -DCMAKE_BUILD_TYPE=Release -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON -DALPAKA_ACC_GPU_CUDA_ENABLE=ON -DALPAKA_DEBUG=0 -DALPAKA_CUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++-11 -Dalpaka_BUILD_EXAMPLES=ON -DBUILD_TESTING=ON ..
I've checked the test on the configuration @SimeonEhrig provided. Calling pow(max, max) with max = std::numeric_limits<double>::max() natively in this configuration does indeed result in negative infinity. So it is not an alpaka bug or a wrong type conversion causing it, but the underlying implementation providing this wrong result.
A reproducer, in case we want to submit a bug report:
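#include <cstdio>
#include <limits>
__global__ void powTestKernel(double value)
{
    double res = ::pow(value, value);
    printf("Native CUDA kernel result: pow(%lf, %lf) = %lf\n", value, value, res);
}
int main()
{
    auto const m = std::numeric_limits<double>::max();
    powTestKernel<<<1, 1>>>(m);
    cudaDeviceSynchronize(); // wait for the kernel so the device-side printf output is flushed
}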
compiled with clang CUDA as in the docker image outputs
Native CUDA kernel result: pow(179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000) = -inf
(should be inf, not -inf, and works fine with nvcc)
The only thing I didn't check is whether this is allowed in clang-CUDA with the flags we use. However, I doubt it: the CUDA programming guide, for example, lists no exceptions for this usage of pow, so it should adhere to some reasonable error margins.
The reason these tests fail is the use of fast math. I tried to fix it in the past by scanning the CMake alpaka target for the fast-math option, removing it for the test, and adding it back afterwards so as not to influence other tests. We could also think about disabling fast math for all alpaka tests. The latter would be the easiest.
We have the same issue with HIP-clang.
It would also be possible to activate fast math for the standard library function we use to create the reference value.
I tried to fix this issue with the preprocessor directives #pragma push and #pragma pop, but this functionality is not available in clang when compiling for CUDA or HIP.
I'm not sure I understand correctly, but fast math is something that is handled by the user. Why is it activated by alpaka?
Here is a snippet showing how to remove a flag from a target: https://stackoverflow.com/a/49216539
Yes, the user is free to enable or disable fast math. By default, it is enabled for alpaka.
The problem is that our math tests will use whatever we configure in CMake.
The math tests compute the result of a math function on the device and an equivalent result on the host using the standard library math functions. In the default case, where fast math is activated, we compare a device result computed with fast math against a host result computed without it. This is the reason for the unit test failure.
Isn't this super dangerous? This means any program using alpaka does not have conformant floating point behavior.
My personal opinion is that the default should be to have fast math disabled. Having the test fail with fast math is somewhat expected. We could add a better error message for this specific test (at runtime or compile time) which makes it obvious that fast math is the reason for the failure.
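One way such a hint could look is sketched below. It assumes that the host compiler predefines the __FAST_MATH__ macro when -ffast-math is active, which GCC and Clang do; whether the macro is also set for the device pass of clang CUDA or by nvcc's own fast-math option is not confirmed in this thread. The names fastMathEnabled and fastMathHint are hypothetical and not part of alpaka.

```cpp
#include <string>

// Assumption: __FAST_MATH__ is predefined by the compiler when fast math is enabled
// (true for GCC and Clang with -ffast-math; other toolchains may differ).
constexpr bool fastMathEnabled()
{
#ifdef __FAST_MATH__
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: a message a math test could print next to a failed comparison.
// Returns an empty string when fast math was not detected.
inline std::string fastMathHint()
{
    return fastMathEnabled()
        ? "note: built with fast math; deviations from the host reference are expected"
        : "";
}
```

A test could append fastMathHint() to its failure message, or use fastMathEnabled() in a static_assert to reject the configuration at compile time.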
I believe fast math is not the reason for this test failure. According to the CUDA programming guide (section F), double-precision pow is not affected by fast math to begin with. But even for the single-precision one, which is affected, it seems this behaviour should not happen (with the single-precision max instead).
Setting the default to OFF will not solve the problem. If the user enables fast math and executes the tests, they will fail again.
Changing the default will also affect all existing clients of alpaka. I do not see a need to change the default behavior and force all clients to handle it. It is a conventional issue in our unit tests.
Everyone should be aware that fast math will not give the same results on each device.
You can always disable this behavior.
Intel's icc enables fast math with -O3, whereas gcc enables it with -Ofast.
Enabling or disabling fast math is independent of this issue. The math tests should always be executed without fast math to ensure we have reproducible results.
That is true. However, "not the same" is not the same as "completely wrong".
Not sure if https://github.com/alpaka-group/alpaka/pull/1190 is kicking into this topic too.
You posted here the issue with pow. We need to check the exponent.
I've tried removing fast math. On the Docker image described above, that test no longer fails. But there are two math test cases that fail sometimes, i.e. some runs are fully green and some have a failure in one of those two tests. That is really bad behavior for CI.
When the tests mentioned in my message above (clang-CUDA, no fast math) fail, it is because two very small values are compared, or one very small value and 0. I feel we use a potentially weird way of comparing the numbers, one that would often fail in such a situation.
To give an example, the double-precision pow(343.1018, -16.14939) sometimes gives 1.13351e-41 in alpaka and 1.13365e-41 in the standard library, so the test fails; I explain why below. I am not sure why it is not fully consistent. But each time it fails (there is also a similarly inconsistent case of exp(large_number, negative_large_number)), the result is nearly correct rather than obviously garbage data, so I'm inclined to think it is not a data race but simply what the backend implementation returns through alpaka.
We use the pattern alpaka_result == Approx(std_library_result). Approx is from Catch2. I've looked at its code: the check is true if any of its comparisons is true. The first one uses the margin set on Approx.
We never set a margin, so it is 0 by default and this check boils down to alpaka_result == std_library_result in our case. So in our current usage this check does nothing, as it is a subset of the next check. We can try to add a small non-zero margin so that the first check passes for two nearly, but not exactly, equal small numbers. It is difficult to choose a reasonable one though, as the test then becomes too imprecise for some cases.
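The comparison described above can be sketched as follows. This is a reconstruction from my reading of the Catch2 v2 sources, not the library code itself: the function names are hypothetical, and the default values (epsilon of 100 times the float machine epsilon, margin 0, scale 0) are assumptions about that version.

```cpp
#include <cmath>
#include <limits>

// Symmetric check that lhs and rhs differ by at most margin.
inline bool withinMargin(double lhs, double rhs, double margin)
{
    return (lhs + margin >= rhs) && (rhs + margin >= lhs);
}

// Sketch of `other == Approx(value)`: passes if either the absolute margin check
// or the check with a tolerance relative to |value| passes.
inline bool approxEqual(
    double other,
    double value,
    double margin = 0.0, // assumed default
    double epsilon = std::numeric_limits<float>::epsilon() * 100, // assumed default
    double scale = 0.0) // assumed default
{
    return withinMargin(value, other, margin)
        || withinMargin(value, other, epsilon * (scale + std::fabs(value)));
}
```

Under these assumptions, approxEqual(1.13351e-41, 1.13365e-41) is false: the relative difference is about 1.2e-4, while the tolerance is about 1.2e-5. A reference value of exactly 0 makes the relative tolerance 0 as well, so any non-zero result fails. Passing a margin such as 1e-40 makes the first example pass, but the same margin would hide real errors for results of that magnitude.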