Alpaka: math test failed with Clang as CUDA compiler

Created on 2 Jun 2020 · 24 Comments · Source: alpaka-group/alpaka

cmake -DBoost_USE_STATIC_LIBS=ON -DBoost_USE_MULTITHREADED=ON -DBoost_USE_STATIC_RUNTIME=OFF -DCMAKE_BUILD_TYPE=Release -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_FIBERS_ENABLE=OFF -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLE=OFF -DALPAKA_ACC_CPU_BT_OMP4_ENABLE=OFF -DALPAKA_ACC_GPU_CUDA_ENABLE=ON -DALPAKA_ACC_GPU_HIP_ENABLE=OFF -DALPAKA_DEBUG=0 -DALPAKA_CUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -Dalpaka_BUILD_EXAMPLES=ON -DBUILD_TESTING=ON ..

build/test/unit/math/math
using seed: 1337

testing:
 3 - accelerators !
17 - unary math operators
6 - binary math operators
testing with two data types
total 2 * accelerators * (unary + binary) * capacity


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
math is a Catch v2.11.0 host application.
Run with -? for options

-------------------------------------------------------------------------------
mathOps
-------------------------------------------------------------------------------
/alpaka/test/unit/math/src/math.cpp:173
...............................................................................

/alpaka/test/unit/math/src/math.cpp:144: FAILED:
  REQUIRE( results(i) == Approx(std_result) )
with expansion:
  -inf == Approx( inf )
with messages:
  Operator: OpExp
  Type: d
  The args buffer: 
  capacity: 1000
  0: [ 0, ]
  1: [ 1.797693134862316e+308, ]
  2: [ -1.797693134862316e+308, ]
  3: [ -866.0227473557505, ]
  4: [ 215.2263814000266, ]
  5: [ -748.4321206839105, ]
  6: [ 642.1264531942572, ]
  7: [ -684.4018728416871, ]
  8: [ 459.689454949301, ]
  9: [ -222.9117748544045, ]
  10: [ 209.1584889667682, ]
  11: [ -292.4749641267939, ]
  12: [ 368.071562313879, ]
  13: [ -215.1164257009424, ]
  14: [ 993.5934513103955, ]
  15: [ -503.7756326294896, ]
  16: [ 289.652593683621, ]
  17: [ -491.8247722903218, ]
  18: [ 58.77905907878117, ]
  19: [ -460.0548273044462, ]
  20: [ 849.1603358834052, ]
  21: [ -320.8197171238272, ]
  22: [ 457.8027599177277, ]
  23: [ -669.7326682649812, ]
  24: [ 236.7513732381438, ]
  25: [ -181.0255020122907, ]
  26: [ 853.430905473642, ]
  27: [ -482.0493555522171, ]
  28: [ 727.3698115061305, ]
# ...

Tested on fwk394 with CUDA 10.1, Clang 10.0, CMake 3.16.5 and Boost 1.73.0 via Spack. It also fails with the Alpaka-CI Docker image (the image is not publicly available at the moment; I am working on it).

1 Bug Testing

SimeonEhrig

Most helpful comment

When the tests mentioned in my message above (clang-CUDA, no fast-math) fail, it's because two very small values are compared, or one very small value and 0. I feel we use a potentially weird way of comparing the numbers, one that would often fail in such a situation.

To give an example, the double-precision pow(343.1018, -16.14939) sometimes gives 1.13351e-41 in alpaka and 1.13365e-41 in the standard library, so the test fails; I explain why below. I am not sure why it is not fully consistent. But each time it fails (there is also a similarly inconsistent case of exp(large_number, negative_large_number)), the result is nearly correct rather than obviously garbage data, so I'm inclined to think it is not a data race issue, just what the backend implementation actually returns to alpaka.

We do the pattern of alpaka_result == Approx(std_library_result). Approx is from catch2, I've looked at its code and it means that the check is true if any of the following is true:

The absolute difference between the two numbers is within the margin given to Approx. We never set a margin, so it is 0 by default and this check boils down to alpaka_result == std_library_result in our case. So in our current usage this check does nothing, as it is a subset of the next check.
The relative difference between the two numbers is within the given epsilon (there is actually another parameter that makes this check more generic, but the way we use it, it is just the relative difference). We never set an epsilon, so it uses its default value, which seems reasonable to me. For small enough numbers, this check also fails unless they are exactly equal, as can be seen in the failing example.

We can try to add a small non-zero margin so that the first check passes for two small numbers that are nearly, but not exactly, equal. It is difficult to choose a reasonable one though, as the test then becomes too imprecise for some cases.
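
The two checks described above can be sketched in plain C++. This is a minimal stand-in written for illustration, not Catch2's own code: the names withinMargin and approxEqual are hypothetical, and the default epsilon (100 times the float machine epsilon) is an assumption about the Catch2 v2 default.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for the comparison described above; not Catch2's code.
// True if lhs and rhs differ by at most `margin`.
// Infinities of opposite sign never match for a finite margin.
inline bool withinMargin(double lhs, double rhs, double margin)
{
    return (lhs + margin >= rhs) && (rhs + margin >= lhs);
}

// `margin` is an absolute tolerance, `epsilon` is relative to the reference value.
// The default epsilon is an assumption about the Catch2 v2 default.
inline bool approxEqual(
    double result,
    double reference,
    double margin = 0.0,
    double epsilon = std::numeric_limits<float>::epsilon() * 100)
{
    assert(margin >= 0.0 && epsilon >= 0.0);
    // An infinite reference contributes no relative tolerance.
    double const scale = std::isinf(reference) ? 0.0 : std::fabs(reference);
    // Check 1: absolute margin. Check 2: margin derived from epsilon and the reference.
    return withinMargin(result, reference, margin)
        || withinMargin(result, reference, epsilon * scale);
}
```

With the defaults, approxEqual(1.13351e-41, 1.13365e-41) is false, because the relative difference of roughly 1e-4 exceeds the assumed epsilon. Passing an absolute margin such as 1e-40 makes the same call succeed, at the price of accepting any pair of values below that margin.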

sbastrakov on 11 Mar 2021

👍2

All 24 comments

From the test result I would say that alpaka::math::exp returns -inf for the case where std::exp returns inf. As we do not see the input value it is hard to tell what went wrong.

BenjaminW3 on 2 Jun 2020

We are using std::numeric_limits<double>::max() as the argument, and std::exp(std::numeric_limits<double>::max()) == inf.
Could this be different for clang CUDA?
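
A minimal host-side sketch of that expectation, assuming an IEEE 754 host and a build without fast math. The function name is hypothetical; it only checks the standard library reference value, not the device result.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true if std::exp overflows to positive infinity
// for the largest finite double, as the reference value in the test assumes.
inline bool expOfMaxIsPositiveInfinity()
{
    double const max = std::numeric_limits<double>::max();
    double const result = std::exp(max);
    // Check the sign explicitly: the failing test reported -inf on the device side.
    return std::isinf(result) && !std::signbit(result);
}
```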

BenjaminW3 on 2 Jun 2020

We should check the specs for this function and also check that we have no cast to float somewhere.
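
As an illustration of what an accidental cast to float could do, here is a small host-only sketch. The helper name is hypothetical, and it is meant for arguments that fit into the float range. It does not explain a wrong sign, it only shows how narrowing changes where exp overflows.

```cpp
#include <cmath>
#include <type_traits>

// std::exp keeps the floating-point type of its argument.
static_assert(std::is_same<decltype(std::exp(1.0)), double>::value, "double in, double out");
static_assert(std::is_same<decltype(std::exp(1.0f)), float>::value, "float in, float out");

// Hypothetical helper: true if exp(x) is finite in double precision
// but overflows to infinity once x is narrowed to float.
// Intended for x within the float range.
inline bool overflowsOnlyInFloat(double x)
{
    float const narrowed = static_cast<float>(x);
    return std::isfinite(std::exp(x)) && std::isinf(std::exp(narrowed));
}
```

For example, overflowsOnlyInFloat(100.0) is true: exp(100) is about 2.7e43, which fits into a double but exceeds the largest float.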

psychocoderHPC on 2 Jun 2020

The container is online. You can simply reproduce the problem with the following script:

docker run --runtime=nvidia -it registry.gitlab.com/hzdr/crp/alpaka-group-container/alpaka-ci:cuda10.1Clang
export CUDA_VISIBLE_DEVICES="1"
git clone https://github.com/alpaka-group/alpaka.git
mkdir alpaka/build && cd alpaka/build
cmake -DBOOST_ROOT=/opt/boost/1.73.0/ -DBOOST_LIBRARYDIR="/opt/boost/1.73.0/lib" -DBoost_USE_STATIC_LIBS=ON -DBoost_USE_MULTITHREADED=ON -DBoost_USE_STATIC_RUNTIME=OFF -DCMAKE_BUILD_TYPE=Release -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON -DALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_FIBERS_ENABLE=OFF -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLE=OFF -DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLE=OFF -DALPAKA_ACC_CPU_BT_OMP4_ENABLE=OFF -DALPAKA_ACC_GPU_CUDA_ENABLE=ON -DALPAKA_ACC_GPU_HIP_ENABLE=OFF -DALPAKA_DEBUG=0 -DALPAKA_CUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++-10 -Dalpaka_BUILD_EXAMPLES=ON -DBUILD_TESTING=ON ..
make -j14
test/unit/math/math

SimeonEhrig on 3 Jun 2020

👍1

@SimeonEhrig is it still not fixed (I mean, I didn't fix it, but maybe some change in the meantime did)? My guess is no one knows, but maybe you do :)

sbastrakov on 9 Mar 2021

Nope, the test is still failing. I updated the container (registry.gitlab.com/hzdr/crp/alpaka-group-container/alpaka-ci-cuda101-clang:1.3) and the cmake configure command:

cmake -DBOOST_ROOT=/opt/boost/1.75.0/ -DBOOST_LIBRARYDIR="/opt/boost/1.75.0/lib" -DBoost_USE_STATIC_LIBS=ON -DBoost_USE_MULTITHREADED=ON -DBoost_USE_STATIC_RUNTIME=OFF -DCMAKE_BUILD_TYPE=Release -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON -DALPAKA_ACC_GPU_CUDA_ENABLE=ON -DALPAKA_DEBUG=0 -DALPAKA_CUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++-11 -Dalpaka_BUILD_EXAMPLES=ON -DBUILD_TESTING=ON ..

SimeonEhrig on 9 Mar 2021

👍1

I've checked the test on the configuration @SimeonEhrig provided. Calling pow(max, max) with max = std::numeric_limits<double>::max() natively in this configuration does indeed result in negative infinity. So it's not an alpaka bug or a wrong type conversion causing it, but the implementation providing this wrong result.

sbastrakov on 9 Mar 2021

A reproducer if we want to submit a bug:

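#include <cstdio>
#include <limits>

__global__ void powTestKernel(double value)
{
    double res = ::pow(value, value);
    printf("Native CUDA kernel result: pow(%lf, %lf) = %lf\n", value, value, res);
}

int main()
{
    auto const m = std::numeric_limits<double>::max();
    powTestKernel<<<1, 1>>>(m);
    // Wait for the kernel so the device-side printf output is flushed.
    cudaDeviceSynchronize();
}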

compiled with clang CUDA as in the docker image outputs
Native CUDA kernel result: pow(179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000, 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000) = -inf
(should be inf, not -inf, and works fine with nvcc)
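
For comparison, a host-side sketch of the reference value and of a check that tells a wrong-sign infinity apart from an ordinary numerical deviation. Both function names are hypothetical, and the sketch assumes an IEEE 754 host without fast math.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true if both values fall into the same floating-point class
// (infinite, NaN, zero, subnormal, normal) and carry the same sign bit.
inline bool sameClassAndSign(double result, double reference)
{
    return std::fpclassify(result) == std::fpclassify(reference)
        && std::signbit(result) == std::signbit(reference);
}

// Host reference for the reproducer's input: pow(max, max) computed by the standard library.
inline double hostPowOfMax()
{
    double const max = std::numeric_limits<double>::max();
    return std::pow(max, max);
}
```

A device result of -inf would fail sameClassAndSign against hostPowOfMax(), while a result of inf would pass.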

sbastrakov on 9 Mar 2021

The only thing I didn't check is whether this is allowed in clang-CUDA with the flags we use. However, I doubt it, as e.g. the CUDA programming guide lists no exceptions for this usage of pow, so it should adhere to some reasonable error margins.

sbastrakov on 9 Mar 2021

The reason why these tests fail is the usage of fast math. I tried to fix it in the past by scanning the CMake alpaka target for the fast math option, removing it for the test, and adding it back afterward so as not to influence other tests. We could also think about disabling fast math for all alpaka tests. The latter would be the easiest.
We have the same issue with HIP-clang.

psychocoderHPC on 11 Mar 2021

It would also be possible to activate fast math for the STL functions we use to create the reference value.

psychocoderHPC on 11 Mar 2021

I tried to fix this issue with the preprocessor directives #pragma push and #pragma pop, but this functionality is not available in clang if you compile for CUDA or HIP.

psychocoderHPC on 11 Mar 2021

I'm not sure if I understand correctly, but fast math is something that is handled by the user. Why is it activated by alpaka?

SimeonEhrig on 11 Mar 2021

Here is a snippet showing how to remove a flag from a target: https://stackoverflow.com/a/49216539

psychocoderHPC on 11 Mar 2021

> I'm not sure if I understand correctly, but fast math is something that is handled by the user. Why is it activated by alpaka?

Yes, the user is free to enable or disable fast math. By default, it is enabled for alpaka.
The problem is that our math tests will use whatever we configure in CMake.
The math tests create a result for a math function on the device and an equivalent result on the host by using STL math functions. In the default case, where fast math is activated, we compare a device result computed with fast math against a host result computed without it. This is the reason for the unit test failure.

psychocoderHPC on 11 Mar 2021

> Yes, the user is free to enable or disable fast math. By default, it is enabled for alpaka.

Isn't this super dangerous? This means any program using alpaka does not have conformant floating point behavior.

bernhardmgruber on 11 Mar 2021

My personal opinion is that the default should be to have fast-math disabled. Having the test fail with fast-math is somewhat expected. We could add a better error message for this specific test (at runtime or compile time) which makes it obvious that fast-math is the reason for the failure.
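
A rough sketch of such a runtime hint, assuming the compiler predefines __FAST_MATH__ when fast math is active (GCC and Clang do this for -ffast-math; other compilers and flags may not). The names fastMathEnabled and annotateFailure are hypothetical and not part of alpaka or Catch2.

```cpp
#include <string>

// __FAST_MATH__ is predefined by GCC and Clang when -ffast-math is active.
// Assumption: the compiler in use defines it; otherwise the hint is never shown.
#ifdef __FAST_MATH__
constexpr bool fastMathEnabled = true;
#else
constexpr bool fastMathEnabled = false;
#endif

// Hypothetical helper: prefix a failure message with a hint when fast math is on.
inline std::string annotateFailure(std::string const& message)
{
    if(fastMathEnabled)
        return "[built with fast math; deviations from the standard library are expected] " + message;
    return message;
}
```

The returned string could be attached to the failing assertion's messages, so that the report itself states that fast math was active.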

BenjaminW3 on 11 Mar 2021

I believe fast math is not the reason for this test failure. According to the CUDA programming guide (section F), double-precision pow is not affected by fast math to begin with. But even for the single-precision one, which is affected, it seems this behaviour should not happen (with single-precision max instead).

sbastrakov on 11 Mar 2021

> My personal opinion is that the default should be to have fast-math disabled. Having the test fail with fast-math is somewhat expected. We could add a better error message for this specific test (at runtime or compile time) which makes it obvious that fast-math is the reason for the failure.

Setting the default to OFF will not solve the problem. If the user enables fast math and executes the tests, they will fail again.
Changing the default will also affect all existing clients of alpaka. I do not see a need to change the default behavior and force all clients to handle it. It is a conventional issue in our unit tests.

psychocoderHPC on 11 Mar 2021

> Yes, the user is free to enable or disable fast math. By default, it is enabled for alpaka.
>
> Isn't this super dangerous? This means any program using alpaka does not have conformant floating point behavior.

Everyone should be aware that fast math will not give the same results on each device.
You can always disable this behavior.
Intel's icc enables fast math with -O3, whereas gcc enables it with -Ofast.

Enabling or disabling fast math is independent of this issue. The math tests should always be executed without fast math to ensure we have reproducible results.

psychocoderHPC on 11 Mar 2021

That is true. However, "not the same" is not the same as "completely wrong".

sbastrakov on 11 Mar 2021

Not sure if https://github.com/alpaka-group/alpaka/pull/1190 touches on this topic too.
You posted the issue with pow here. We need to check the exponent.

psychocoderHPC on 11 Mar 2021

I've tried removing fast-math. In the Docker container described above, that test no longer fails. But there are two math test cases that fail sometimes, i.e. some runs go fully green, and some have a failure in one of those two tests. That's really bad behavior for CI.

sbastrakov on 11 Mar 2021

