Onnxruntime: Test failure on Windows

Created on 27 Jul 2020 · 21Comments · Source: microsoft/onnxruntime

Describe the bug

[ RUN      ] TensorOpTest.CastFromString
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 9.9999997473787516e-05.
i:6, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 9.9999997473787516e-05.
i:7, provider_type: CPUExecutionProvider
[  FAILED  ] TensorOpTest.CastFromString (134 ms)

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win Server 2019
ONNX Runtime installed from (source or binary): source
ONNX Runtime version: 1.4.0
Python version: 3.8.5
Visual Studio version (if applicable): 2019

Screenshots

windows support

Source

xkszltl

All 21 comments

This is a little strange given these tests pass in CI builds (unit tests are run for all builds).

Looking at the test code the expected value should be NaN so not sure why it is -NaN in your build. Any chance something is redefining the NAN macro in your setup?

https://github.com/microsoft/onnxruntime/blob/222fd08f20765103e3bfbaf5de07b7f25373d516/onnxruntime/test/providers/cpu/tensor/tensor_op_test.cc#L294-L298

The Cast implementation uses std::stof in this case to convert the input so wouldn't be affected by the NAN macro

https://github.com/microsoft/onnxruntime/blob/222fd08f20765103e3bfbaf5de07b7f25373d516/onnxruntime/core/providers/cpu/tensor/cast_op.cc#L108

skottmckay on 27 Jul 2020

No I don't have any macro redefinition for NAN.

Also I noticed there's nan-specific check.
Why it is not captured before going through line 153?
https://github.com/microsoft/onnxruntime/blob/1f69a58105f320e9b79008698b4275fa3f113f2d/onnxruntime/test/providers/provider_test_utils.cc#L147

xkszltl on 27 Jul 2020

Not sure why the nan-specific check isn't being satisfied. It is on my machine, and expected is actually -nan(ind) . So the output from your run for expected and actual match what I see, but the test passes for me as the std::isnan check on line 147 is satisfied.

expected,8
0x000001ecf0c06920 {-inf, inf, 0.976761103, 0.282806963, -0.120196559, 5.00000000, -nan(ind), -nan(ind)}
    [0]: -inf
    [1]: inf
    [2]: 0.976761103
    [3]: 0.282806963
    [4]: -0.120196559
    [5]: 5.00000000
    [6]: -nan(ind)
    [7]: -nan(ind)
output,8
0x000001ecf2d9b080 {-inf, inf, 0.976761103, 0.282806963, -0.120196559, 5.00000000, nan, nan}
    [0]: -inf
    [1]: inf
    [2]: 0.976761103
    [3]: 0.282806963
    [4]: -0.120196559
    [5]: 5.00000000
    [6]: nan
    [7]: nan

Are you able to step into that call in a debugger? On Windows 10 it is defined in corecrt_math.h so wondering if there's something different in your setup.

    template <class _Ty>
    _Check_return_ inline bool isnan(_In_ _Ty _X) throw()
    {
        return fpclassify(_X) == FP_NAN;
    }

skottmckay on 27 Jul 2020

Maybe it is some strange compiler optimization?
This is how we compile it
https://github.com/xkszltl/Roaster/blob/8f7339d9d9cad1d60735d982035e95f4d867f74a/win/pkgs/ort.ps1#L150

We do have extra /GL and /LTCG but I think the negative sign is already strange in the first place.

xkszltl on 28 Jul 2020

Any chance you're using /fp:fast?

https://docs.microsoft.com/en-us/cpp/build/reference/fp-specify-floating-point-behavior?view=vs-2019

skottmckay on 28 Jul 2020

I don't add that explicitly.
As you can see in the script, here's all I add:

-DCMAKE_C_FLAGS="/GL /MP /Zi /arch:AVX2"
-DCMAKE_CXX_FLAGS="/EHsc /GL /MP /Zi /arch:AVX2"
-DCMAKE_EXE_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_SHARED_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_STATIC_LINKER_FLAGS="/LTCG:incremental"

I can try adding an extra /fp:precise at the end if you want.
Maybe /arch:AVX2 is more suspicious, but ort has official option for that as well.

xkszltl on 28 Jul 2020

Same error when I append /fp:precise:

PS D:\roaster-scratch\onnxruntime\build> .\Release\onnxruntime_test_all.exe --gtest_filter='*.CastFromString'
Note: Google Test filter = *.CastFromString
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TensorOpTest
[ RUN      ] TensorOpTest.CastFromString
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 0.004999999888241291.
i:6, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\test\providers\provider_test_utils.cc(153): error: The difference between expected[i] and output[i] is nan, which exceeds threshold, where
expected[i] evaluates to -nan(ind),
output[i] evaluates to nan, and
threshold evaluates to 0.004999999888241291.
i:7, provider_type: CPUExecutionProvider
D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:117 onnxruntime::CudaCall CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=1129250816 ; hostname=Roaster-Win ; expr=cudaSetDevice(device_id_);


Provider:CUDAExecutionProvider
unknown file: error: C++ exception with description "D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:123 onnxruntime::CudaCall D:\roaster-scratch\onnxruntime\onnxruntime\core\providers\cuda\cuda_call.cc:117 onnxruntime::CudaCall CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=1129250816 ; hostname=Roaster-Win ; expr=cudaSetDevice(device_id_);

" thrown in the test body.
[  FAILED  ] TensorOpTest.CastFromString (26 ms)
[----------] 1 test from TensorOpTest (28 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (43 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TensorOpTest.CastFromString

 1 FAILED TEST

xkszltl on 28 Jul 2020

I'm not sure why std::isnan is failing to return true. Both -nan(ind) and nan should be considered to be a nan. https://stackoverflow.com/questions/61253965/nan-differences-stdnan-vs-quiet-nan-vs-macro-nan has some more details if that's helpful.

What happens if you build without any of the extra flags?

skottmckay on 29 Jul 2020

So far:

Removing /arch:AVX2 does not help.
Adding /fp:precise does not help.

Trying to remove /GL and /LTCG:incremental.

xkszltl on 29 Jul 2020

Quote from the stackoverflow answer:

By the way, it's like min/max macros, even if one had the bad idea to define them, don't use them

Love that.

Seems people also found their msvc has negative NAN.

xkszltl on 29 Jul 2020

Removing /GL and /LTCG:incremental works.
That usually indicates the existence of UB.

xkszltl on 29 Jul 2020

👍1

@skottmckay
Have you found the cause when you close the ticket?
LTO should not change the behavior of that if (isnan()), unless there's bug in code making compiler think it's always false.

xkszltl on 3 Aug 2020

Multiple of our CI builds enable LTO (all the release package builds do) and we don't see any issues there. Given that, what would be actionable from an onnxruntime perspective?

skottmckay on 3 Aug 2020

This is a strong (meaning I don't have other clue) indication about either msvc or ort was wrong.
Were you able to repro with the following on your side?

-DCMAKE_C_FLAGS="/GL /MP /Zi"
-DCMAKE_CXX_FLAGS="/EHsc /GL /MP /Zi"
-DCMAKE_EXE_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_SHARED_LINKER_FLAGS="/DEBUG:FASTLINK /LTCG:incremental"
-DCMAKE_STATIC_LINKER_FLAGS="/LTCG:incremental"

We had trouble enabling ort's LTO flag due to conflict with protobuf used elsewhere in our system.
It used to be https://github.com/microsoft/onnxruntime/issues/902, and now it's banned.

++ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc-8 -DCMAKE_CXX_COMPILER=g++-8 -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache '-DCMAKE_C_FLAGS=-fdebug-prefix-map='\''/tmp/scratch'\''
='\''/usr/local/src'\'' -g -march=haswell -mtune=generic' '-DCMAKE_CXX_FLAGS=-fdebug-prefix-map='\''/tmp/scratch'\''='\''/usr/local/src'\'' -g -march=haswell -mtune=generic' -DCMAKE_INSTALL_PREFIX=/tmp/scratch/onnxruntime/install.yUpv0xXdzi/root/usr/loca
l -DCMAKE_POLICY_DEFAULT_CMP0060=NEW -DCMAKE_VERBOSE_MAKEFILE=ON -DPython_ADDITIONAL_VERSIONS=3.6 -Deigen_SOURCE_PATH=/usr/local/include/eigen3 -Donnxruntime_BUILD_FOR_NATIVE_MACHINE=OFF -Donnxruntime_BUILD_SHARED_LIB=ON -Donnxruntime_CUDNN_HOME=/usr/lib
64 -Donnxruntime_ENABLE_LANGUAGE_INTEROP_OPS=ON -Donnxruntime_ENABLE_LTO=ON -Donnxruntime_ENABLE_PYTHON=ON -Donnxruntime_PREFER_SYSTEM_LIB=ON -Donnxruntime_RUN_ONNX_TESTS=ON -Donnxruntime_USE_CUDA=ON -Donnxruntime_USE_DNNL=ON -Donnxruntime_USE_EIGEN_FOR_
BLAS=ON -Donnxruntime_USE_FULL_PROTOBUF=ON -Donnxruntime_USE_JEMALLOC=OFF -Donnxruntime_USE_LLVM=ON -Donnxruntime_USE_MKLML=OFF -Donnxruntime_USE_NGRAPH=OFF -Donnxruntime_USE_NUPHAR=OFF -Donnxruntime_USE_OPENBLAS=OFF -Donnxruntime_USE_OPENMP=OFF -Donnxru
ntime_USE_PREINSTALLED_EIGEN=OFF -Donnxruntime_USE_TENSORRT=OFF -Donnxruntime_USE_TVM=OFF -GNinja ../cmake
-- The C compiler identification is GNU 8.4.0
-- The CXX compiler identification is GNU 8.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc-8 - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++-8 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found PythonInterp: /usr/bin/python3 (found suitable version "3.6.9", minimum required is "3.5") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.6m.so (found suitable version "3.6.9", minimum required is "3.5") 
-- Found Protobuf: /usr/local/lib/libprotobuf.so;-lpthread (found version "3.12.4") 
Use protobuf from preinstalled system lib
CMake Warning at CMakeLists.txt:421 (message):
  Please enable Protobuf_USE_STATIC_LIBS


CMake Error at CMakeLists.txt:452 (target_link_options):
  Cannot specify link options for target "protoc" which is not built by this
  project.


# date: USE_SYSTEM_TZ_DB ON
# date: USE_TZ_DB_IN_DOT OFF
# date: BUILD_SHARED_LIBS OFF
# date: ENABLE_DATE_TESTING OFF
CMake Warning at CMakeLists.txt:497 (find_package):                                                                                                                                                                                                   [0/1878]
  By not providing "Findre2.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "re2", but
  CMake did not find one.

  Could not find a package configuration file provided by "re2" with any of
  the following names:

    re2Config.cmake
    re2-config.cmake

  Add the installation prefix of "re2" to CMAKE_PREFIX_PATH or set "re2_DIR"
  to a directory containing one of the above files.  If "re2" provides a
  separate development package or SDK, be sure it has been installed.


-- Performing Test HAS_UNUSED_BUT_SET_VARIABLE
-- Performing Test HAS_UNUSED_BUT_SET_VARIABLE - Success
-- Performing Test HAS_UNUSED_PARAMETER
-- Performing Test HAS_UNUSED_PARAMETER - Success
-- Performing Test HAS_UNUSED_VARIABLE
-- Performing Test HAS_UNUSED_VARIABLE - Success
-- Performing Test HAS_CAST_FUNCTION_TYPE
-- Performing Test HAS_CAST_FUNCTION_TYPE - Success
-- Performing Test HAS_PARENTHESES
-- Performing Test HAS_PARENTHESES - Success
-- Performing Test HAS_USELESS_CAST
-- Performing Test HAS_USELESS_CAST - Success
-- Performing Test HAS_NONNULL_COMPARE
-- Performing Test HAS_NONNULL_COMPARE - Success
-- Performing Test HAS_TAUTOLOGICAL_POINTER_COMPARE
-- Performing Test HAS_TAUTOLOGICAL_POINTER_COMPARE - Failed
-- Performing Test HAS_CATCH_VALUE
-- Performing Test HAS_CATCH_VALUE - Success
-- Performing Test HAS_MISSING_BRACES
-- Performing Test HAS_MISSING_BRACES - Success
-- Performing Test HAS_IGNORED_ATTRIBUTES
-- Performing Test HAS_IGNORED_ATTRIBUTES - Success
-- Performing Test HAS_DEPRECATED_COPY
-- Performing Test HAS_DEPRECATED_COPY - Failed
-- Performing Test HAS_DEPRECATED_DECLARATIONS
-- Performing Test HAS_DEPRECATED_DECLARATIONS - Success
-- Performing Test HAS_CLASS_MEMACCESS
-- Performing Test HAS_CLASS_MEMACCESS - Success
-- Performing Test HAS_MAYBE_UNINITIALIZED
-- Performing Test HAS_MAYBE_UNINITIALIZED - Success
-- The CUDA compiler identification is NVIDIA 11.0.194
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CMAKE_CUDA_COMPILER_VERSION: 11.0.194
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/gcc-8
-- Performing Test HAS_AVX512F
-- Performing Test HAS_AVX512F - Success
-- Performing Test COMPILES_AVX512F
-- Performing Test COMPILES_AVX512F - Success
-- Performing Test HAS_AVX512CORE
-- Performing Test HAS_AVX512CORE - Success
-- Performing Test COMPILES_AVX512CORE
-- Performing Test COMPILES_AVX512CORE - Success
-- Found PythonInterp: /usr/bin/python3 (found version "3.6.9") 
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.6m.so
-- Found NumPy: /usr/local/lib/python3.6/dist-packages/numpy/core/include (found version "1.19.1") 
-- NumPy ver. 1.19.1 found (include: /usr/local/lib/python3.6/dist-packages/numpy/core/include)
-- Configuring incomplete, errors occurred!
See also "/tmp/scratch/onnxruntime/build/CMakeFiles/CMakeOutput.log".
See also "/tmp/scratch/onnxruntime/build/CMakeFiles/CMakeError.log".

xkszltl on 3 Aug 2020

I was able to repro.

You could try this change to workaround it: https://github.com/microsoft/onnxruntime/compare/skottmckay/WorkaroundLinkerBug

Or alternatively just using /LTCG instead of /LTCG:incremental works.

skottmckay on 4 Aug 2020

👍1

Great!
Do you find this VS bug somewhere or have an ongoing discussion with VS team?

xkszltl on 4 Aug 2020

No - I looked at the generated assembly and tried a few things to make it behave. If you have time to create a minimal repro of the issue to file a VS bug please do.

If the change works for you I'll create a PR for it.

skottmckay on 4 Aug 2020

Will test and that you know.
You don't need to wait for the result before merging as the PR will fix an issue you repro regardless of my result.

xkszltl on 4 Aug 2020

It fixes it on my machine, but as I don't know the internal details of what the LTCG issue is I don't know if re-ordering the std::isinf and std::isnan will fix it with all MSVC toolchains. As such, additional data points showing it works would be very helpful.

skottmckay on 5 Aug 2020

Of course. Saying just because I'm working on tight schedule so maybe there are some delay.

xkszltl on 5 Aug 2020

Works for me

xkszltl on 5 Aug 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings