Hi,
I am trying to use the OpenMP4 backend with GCC 7.4 on Ubuntu 18.04, but it is not working.
I have modified example/vectorAdd/src/vectorAdd.cpp to use AccCpuOmp4:
diff --git a/example/vectorAdd/src/vectorAdd.cpp b/example/vectorAdd/src/vectorAdd.cpp
index b14f0198ba49..d9d47b5819fd 100644
--- a/example/vectorAdd/src/vectorAdd.cpp
+++ b/example/vectorAdd/src/vectorAdd.cpp
@@ -95,7 +95,7 @@ auto main()
// - AccCpuOmp4
// - AccCpuTbbBlocks
// - AccCpuSerial
- using Acc = alpaka::acc::AccCpuSerial<Dim, Idx>;
+ using Acc = alpaka::acc::AccCpuOmp4<Dim, Idx>;
using DevAcc = alpaka::dev::Dev<Acc>;
using PltfAcc = alpaka::pltf::Pltf<DevAcc>;
Then I tried building it with
mkdir example/vectorAdd/build
cd example/vectorAdd/build
cmake ..
make
cmake seems happy enough:
-- The C compiler identification is GNU 7.4.0
-- The CXX compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found TBB: /usr/include (found suitable version "2019.0", minimum required is "2.2")
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "10.2", minimum required is "9.0")
-- ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED
-- ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED
-- ALPAKA_ACC_CPU_BT_OMP4_ENABLED
-- ALPAKA_ACC_GPU_CUDA_ENABLED
-- Found alpaka: /home/fwyzard/src/alpaka/alpaka/include (found version "0.5.0")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/fwyzard/src/alpaka/alpaka/example/vectorAdd/build
but make fails with
[ 50%] Building NVCC (Device) object CMakeFiles/vectorAdd.dir/src/vectorAdd_generated_vectorAdd.cpp.o
Scanning dependencies of target vectorAdd
[100%] Linking CXX executable vectorAdd
lto1: internal compiler error: bytecode stream: expected tag fixed_point_type instead of error_mark
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.
mkoffload: fatal error: /usr/bin/x86_64-linux-gnu-accel-nvptx-none-gcc-7 returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /usr/lib/gcc/x86_64-linux-gnu/7//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
CMakeFiles/vectorAdd.dir/build.make:99: recipe for target 'vectorAdd' failed
make[2]: *** [vectorAdd] Error 1
CMakeFiles/Makefile2:93: recipe for target 'CMakeFiles/vectorAdd.dir/all' failed
make[1]: *** [CMakeFiles/vectorAdd.dir/all] Error 2
Makefile:100: recipe for target 'all' failed
make: *** [all] Error 2
It looks like the compiler is trying to offload to the nvptx backend, and fails.
How can I use the OpenMP 4 backend to run the code on the CPU?
Note that I get similar errors even if I disable CUDA support with cmake -DALPAKA_ACC_GPU_CUDA_ENABLE=false, and/or if I use gcc 8 or gcc 9.
@fwyzard Thanks for the report, we will have a look into it.
Note: we are currently working on a full refactoring to bring the OMP4 backend to OMP5. You can find the ongoing work here
Thanks for the pointer, passing -DCMAKE_CXX_FLAGS="-fno-lto" to cmake seems to help:
Scanning dependencies of target vectorAdd
[ 50%] Building CXX object CMakeFiles/vectorAdd.dir/src/vectorAdd.cpp.o
[100%] Linking CXX executable vectorAdd
[100%] Built target vectorAdd
and the example gives the correct results
Execution results correct!
It also works in my test application:
$ ./test-alpaka
Got 53376 for cabling, wordCounter 36328
Running with the blocking serial CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 465.53 us
Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 256.43 us
Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 214.22 us
Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 629.05 us
though it seems to have the worst performance :-(
Should I try a different partitioning into blocks, threads and elements?
By the way, -foffload=disable also seems to work fine, and does not prevent using LTO.
@fwyzard thanks for reporting your results, interesting. I think your configuration is fine: 1 thread per block is reasonable on CPUs, and the number of elements is controlled by you as the client application developer; alpaka does not interfere there. I assume you have some loop over the elements inside the kernel?
Yes:
See https://github.com/fwyzard/pixel-standalone/blob/master/rawtodigi_alpaka.cc#L376-L385 .
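For readers who cannot follow the link, a loop of this kind can be sketched in plain C++ as below. This is a minimal stand-alone illustration and not the code from the linked file: `WorkDiv`, `blocksNeeded`, `processChunk` and `launchSerial` are hypothetical names, the doubling is a placeholder for the real per-element work, and the serial "launch" only stands in for what an accelerator back-end would do.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical work division, mirroring the "blocks / threads / elements" triple printed above.
struct WorkDiv
{
    std::size_t blocks;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Smallest number of blocks such that blocks * threads * elements covers `size` items.
inline std::size_t blocksNeeded(std::size_t size, std::size_t threadsPerBlock, std::size_t elementsPerThread)
{
    std::size_t const perBlock = threadsPerBlock * elementsPerThread;
    return (size + perBlock - 1) / perBlock;
}

// What one thread does: walk its own contiguous chunk of elements, clamped to the input size.
// `globalThreadIdx` stands in for the linear thread index the accelerator would provide.
inline void processChunk(
    std::size_t globalThreadIdx,
    WorkDiv const& wd,
    std::vector<int> const& in,
    std::vector<int>& out)
{
    std::size_t const first = globalThreadIdx * wd.elementsPerThread;
    std::size_t const last = std::min(first + wd.elementsPerThread, in.size());
    for(std::size_t i = first; i < last; ++i)
        out[i] = 2 * in[i]; // placeholder for the real per-element work
}

// Serial stand-in for a kernel launch: run every thread of every block one after another.
inline void launchSerial(WorkDiv const& wd, std::vector<int> const& in, std::vector<int>& out)
{
    for(std::size_t t = 0; t < wd.blocks * wd.threadsPerBlock; ++t)
        processChunk(t, wd, in, out);
}
```

With 36328 items, 1 thread per block and 512 elements per thread, `blocksNeeded` gives 71, which matches the work division printed above. Whether the test application computes it this way is an assumption.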
Thanks for the link, looks good. It is unclear why OpenMP 4 underperforms.
The current OpenMP 4 backend has its offloading disabled, so it forces the compiler to generate host code.
With one thread per block, I do not know why the CPU performance would be bad as far as the backend is concerned. It actually shares most of its code with the OMP2 backend. Only the target bit is different, and it is forced to run on the host. My guess is that the performance is a compiler issue: the compiler may be better at generating OMP2 code than OMP4 code.
To compile for offloading to nvptx with gcc, you need the -fno-lto flag. Maybe there is simply no link-time optimization in nvlink (which mkoffload probably calls internally when offloading), so it fails when asked to perform LTO.
OK, I'll stick to -fno-lto then.
Does the offloading to NVPTX work, then?
How can I test if anything is being offloaded?
No, offloading is not supported in the current version of the backend; it is host only. Compiling with and without -foffload=disable should give you the same results, except that without it mkoffload will be called pointlessly.
The new version of the backend @psychocoderHPC mentioned supports offloading.
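As a general way to check whether a target region is offloaded, independent of alpaka, the OpenMP runtime can be asked directly. The sketch below uses the standard `omp_is_initial_device()` and `omp_get_num_devices()` calls from OpenMP 4.0; the function names `targetRegionRanOnHost` and `offloadDeviceCount` are made up for this example, and the `_OPENMP` guards are only there so that the code also builds without -fopenmp.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of offload devices the OpenMP runtime reports (0 when built without OpenMP).
inline int offloadDeviceCount()
{
#ifdef _OPENMP
    return omp_get_num_devices();
#else
    return 0;
#endif
}

// Returns true when the body of a target region was executed on the host.
inline bool targetRegionRanOnHost()
{
    int onHost = 1; // stays 1 when the code is built without OpenMP
#ifdef _OPENMP
    // The flag is mapped back to the host so that the result can be read after the region.
    #pragma omp target map(tofrom: onHost)
    {
        onHost = omp_is_initial_device();
    }
#endif
    return onHost != 0;
}
```

With a host-only build, for example with -foffload=disable, `targetRegionRanOnHost()` is expected to return true. A false result would indicate that the region really ran on a device.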
Understood, thanks.
By the way, it looks like cupla does not switch the threads and elements when using the OpenMP 4 backend, as it does for the TBB and OpenMP 2.0 blocks backends:
Running with the blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 471.79 us
Running with the non-blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 575.03 us
Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 240.64 us
Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 186.92 us
Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128157 us
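To make the expected behaviour concrete, the switch can be sketched as below. This is only an illustration of the idea, not cupla's actual code: `WorkDiv` and `swapThreadsAndElements` are hypothetical names. The CUDA-style request of 512 threads per block becomes 1 thread per block that loops over 512 elements.

```cpp
#include <cstddef>

// Hypothetical work division: blocks per grid, threads per block, elements per thread.
struct WorkDiv
{
    std::size_t blocks;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// For back-ends that run best with a single thread per block:
// exchange the thread count and the element count, leaving the block count untouched.
inline WorkDiv swapThreadsAndElements(WorkDiv const& wd)
{
    return WorkDiv{wd.blocks, wd.elementsPerThread, wd.threadsPerBlock};
}

// Total number of items covered by a work division; unchanged by the swap.
inline std::size_t totalItems(WorkDiv const& wd)
{
    return wd.blocks * wd.threadsPerBlock * wd.elementsPerThread;
}
```

Applied to the request above (71 blocks, 512 threads, 1 element), the swap yields 71 blocks, 1 thread and 512 elements, which is the work division used in the earlier alpaka test.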
Thanks for reporting this and creating a cupla issue for it, @fwyzard.