Alpaka: how to use the OpenMP 4.0 backend?

Created on 18 Feb 2020 · 14 comments · Source: alpaka-group/alpaka

Hi,
I am trying to use the OpenMP4 backend with GCC 7.4 on Ubuntu 18.04, but it is not working.

I have modified example/vectorAdd/src/vectorAdd.cpp to use AccCpuOmp4:

diff --git a/example/vectorAdd/src/vectorAdd.cpp b/example/vectorAdd/src/vectorAdd.cpp
index b14f0198ba49..d9d47b5819fd 100644
--- a/example/vectorAdd/src/vectorAdd.cpp
+++ b/example/vectorAdd/src/vectorAdd.cpp
@@ -95,7 +95,7 @@ auto main()
     // - AccCpuOmp4
     // - AccCpuTbbBlocks
     // - AccCpuSerial
-    using Acc = alpaka::acc::AccCpuSerial<Dim, Idx>;
+    using Acc = alpaka::acc::AccCpuOmp4<Dim, Idx>;
     using DevAcc = alpaka::dev::Dev<Acc>;
     using PltfAcc = alpaka::pltf::Pltf<DevAcc>;

Then I tried building it with

mkdir example/vectorAdd/build
cd example/vectorAdd/build
cmake ..
make

cmake seems happy enough:

-- The C compiler identification is GNU 7.4.0
-- The CXX compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found TBB: /usr/include (found suitable version "2019.0", minimum required is "2.2")  
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /usr/local/cuda (found suitable version "10.2", minimum required is "9.0") 
-- ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED
-- ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED
-- ALPAKA_ACC_CPU_BT_OMP4_ENABLED
-- ALPAKA_ACC_GPU_CUDA_ENABLED
-- Found alpaka: /home/fwyzard/src/alpaka/alpaka/include (found version "0.5.0") 
-- Configuring done
-- Generating done
-- Build files have been written to: /home/fwyzard/src/alpaka/alpaka/example/vectorAdd/build

but make fails with

[ 50%] Building NVCC (Device) object CMakeFiles/vectorAdd.dir/src/vectorAdd_generated_vectorAdd.cpp.o
Scanning dependencies of target vectorAdd
[100%] Linking CXX executable vectorAdd
lto1: internal compiler error: bytecode stream: expected tag fixed_point_type instead of error_mark
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.
mkoffload: fatal error: /usr/bin/x86_64-linux-gnu-accel-nvptx-none-gcc-7 returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /usr/lib/gcc/x86_64-linux-gnu/7//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
CMakeFiles/vectorAdd.dir/build.make:99: recipe for target 'vectorAdd' failed
make[2]: *** [vectorAdd] Error 1
CMakeFiles/Makefile2:93: recipe for target 'CMakeFiles/vectorAdd.dir/all' failed
make[1]: *** [CMakeFiles/vectorAdd.dir/all] Error 2
Makefile:100: recipe for target 'all' failed
make: *** [all] Error 2

It looks like the compiler is trying to offload to the nvptx backend, and fails.

How do I use the OpenMP 4 backend to run the code on the CPU?

OpenMP Bug

fwyzard

All 14 comments

Note that I get similar errors even if I disable CUDA support with cmake -DALPAKA_ACC_GPU_CUDA_ENABLE=false, and/or if I use gcc 8 or gcc 9.

fwyzard on 18 Feb 2020

@fwyzard Thanks for the report, we will have a look into it.

Note: we are currently working on a full refactoring to bring the OMP4 backend to OMP5. You can find the ongoing work here

psychocoderHPC on 18 Feb 2020

Thanks for the pointer, passing -DCMAKE_CXX_FLAGS="-fno-lto" to cmake seems to help:

Scanning dependencies of target vectorAdd
[ 50%] Building CXX object CMakeFiles/vectorAdd.dir/src/vectorAdd.cpp.o
[100%] Linking CXX executable vectorAdd
[100%] Built target vectorAdd

and the example gives the correct results

Execution results correct!

fwyzard on 18 Feb 2020

It also works in my test application:

$ ./test-alpaka
Got 53376 for cabling, wordCounter 36328

Running with the blocking serial CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 465.53 us

Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 256.43 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 214.22 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 629.05 us

though it seems to have the worst performance :-(

Should I try a different partitioning into blocks, threads and elements?

fwyzard on 18 Feb 2020

By the way, -foffload=disable also seems to work fine, and does not prevent using LTO.

fwyzard on 18 Feb 2020

@fwyzard thanks for reporting your results, interesting. I think your configuration is fine: 1 thread per block is reasonable on CPUs, and the number of elements is controlled by you as the client application developer; alpaka does not really interfere there. I assume you have some loop over the elements inside the kernel?

sbastrakov on 18 Feb 2020

Yes:

the inner loop runs over the consecutive elements;
the outer loop is strided by the whole grid size (blocks * threads * elements) to cover the case where not enough blocks are launched.

See https://github.com/fwyzard/pixel-standalone/blob/master/rawtodigi_alpaka.cc#L376-L385 .
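
For readers who do not want to follow the link, the loop structure described above can be sketched in plain C++ roughly as follows. This is not the code from rawtodigi_alpaka.cc and it uses no alpaka API: `process_strided` and its parameters are hypothetical names, and the blocks and threads are simulated here with ordinary serial loops.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical illustration, not alpaka code: every (block, thread) pair
// handles `elements` consecutive indices, then jumps ahead by the whole
// grid size (blocks * threads * elements) until all n items are covered.
template <typename F>
void process_strided(std::size_t blocks, std::size_t threads,
                     std::size_t elements, std::size_t n, F&& body)
{
    assert(blocks > 0 && threads > 0 && elements > 0);
    const std::size_t gridSize = blocks * threads * elements;

    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < threads; ++t) {
            // First index owned by this simulated thread.
            const std::size_t first = (b * threads + t) * elements;
            // Outer loop: strided by the whole grid size.
            for (std::size_t start = first; start < n; start += gridSize) {
                const std::size_t end = std::min(start + elements, n);
                // Inner loop: consecutive elements.
                for (std::size_t i = start; i < end; ++i) {
                    body(i);
                }
            }
        }
    }
}
```

Calling `process_strided(71, 1, 512, n, f)` would mirror the work division printed in the output above; each index below `n` is visited exactly once even when `n` exceeds the grid size.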

fwyzard on 18 Feb 2020

Thanks for the link, looks good. It is unclear why OpenMP 4 underperforms.

sbastrakov on 18 Feb 2020

The current OpenMP 4 backend has offloading disabled: it forces the compiler to generate host code.

With one thread per block, I do not know why the CPU performance would be bad as far as the backend is concerned. It actually shares most of its code with the OMP2 backend. Only the target bit is different, but it is forced to run on the host. I would guess that the performance is a compiler issue: the compiler may be better at generating OMP2 code than OMP4 code.

To compile for offloading to nvptx with gcc, you need the -fno-lto flag; maybe there is just no link-time optimization in nvlink (which is something mkoffload probably calls internally when offloading). It thus fails when asked to perform LTO.

jkelling on 18 Feb 2020

OK, I'll stick to -fno-lto then.

> To compile for offloading to nvptx with gcc, you need the -fno-lto flag; maybe there is just no link-time optimization in nvlink (which is something mkoffload probably calls internally when offloading). It thus fails when asked to perform LTO.

Does the offloading to NVPTX work, then?
How can I test whether anything is being offloaded?

fwyzard on 18 Feb 2020

No, offloading is not supported in the current version of the backend; it is host-only. Compiling with and without -foffload=disable should give you the same results, except that without it mkoffload will be called pointlessly.

The new version of the backend @psychocoderHPC mentioned supports offloading.
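
As a general way to check where an OpenMP target region ran, independent of alpaka, something like the following sketch may help. It relies only on the standard OpenMP 4.0 runtime calls `omp_is_initial_device()` and `omp_get_num_devices()`; the function name `target_region_ran_on_host` is made up for this example, and it says nothing about what the alpaka backend does internally.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns true if an OpenMP target region executed on
// the host (the "initial device"). Without OpenMP support the pragma is not
// compiled at all and the answer is trivially true.
inline bool target_region_ran_on_host()
{
    int onHost = 1;
#ifdef _OPENMP
    #pragma omp target map(from: onHost)
    {
        // Non-zero when the enclosing region runs on the host.
        onHost = omp_is_initial_device();
    }
    // With no offload devices visible, the region can only have run on the host.
    assert(omp_get_num_devices() > 0 || onHost != 0);
#endif
    return onHost != 0;
}
```

Built with -fopenmp, a `true` result means the region ran on the host and `false` means it was offloaded. With -foffload=disable I would expect it to report the host, though I have not checked this against the setup in this thread.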

jkelling on 18 Feb 2020

Understood, thanks.

fwyzard on 18 Feb 2020

By the way, it looks like cupla does not switch the threads and elements when using the OpenMP 4 backend, as it does for the TBB and OpenMP 2.0 blocks backends:

Running with the blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 471.79 us

Running with the non-blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 575.03 us

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 240.64 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 186.92 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128157 us
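
To make the swap concrete, here is a minimal sketch, assuming that "switching" means exchanging the threads-per-block and elements-per-thread counts (which would turn the 71 x 512 x 1 division above into the 71 x 1 x 512 division from the earlier run). `WorkDiv` and `swap_threads_and_elements` are hypothetical names, not cupla or alpaka types.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical work-division triple; not a cupla or alpaka type.
struct WorkDiv {
    std::size_t blocks;    // blocks per grid
    std::size_t threads;   // threads per block
    std::size_t elements;  // elements per thread
};

// Exchange threads and elements so that a block-parallel CPU backend runs
// one thread per block and loops over the former "threads" as elements.
// The total number of items covered by the grid stays the same.
inline WorkDiv swap_threads_and_elements(WorkDiv wd)
{
    assert(wd.blocks > 0 && wd.threads > 0 && wd.elements > 0);
    std::swap(wd.threads, wd.elements);
    return wd;
}
```

For example, `swap_threads_and_elements({71, 512, 1})` yields 71 blocks, 1 thread per block and 512 elements per thread.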

fwyzard on 18 Feb 2020

Thanks for reporting this and creating a cupla issue for it, @fwyzard.

sbastrakov on 19 Feb 2020
