Trilinos: Kokkos/Trilinos: Trouble building simple example with lambda with MPI+CUDA build

Created on 19 Sep 2019  路  2Comments  路  Source: trilinos/Trilinos

I'm on master of Trilinos as of now. I'm on Debian with nvcc 9.2.148, g++/gcc 8.3.0, and OpenMPI. Trying to use a build of Trilinos that supports OpenMP, CUDA, and/or OpenMPI.

I'm trying to run a simple example using Kokkos with a lambda like

  Kokkos::parallel_for(15, [=](const int i) {
    std::cout << "Hello from i = " << i << std::endl;
  });

However this gives the following error:

/home/foo/trilinos-install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(451): error: The closure type for a lambda ("lambda [](int)->void", defined at /home/foo/code/main.cpp:10) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --expt-extended-lambda is specified

For the life of me I've tried setting --expt-extended-lambda in many ways but nothing seems to resolve this error. AFAICT the flag truly is getting passed to the compiler, but still running into this issue. Any ideas how to debug this and/or determine if it's a user error or real bug with Trilinos/Kokkos? Does anyone have a working setup that builds an example program against a Trilinos that supports OpenMP, CUDA, and OpenMPI?

The compile command I end up running looks like

nvcc --expt-relaxed-constexpr --expt-extended-lambda --expt-relaxed-constexpr --expt-extended-lambda -ccbin g++ -arch=sm_35 -I/home/foo/code/. -I/home/foo/trilinos-install/include -O3 -DNDEBUG -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -I/usr/lib/x86_64-linux-gnu/openmpi/include -Xcompiler -std=gnu++14,-pthread -x cu /home/foo/code/main.cpp -c -o CMakeFiles/code/main.cpp.o

though I've tried a number of variants as well, all with the same result.

My Trilinos install was configured using

export NVCC_WRAPPER_DEFAULT_COMPILER=mpicxx && cmake ../Trilinos -DTrilinos_ENABLE_Kokkos=ON -DTrilinos_ENABLE_OpenMP=ON -DTrilinos_ENABLE_Pthread=ON -DTPL_ENABLE_CUDA=ON -DTrilinos_ENABLE_Tpetra=ON -DTrilinos_ENABLE_Belos=ON -DTrilinos_ENABLE_Amesos2=ON -DTrilinos_ENABLE_MueLu=ON -DTpetra_INST_SERIAL=ON -DTpetra_INST_OPENMP=ON -DKOKKOS_ENABLE_OPENMP=ON -DKokkos_ENABLE_Cuda_Lambda=ON -DTPL_ENABLE_MPI=ON -DCMAKE_CXX_FLAGS="--expt-extended-lambda" -DKokkos_ENABLE_Cuda_UVM=ON -DKOKKOS_ENABLE_CUDA_UVM=ON -DKokkos_ENABLE_CXX11_DISPATCH_LAMBDA=ON -DKokkos_ENABLE_Cuda_Lambda=ON -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=/home/foo/Trilinos/packages/kokkos/bin/nvcc_wrapper -DCMAKE_INSTALL_PREFIX=/home/foo/trilinos-install

Thanks in advance for any help!

question

Most helpful comment

@DABH A couple code changes are necessary to get this running when CUDA is enabled:

  1. replace the [=] capture with the macro KOKKOS_LAMBDA.
  2. replace std::cout usage with a printf

The KOKKOS_LAMBDA macro will provide markings necessary to compile the kernel so that it is capable of executing on the GPU. Lambda support must also be enabled in your CMake configuration options (I post a starting cmake line below).

Since you are providing an iteration range, 15, to the parallel_for it is compiling the code for the default execution space - you are compiling with CUDA enabled so this is the default execution space. std::cout is not supported within CUDA kernels AFAIK and should be replaced with printf statements. If you did intend for this to execute on the host without making changes, then a policy making an explicit request for a host execution space must be supplied, e.g. use of a RangePolicy like so:

  Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0,15), [=](const int i) {
    std::cout << "Hello from i = " << i << std::endl;
  });

Also, you should be using the nvcc_wrapper to compile CUDA builds.

The snip below may help get you started, a couple comments:

  1. Kokkos does not require UVM, but most of Trilinos does. The configure option is enabled in the snip below, along with some environment variables that should be set.
  2. You may need to change the KOKKOS_ARCH variables depending on your type of CPU and GPU, see this link for architecture variables
export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper
# Set the flags below if using packages that require UVM
export CUDA_LAUNCH_BLOCKING=1
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1

cmake \
  -D CMAKE_INSTALL_PREFIX:PATH="${TRILINOS_INSTALL_PATH}" \
  -D TPL_ENABLE_MPI:BOOL=ON\
  -D TPL_ENABLE_CUDA:BOOL=ON\
  -D Trilinos_ENABLE_Kokkos:BOOL=ON \
  -D Kokkos_ENABLE_Cuda:BOOL=ON \
  -D Kokkos_ENABLE_Cuda_UVM:BOOL=ON \
  -D Kokkos_ENABLE_Cuda_Lambda:BOOL=ON \
  -D KOKKOS_ARCH="HSW,Pascal60" \
  ${TRILINOS_DIR}  

All 2 comments

@DABH A couple code changes are necessary to get this running when CUDA is enabled:

  1. replace the [=] capture with the macro KOKKOS_LAMBDA.
  2. replace std::cout usage with a printf

The KOKKOS_LAMBDA macro will provide markings necessary to compile the kernel so that it is capable of executing on the GPU. Lambda support must also be enabled in your CMake configuration options (I post a starting cmake line below).

Since you are providing an iteration range, 15, to the parallel_for it is compiling the code for the default execution space - you are compiling with CUDA enabled so this is the default execution space. std::cout is not supported within CUDA kernels AFAIK and should be replaced with printf statements. If you did intend for this to execute on the host without making changes, then a policy making an explicit request for a host execution space must be supplied, e.g. use of a RangePolicy like so:

  Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0,15), [=](const int i) {
    std::cout << "Hello from i = " << i << std::endl;
  });

Also, you should be using the nvcc_wrapper to compile CUDA builds.

The snip below may help get you started, a couple comments:

  1. Kokkos does not require UVM, but most of Trilinos does. The configure option is enabled in the snip below, along with some environment variables that should be set.
  2. You may need to change the KOKKOS_ARCH variables depending on your type of CPU and GPU, see this link for architecture variables
export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper
# Set the flags below if using packages that require UVM
export CUDA_LAUNCH_BLOCKING=1
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1

cmake \
  -D CMAKE_INSTALL_PREFIX:PATH="${TRILINOS_INSTALL_PATH}" \
  -D TPL_ENABLE_MPI:BOOL=ON\
  -D TPL_ENABLE_CUDA:BOOL=ON\
  -D Trilinos_ENABLE_Kokkos:BOOL=ON \
  -D Kokkos_ENABLE_Cuda:BOOL=ON \
  -D Kokkos_ENABLE_Cuda_UVM:BOOL=ON \
  -D Kokkos_ENABLE_Cuda_Lambda:BOOL=ON \
  -D KOKKOS_ARCH="HSW,Pascal60" \
  ${TRILINOS_DIR}  

You rock @ndellingwood !! :metal: Thank you so much for the fast and very helpful reply.

I suspect the lambda capture was the culprit -- I had been going based off the tutorial at https://github.com/trilinos/trilinos_tutorial/wiki/KokkosTutorial01b but I guess that is way out of date (should it be removed/updated?). I just found https://github.com/trilinos/Trilinos/blob/master/packages/kokkos/example/tutorial/01_hello_world_lambda/hello_world_lambda.cpp which looks very similar but has the correct semantics. Anyway, taking your code suggestions allowed my program to finally compile.

Now it seems I'm running into having the wrong KOKKOS_ARCH ("Kokkos::Cuda::initialize ERROR: running kernels compiled for compute capability 3.5 (< 5.0) on device with compute capability 5.2 (>=5.0), this would give incorrect results!") but that is not surprising based on how I configured Trilinos. I'll follow your instructions, set that variable properly, and should hopefully be good to go. Many thanks again!

Was this page helpful?
0 / 5 - 0 ratings