I'm on master of Trilinos as of now. I'm on Debian with nvcc 9.2.148, g++/gcc 8.3.0, and OpenMPI. Trying to use a build of Trilinos that supports OpenMP, CUDA, and/or OpenMPI.
I'm trying to run a simple example using Kokkos with a lambda like
Kokkos::parallel_for(15, [=](const int i) {
std::cout << "Hello from i = " << i << std::endl;
});
However this gives the following error:
/home/foo/trilinos-install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(451): error: The closure type for a lambda ("lambda [](int)->void", defined at /home/foo/code/main.cpp:10) cannot be used in the template argument type of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function, or the lambda is an 'extended lambda' and the flag --expt-extended-lambda is specified
For the life of me I've tried setting --expt-extended-lambda in many ways but nothing seems to resolve this error. AFAICT the flag truly is getting passed to the compiler, but still running into this issue. Any ideas how to debug this and/or determine if it's a user error or real bug with Trilinos/Kokkos? Does anyone have a working setup that builds an example program against a Trilinos that supports OpenMP, CUDA, and OpenMPI?
The compile command I end up running looks like
nvcc --expt-relaxed-constexpr --expt-extended-lambda --expt-relaxed-constexpr --expt-extended-lambda -ccbin g++ -arch=sm_35 -I/home/foo/code/. -I/home/foo/trilinos-install/include -O3 -DNDEBUG -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -I/usr/lib/x86_64-linux-gnu/openmpi/include -Xcompiler -std=gnu++14,-pthread -x cu /home/foo/code/main.cpp -c -o CMakeFiles/code/main.cpp.o
though I've tried a number of variants as well, all with the same result.
My Trilinos install was configured using
export NVCC_WRAPPER_DEFAULT_COMPILER=mpicxx && cmake ../Trilinos -DTrilinos_ENABLE_Kokkos=ON -DTrilinos_ENABLE_OpenMP=ON -DTrilinos_ENABLE_Pthread=ON -DTPL_ENABLE_CUDA=ON -DTrilinos_ENABLE_Tpetra=ON -DTrilinos_ENABLE_Belos=ON -DTrilinos_ENABLE_Amesos2=ON -DTrilinos_ENABLE_MueLu=ON -DTpetra_INST_SERIAL=ON -DTpetra_INST_OPENMP=ON -DKOKKOS_ENABLE_OPENMP=ON -DKokkos_ENABLE_Cuda_Lambda=ON -DTPL_ENABLE_MPI=ON -DCMAKE_CXX_FLAGS="--expt-extended-lambda" -DKokkos_ENABLE_Cuda_UVM=ON -DKOKKOS_ENABLE_CUDA_UVM=ON -DKokkos_ENABLE_CXX11_DISPATCH_LAMBDA=ON -DKokkos_ENABLE_Cuda_Lambda=ON -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=/home/foo/Trilinos/packages/kokkos/bin/nvcc_wrapper -DCMAKE_INSTALL_PREFIX=/home/foo/trilinos-install
Thanks in advance for any help!
@DABH A couple code changes are necessary to get this running when CUDA is enabled:
[=] capture with the macro KOKKOS_LAMBDA. std::cout usage with a printfThe KOKKOS_LAMBDA macro will provide markings necessary to compile the kernel so that it is capable of executing on the GPU. Lambda support must also be enabled in your CMake configuration options (I post a starting cmake line below).
Since you are providing an iteration range, 15, to the parallel_for it is compiling the code for the default execution space - you are compiling with CUDA enabled so this is the default execution space. std::cout is not supported within CUDA kernels AFAIK and should be replaced with printf statements. If you did intend for this to execute on the host without making changes, then a policy making an explicit request for a host execution space must be supplied, e.g. use of a RangePolicy like so:
Kokkos::parallel_for(Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0,15), [=](const int i) {
std::cout << "Hello from i = " << i << std::endl;
});
Also, you should be using the nvcc_wrapper to compile CUDA builds.
The snip below may help get you started, a couple comments:
KOKKOS_ARCH variables depending on your type of CPU and GPU, see this link for architecture variablesexport OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper
# Set the flags below if using packages that require UVM
export CUDA_LAUNCH_BLOCKING=1
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
cmake \
-D CMAKE_INSTALL_PREFIX:PATH="${TRILINOS_INSTALL_PATH}" \
-D TPL_ENABLE_MPI:BOOL=ON\
-D TPL_ENABLE_CUDA:BOOL=ON\
-D Trilinos_ENABLE_Kokkos:BOOL=ON \
-D Kokkos_ENABLE_Cuda:BOOL=ON \
-D Kokkos_ENABLE_Cuda_UVM:BOOL=ON \
-D Kokkos_ENABLE_Cuda_Lambda:BOOL=ON \
-D KOKKOS_ARCH="HSW,Pascal60" \
${TRILINOS_DIR}
You rock @ndellingwood !! :metal: Thank you so much for the fast and very helpful reply.
I suspect the lambda capture was the culprit -- I had been going based off the tutorial at https://github.com/trilinos/trilinos_tutorial/wiki/KokkosTutorial01b but I guess that is way out of date (should it be removed/updated?). I just found https://github.com/trilinos/Trilinos/blob/master/packages/kokkos/example/tutorial/01_hello_world_lambda/hello_world_lambda.cpp which looks very similar but has the correct semantics. Anyway, taking your code suggestions allowed my program to finally compile.
Now it seems I'm running into having the wrong KOKKOS_ARCH ("Kokkos::Cuda::initialize ERROR: running kernels compiled for compute capability 3.5 (< 5.0) on device with compute capability 5.2 (>=5.0), this would give incorrect results!") but that is not surprising based on how I configured Trilinos. I'll follow your instructions, set that variable properly, and should hopefully be good to go. Many thanks again!
Most helpful comment
@DABH A couple code changes are necessary to get this running when CUDA is enabled:
[=]capture with the macroKOKKOS_LAMBDA.std::coutusage with aprintfThe
KOKKOS_LAMBDAmacro will provide markings necessary to compile the kernel so that it is capable of executing on the GPU. Lambda support must also be enabled in your CMake configuration options (I post a starting cmake line below).Since you are providing an iteration range, 15, to the
parallel_forit is compiling the code for the default execution space - you are compiling with CUDA enabled so this is the default execution space.std::coutis not supported within CUDA kernels AFAIK and should be replaced withprintfstatements. If you did intend for this to execute on the host without making changes, then a policy making an explicit request for a host execution space must be supplied, e.g. use of aRangePolicylike so:Also, you should be using the
nvcc_wrapperto compile CUDA builds.The snip below may help get you started, a couple comments:
KOKKOS_ARCHvariables depending on your type of CPU and GPU, see this link for architecture variables