Alpaka: CUDA backend requires nvcc also for library calls

Created on 8 Jan 2020 · 8 Comments · Source: alpaka-group/alpaka

Using Alpaka with the CUDA back-end requires compiling much more code with nvcc than using native CUDA would.

In general, when writing CUDA code, only the files with device code (__device__ functions and __global__ kernels) have to be compiled with nvcc, while host code that calls the CUDA runtime API functions (e.g. cudaMalloc(), cudaMemcpy(), cudaLaunchKernel(), ...) can be compiled with gcc.
A simple example can be found in simple-kernel/cuda. Note that while kernel.cu is compiled by nvcc, test.cc is compiled by gcc.

In contrast, writing Alpaka code for the CUDA back-end requires also building with nvcc the part that allocates memory buffers and launches kernels.
The corresponding example can be found in simple-kernel/alpaka. Note how both kernel.cc and test.cc need to be compiled by nvcc.

Given that nvcc lags behind in terms of support for C++ standards and features, this severely limits what code can interoperate with Alpaka.

What kind of changes would be required inside Alpaka so that, as with native CUDA, the host-side part of the CUDA back-end could be built with the host compiler?

CUDA Question

All 8 comments

I am not sure if it is possible at all to achieve this with alpaka.

The first thing that prevents you from doing this is that ALPAKA_ADD_EXECUTABLE explicitly compiles all files as CUDA files, because it cannot differentiate between CUDA and non-CUDA files by file extension. This is done so that the files can keep their .cpp extension irrespective of whether the CUDA backend is enabled or not (so files never need a .cu extension).
This would have to be changed. However, I am not sure what implications this would have for the ability to compile without the CUDA backend.

I would say you cannot easily write alpaka code exactly like CUDA code. Alpaka's memcpy functions are templates, and to compile with two different compilers you need to create object files and link them together. The compiler must generate the binary code for the templated functions, which is only possible if it knows which types are used. The accelerators and traits always pull vendor- and compiler-specific elements into the user code. IMO this cannot be solved in a generic way.

For a C library like the CUDA API this is not a problem :-(

A possible general workaround is to use extern template, but this is currently not possible because we disable some definitions, e.g. the CUDA accelerator, when CUDA is not available. Extern templates are also something I personally do not like, because it is very easy to produce linker errors.
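To make the extern template idea concrete, here is a minimal sketch in plain C++. It does not use alpaka at all: copyBuffer is a hypothetical stand-in for a templated copy routine, and the three parts, shown in one file for brevity, would normally live in a shared header, in the one source file built by the device-capable compiler, and in a host-only source file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ---- shared header (hypothetical) ----------------------------------------
// Stand-in for a templated copy routine; not an alpaka function.
template <typename T>
void copyBuffer(T* dst, T const* src, std::size_t n)
{
    assert(n == 0 || (dst != nullptr && src != nullptr));
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

// Explicit instantiation declaration: a translation unit that sees this line
// does not instantiate copyBuffer<float> itself and relies on the linker
// finding the instantiation in another object file.
extern template void copyBuffer<float>(float*, float const*, std::size_t);

// ---- exactly one source file (the "device compiler" side) ----------------
// Explicit instantiation definition: emits the code for copyBuffer<float>.
template void copyBuffer<float>(float*, float const*, std::size_t);

// ---- host-only source file -----------------------------------------------
// Uses the template without generating its code in this translation unit.
std::vector<float> duplicate(std::vector<float> const& in)
{
    std::vector<float> out(in.size());
    copyBuffer<float>(out.data(), in.data(), in.size());
    return out;
}
```

If the explicit instantiation definition is missing, or its types do not match the declaration exactly, the failure only shows up as an unresolved symbol at link time, which is the fragility mentioned above.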

Another issue is that alpaka.hpp includes all alpaka headers. Depending on the available backends, CMake sets defines to enable or disable some code. This is done for all files being compiled and not only for some "device code" files. As a result, incompatible code becomes visible to the host compiler and preprocessor checks fail. Fixing this may require splitting the alpaka headers into those used within kernels and those used from the outside.

I can probably come up with more problems but alpaka is simply not meant to support this scenario ;-(

@fwyzard you have the same problem with CUDA and cudaLaunchKernel when your kernel uses templates. You somehow need to tell the nvcc-compiled part that the kernel is called with these template arguments, otherwise the CUDA code will crash at runtime because the kernel binary blob cannot be found.

Another way to work around it is to wrap alpaka memory copies in C functions (that is what I do in CUPLA) and compile them for a fixed accelerator and maybe buffer type.
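A rough sketch of that wrapper pattern, in plain C++ without alpaka or CUPLA: detail::templatedCopy and detail::FixedAcc are hypothetical stand-ins for the templated copy and the fixed accelerator type, and wrapped_copy_float is an illustrative name, not a CUPLA function. Only the extern "C" declaration would need to be visible to code built with the other compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace detail
{
    // Hypothetical tag standing in for one fixed accelerator type.
    struct FixedAcc
    {
    };

    // Hypothetical stand-in for a templated, accelerator-dependent copy.
    template <typename TAcc, typename T>
    void templatedCopy(T* dst, T const* src, std::size_t count)
    {
        assert(dst != nullptr && src != nullptr);
        std::memcpy(dst, src, count * sizeof(T));
    }
} // namespace detail

// C-linkage wrapper: the accelerator and element type are fixed here, so
// callers need no templates and none of the back-end headers.
// Returns 0 on success and -1 on invalid arguments.
extern "C" int wrapped_copy_float(float* dst, float const* src, std::size_t count)
{
    if (count == 0)
        return 0;
    if (dst == nullptr || src == nullptr)
        return -1;
    detail::templatedCopy<detail::FixedAcc>(dst, src, count);
    return 0;
}
```

The price is one wrapper per combination of accelerator and element type that the host-only code needs.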

If the limited C++ feature-set of CUDA is a problem, you can still divide your code into:
1) device kernel code (uses alpaka)
2) host code invoking the kernel (uses alpaka)
3) other c++ code (does not use alpaka)

With native CUDA you would compile 1 with nvcc and 2+3 with another compiler.
With alpaka you can compile 1+2 with nvcc and 3 with another compiler.

@fwyzard Just to mention, at least in PIConGPU we compile

  1. a library with host-only code (this can use newer C++ standards)
  2. an executable linked against the library in 1. that uses ALPAKA_ADD_EXECUTABLE (well, cupla_add_executable) to build host-device code.

https://github.com/ComputationalRadiationPhysics/picongpu/blob/4a90da17cc39678c79b915257fa66aa973338088/include/picongpu/CMakeLists.txt#L393-L417

Thank you all for your comments... I'll take some time to consider the options and get back here in a few days.
