In my system monitor, I see individual build processes from an Alpaka build going up to 5-6 GB of RSS. This means that a mid-range Linux laptop with 8 GB of RAM can only afford a single build process before swapping, and a Windows machine with that much RAM might not be able to build Alpaka without swapping at all (since Windows' baseline RAM consumption is closer to 3-4 GB). Such build requirements seem excessive, and may hint at a template/header bloat problem.
I have productively used clang 9+'s -ftime-trace profiling feature to investigate such issues in projects that I am working on, and would recommend it over templight. It is not a memory profiling tool (it only measures build time, and build time is an imperfect proxy for build memory usage), but it is far easier to use and works well enough most of the time. No special clang build, no protobuf files that require further analysis: just re-run the offending compilation with clang++ -ftime-trace, take the generated JSON file, and load it into chrome://tracing for inspection.
For extra context, the build's configuration is...
ALPAKA_ACC_CPU_BT_OMP4_ENABLE OFF
ALPAKA_ACC_CPU_B_OMP2_T_SEQ_EN ON
ALPAKA_ACC_CPU_B_SEQ_T_FIBERS_ ON
ALPAKA_ACC_CPU_B_SEQ_T_OMP2_EN ON
ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENA ON
ALPAKA_ACC_CPU_B_SEQ_T_THREADS ON
ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENA ON
ALPAKA_ACC_GPU_CUDA_ENABLE OFF
ALPAKA_ACC_GPU_CUDA_ONLY_MODE OFF
ALPAKA_ACC_GPU_HIP_ENABLE OFF
ALPAKA_ACC_GPU_HIP_ONLY_MODE OFF
ALPAKA_BLOCK_SHARED_DYN_MEMBER 30
ALPAKA_CXX_STANDARD 14
ALPAKA_DEBUG 0
ALPAKA_DEBUG_OFFLOAD_ASSUME_HO ON
ALPAKA_EMU_MEMCPY3D OFF
ALPAKA_HIP_PLATFORM nvcc
ALPAKA_USE_INTERNAL_CATCH2 ON
BUILD_TESTING ON
Boost_FIBER_LIBRARY_RELEASE /usr/lib64/libboost_fiber.so.1.71.0
Boost_INCLUDE_DIR /usr/include
CMAKE_BUILD_TYPE Release
CMAKE_INSTALL_PREFIX /usr/local
CUDA_HOST_COMPILER
CUDA_SDK_ROOT_DIR CUDA_SDK_ROOT_DIR-NOTFOUND
RT_LIBRARY /usr/lib64/librt.so
alpaka_BUILD_EXAMPLES ON
...the compiler is GCC 10.1, and the top offending files seem to be:
Kernel<something> unit tests (many of these have a max RSS above 3 GB, but KernelWithTemplateArgumentDeduction specifically reaches more than 6 GB and is scheduled at the same time as the math tests (3 GB) and the BufTests (3.3 GB))
ViewPlainPtrTest goes up to 5.0 GB, and is scheduled at the same time as ViewSubViewTest, which also goes up to 5.0 GB.
Since all of these are unit tests, disabling them is an obvious workaround, but given the number of affected tests and the deceptive simplicity of some of them, I'm sure that such issues will eventually crop up in applications using Alpaka as well. Further, this creates a barrier to prospective alpaka contributors.
On the same topic of contributor friendliness, several of these tests also take multiple minutes to build (2-5 min on my machine), which is not pleasant either for an alpaka contributor trying to iterate on a failing unit test.
Hello @HadrienG2 , thanks for investigating and providing the list of the most compile-time-consuming kernels. I do not have an immediate reason why it all takes so long. But we are definitely aware of this issue (we also have all unit tests running on CI for each PR), and need to take a look there. Thanks for giving it a push!
The high memory consumption of C++ code that makes heavy use of templates is a known problem; I do not think we can do anything about it.
A possible way out for a library provider who is using alpaka as a backend is to provide pre-compiled libraries for commonly used types.
Maybe it is possible for us as alpaka developers to provide pre-compiled parts of alpaka, but we have not evaluated it.
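To illustrate what pre-compiling for commonly used types could look like, here is a minimal sketch based on explicit template instantiation. The function `axpy` and the file names in the comments are hypothetical and are not part of alpaka; whether alpaka's own templates can be treated this way has not been evaluated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// --- would live in a public header (hypothetical name: axpy.hpp) ---

// Hypothetical stand-in for a heavily templated library routine.
template <typename T>
void axpy(T a, const std::vector<T>& x, std::vector<T>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += a * x[i];
    }
}

// Explicit instantiation declarations: translation units that include the
// header are told that these specializations are provided elsewhere.
extern template void axpy<float>(float, const std::vector<float>&, std::vector<float>&);
extern template void axpy<double>(double, const std::vector<double>&, std::vector<double>&);

// --- would live in exactly one source file (hypothetical name: axpy.cpp) ---

// Explicit instantiation definitions: the code is generated once, here,
// and can be shipped as part of a pre-compiled library.
template void axpy<float>(float, const std::vector<float>&, std::vector<float>&);
template void axpy<double>(double, const std::vector<double>&, std::vector<double>&);
```

Client code calling `axpy` with `float` or `double` then links against the single pre-built instantiation, while any other type is still instantiated on demand as usual.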
One reason for the high memory consumption is that all available accelerators are automatically enabled for testing. A typical use case is to build your code with only a few accelerators enabled.
The high memory consumption of C++ code that makes heavy use of templates is a known problem; I do not think we can do anything about it.
Well, the reason why I am giving tooling suggestions at the beginning of this issue is that I am working on this problem in some other C++ projects, so I know that some things can be done ;)
The general strategy that I know of is to...
extern template to have client code use it
Thanks @HadrienG2 .
I think some of alpaka's internals are definitely guilty of violating this one:
Make sure that code which needs to be templated has as few template parameters as possible (e.g. let's say you have a device-independent but dimension-generic algorithm region, then it should be moved to a code region which is not templated on device but templated on dimension).
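A minimal sketch of the rule quoted above, using made-up names (`CpuSerial`, `CpuThreads`, `linearize`, `Indexer`) that are not alpaka's API: the dimension-generic index arithmetic is hoisted into a function templated on the dimension only, and the device-dependent type merely forwards to it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical device tags standing in for accelerator types.
struct CpuSerial {};
struct CpuThreads {};

// Templated on the dimension only: instantiated once per Dim,
// no matter how many device types end up calling it.
template <std::size_t Dim>
std::size_t linearize(const std::array<std::size_t, Dim>& idx,
                      const std::array<std::size_t, Dim>& extent) {
    std::size_t linear = 0;
    for (std::size_t d = 0; d < Dim; ++d) {
        assert(idx[d] < extent[d]);
        linear = linear * extent[d] + idx[d];
    }
    return linear;
}

// Device-dependent wrapper, kept as thin as possible: only this small
// forwarding layer is instantiated per (Device, Dim) pair.
template <typename Device, std::size_t Dim>
struct Indexer {
    std::array<std::size_t, Dim> extent;

    std::size_t operator()(const std::array<std::size_t, Dim>& idx) const {
        return linearize<Dim>(idx, extent);
    }
};
```

With N devices and M dimensions, the expensive part is instantiated M times instead of N x M times; only the thin wrapper still scales with the product.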
Alpaka unit tests violate the following rule on purpose:
Make sure that nothing silly (e.g. exponential number of type/function instantiations) happens at compilation time.
This is the main reason why the memory consumption and compile time are high.
The unit tests are compiled for the Cartesian product of available accelerators (~8), different dimensionalities (~4), different index types (~8) and, in some tests, even more template parameters to get the coverage as high as possible.
What this means is that we compile each test case in ~256 different versions (numbers are only rough estimates). The result is higher memory consumption and long compilation times, which we have to live with if we want the high coverage that developers should always aim for.
As a user of alpaka you will not compile your program with all of those combinations. A specific kernel is only compiled with one set of template parameters, so compilation time should not be an issue there. The unit tests are not real-world use cases.
@BenjaminW3 How about splitting these different configurations into different compilation units?
@HadrienG2 I think with how the tests are currently done this is not easily possible, as everything is templatized. I am not saying this is not possible in principle tho.
@sbastrakov One way I can think of (not necessarily the best way) is to have a "define-based test instance" cpp file which instantiates the generic test with certain template parameters based on a set of compiler defines.
Then you instruct CMake to make multiple executables out of this file and add them as tests, each using one set of defines per desired combination of template parameters.
That's a little bit ugly (like everything which involves CMake scripting), but as long as you don't need multiple template instantiations at the same time it should work.
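As a rough sketch of that idea, and not of how alpaka's tests are actually written, a define-based test instance file could look as follows. The macro names `TEST_ACC`, `TEST_DIM` and `TEST_IDX`, the tag types and the toy test body are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical accelerator tags standing in for real accelerator types.
struct AccSeq {};
struct AccThreads {};

// Defaults so that the file also compiles on its own. A build script would
// override them per target, e.g. with -DTEST_ACC=AccThreads -DTEST_DIM=3.
#ifndef TEST_ACC
#define TEST_ACC AccSeq
#endif
#ifndef TEST_DIM
#define TEST_DIM 2
#endif
#ifndef TEST_IDX
#define TEST_IDX std::uint32_t
#endif

// Generic test body, templated as before. Acc is unused in this toy body;
// a real test would run a kernel on it.
template <typename Acc, std::size_t Dim, typename Idx>
Idx countElements(Idx extentPerDim) {
    assert(extentPerDim > 0);
    Idx total = 1;
    for (std::size_t d = 0; d < Dim; ++d) {
        total *= extentPerDim;
    }
    return total;
}

// The only instantiation in this translation unit: one combination of
// template parameters per compiler invocation.
TEST_IDX runSelectedTest(TEST_IDX extentPerDim) {
    return countElements<TEST_ACC, TEST_DIM, TEST_IDX>(extentPerDim);
}
```

Each compiler process then only pays for one combination, at the price of many more (but smaller and independently schedulable) build targets.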
Yes, we could do that. It would perhaps be even easier to just enable one accelerator for the whole test suite. Btw, one could do it manually as well (but it's a little wordy, as each accelerator has to be enabled/disabled separately). Ideally, of course, this would also come with a script that iterates over all of them.
It should be possible to split some of those test cpp files into more compilation units until we only have one test case per file.
That's also true, and probably makes sense even for easier navigation @BenjaminW3 . I think we don't really care if there are many files for unit tests, as they are all grouped by folders anyways.