Alpaka: Build RAM requirements seem excessive

Created on 30 Jun 2020 · 12 Comments · Source: alpaka-group/alpaka

In my system monitor, I see individual build processes from an Alpaka build going up to 5-6 GB of RSS. This means that a mid-range Linux laptop with 8 GB of RAM can only afford a single build process before swapping, and a Windows machine with that much RAM might not be able to build Alpaka without swapping at all (since Windows' baseline RAM consumption is closer to 3-4 GB). Such build requirements seem excessive, and may hint at a template/header bloat problem.

I have productively used the -ftime-trace profiling feature of clang 9+ to investigate such issues in projects that I am working on, and would recommend it over templight. It is not a memory profiling tool (it only measures build time, which is an imperfect proxy for build memory usage), but it is far easier to use and works well enough most of the time. No special clang build, no protobuf files that require further analysis: just re-run the offending compilation with clang++ -ftime-trace, take the generated JSON file, and load it into chrome://tracing for inspection.

All 12 comments

For extra context, the build's configuration is...

 ALPAKA_ACC_CPU_BT_OMP4_ENABLE    OFF
 ALPAKA_ACC_CPU_B_OMP2_T_SEQ_EN   ON
 ALPAKA_ACC_CPU_B_SEQ_T_FIBERS_   ON
 ALPAKA_ACC_CPU_B_SEQ_T_OMP2_EN   ON
 ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENA   ON
 ALPAKA_ACC_CPU_B_SEQ_T_THREADS   ON
 ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENA   ON
 ALPAKA_ACC_GPU_CUDA_ENABLE       OFF
 ALPAKA_ACC_GPU_CUDA_ONLY_MODE    OFF
 ALPAKA_ACC_GPU_HIP_ENABLE        OFF
 ALPAKA_ACC_GPU_HIP_ONLY_MODE     OFF
 ALPAKA_BLOCK_SHARED_DYN_MEMBER   30
 ALPAKA_CXX_STANDARD              14
 ALPAKA_DEBUG                     0
 ALPAKA_DEBUG_OFFLOAD_ASSUME_HO   ON
 ALPAKA_EMU_MEMCPY3D              OFF
 ALPAKA_HIP_PLATFORM              nvcc
 ALPAKA_USE_INTERNAL_CATCH2       ON
 BUILD_TESTING                    ON
 Boost_FIBER_LIBRARY_RELEASE      /usr/lib64/libboost_fiber.so.1.71.0
 Boost_INCLUDE_DIR                /usr/include
 CMAKE_BUILD_TYPE                 Release
 CMAKE_INSTALL_PREFIX             /usr/local
 CUDA_HOST_COMPILER               
 CUDA_SDK_ROOT_DIR                CUDA_SDK_ROOT_DIR-NOTFOUND
 RT_LIBRARY                       /usr/lib64/librt.so
 alpaka_BUILD_EXAMPLES            ON

...the compiler is GCC 10.1, and the top offending files seem to be:

  • Kernel<something> unit tests (Many of these have a max-RSS >3GB, but KernelWithTemplateArgumentDeduction specifically reaches >6GB and is scheduled at the same time as the math tests (3GB) and the BufTests (3.3GB))
  • ViewPlainPtrTest goes up to 5.0 GB, and is scheduled at the same time as ViewSubViewTest which also goes up to 5.0 GB.

Since all of these are unit tests, disabling them is an obvious workaround, but given the number of affected tests and the deceptive simplicity of some of them, I'm sure that such issues will eventually crop up in applications using Alpaka as well. Further, this creates a barrier for prospective alpaka contributors.

On the same topic of contributor friendliness, several of these tests also take multiple minutes to build (2-5 min on my machine), which is not pleasant either for an alpaka contributor trying to iterate on a failing unit test.

Hello @HadrienG2, thanks for investigating and providing the list of the most expensive tests to compile. I do not have an immediate explanation for why it all takes so long. But we are definitely aware of this issue (we also have all unit tests running on CI for each PR), and need to take a look at it. Thanks for giving it a push!

The memory consumption of C++ code with a heavy use of templates is a known issue; I don't think we can do anything about it.
One possible approach for a library provider who uses alpaka as a backend is to provide pre-compiled libraries for commonly used types.
Maybe it is possible for us as alpaka developers to provide pre-compiled parts of alpaka, but we have not evaluated it.

One reason for the high memory consumption is that all available accelerators are automatically enabled for testing. A typical use case is to build your code with only a few accelerators enabled.

The memory consumption of C++ code with a heavy use of templates is a known issue; I don't think we can do anything about it.

Well, the reason why I am giving tooling suggestions at the beginning of this issue is that I am working on this problem in some other C++ projects, so I know that some things can be done ;)

The general strategy that I know of is to...

  • Make sure that nothing silly (e.g. exponential number of type/function instantiations) happens at compilation time.

    • This problem is more frequent than one would expect when using template metaprogramming techniques, and spotting it is perhaps what compiler profiling is most powerful for.

  • Move code which doesn't need to be templated (e.g. backend-independent logic) out of templates.

    • This can be done partially, by e.g. extracting template parameter agnostic utility functions.

  • Carefully consider the build time vs runtime tradeoff of static vs dynamic dispatch.

    • Sometimes an OOP style virtual function based design is good enough.

  • Make sure that code which needs to be templated has as few template parameters as possible.

    • E.g. let's say you have an algorithm which is device-independent but dimension-generic, then it should be moved to a code region which is templated on dimension but not on device.

  • Make sure that code which does not need to be in a header (not templated, not hot at runtime) lives in cpp files.

    • I know that header-only libraries are hip these days, but we're talking about improving build performance here, not blindly following trends without questioning their implications ;)

  • Pre-instantiate common templates which are not too hot and use extern template to have client code use them.

    • This is super easy to get wrong because there is no compiler error message when the extern template is not used. So you will want to split your template header files into declaration and definition parts and test the extern template instantiations with only the declaration part of the template visible.
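To make the last point concrete, here is a minimal single-file sketch of the extern template split. The file names in the comments and the sum_squares function are hypothetical, chosen only for illustration; in a real project the three parts would live in separate files, and client code would include only the declaration part.

```cpp
#include <cassert>
#include <vector>

// --- sum_squares_decl.hpp (hypothetical): declaration part ---
template <typename T>
T sum_squares(std::vector<T> const& values);

// Explicit instantiation declarations: a translation unit that sees these
// relies on an instantiation provided elsewhere instead of making its own.
extern template float sum_squares<float>(std::vector<float> const&);
extern template double sum_squares<double>(std::vector<double> const&);

// --- sum_squares_impl.hpp (hypothetical): definition part ---
template <typename T>
T sum_squares(std::vector<T> const& values)
{
    T acc{};
    for (T const& v : values)
        acc += v * v;
    return acc;
}

// --- sum_squares.cpp (hypothetical): the one place that instantiates ---
template float sum_squares<float>(std::vector<float> const&);
template double sum_squares<double>(std::vector<double> const&);

// Tiny smoke check; in the split layout it would only need the declaration part.
inline void check_sum_squares()
{
    assert(sum_squares(std::vector<double>{1.0, 2.0, 3.0}) == 14.0);
}
```

If a test that includes only the declaration part fails to link, one of the explicit instantiations is missing, which is exactly the mistake that otherwise goes unnoticed.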

Thanks @HadrienG2 .

I think some of alpaka's internals are definitely guilty of violating this one:

Make sure that code which needs to be templated has as few template parameters as possible (e.g. let's say you have a device-independent but dimension-generic algorithm region, then it should be moved to a code region which is not templated on device but templated on dimension).
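As an illustration of that rule, here is a minimal sketch that is not taken from alpaka: the device tags, the Kernel struct and the linearize helper are all hypothetical names. The index arithmetic depends only on the dimensionality, so it is templated on Dim alone and gets instantiated once per dimension instead of once per (device, dimension) pair.

```cpp
#include <array>
#include <cstddef>

// Depends only on the dimensionality, so it is templated on Dim only.
template <std::size_t Dim>
constexpr std::size_t linearize(std::array<std::size_t, Dim> const& idx,
                                std::array<std::size_t, Dim> const& extent)
{
    std::size_t linear = 0;
    for (std::size_t d = 0; d < Dim; ++d)
        linear = linear * extent[d] + idx[d];
    return linear;
}

// Hypothetical device tags standing in for real accelerator types.
struct CpuSerial {};
struct CpuThreads {};

// Device-dependent code stays thin and forwards to the shared helper,
// so adding a device does not add another copy of the index arithmetic.
template <typename Device, std::size_t Dim>
struct Kernel
{
    std::array<std::size_t, Dim> extent;

    constexpr std::size_t offset(std::array<std::size_t, Dim> const& idx) const
    {
        return linearize<Dim>(idx, extent);
    }
};
```

Kernel<CpuSerial, 2> and Kernel<CpuThreads, 2> both reuse the same linearize<2> instantiation.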

Alpaka unit tests violate the following rule on purpose:

Make sure that nothing silly (e.g. exponential number of type/function instantiations) happens at compilation time.

This is the main reason that the memory consumption and compile time are high.
The unit tests are compiled for the cartesian product of available accelerators (~8), different dimensionalities (~4), different index types (~8), and in some tests even more template parameters, to get the coverage as high as possible.
What this means is that we compile each test case in ~256 different versions (numbers are only rough estimates). The result is higher memory consumption and long compilation times, which we have to live with if we want the high coverage that developers should always aim for.

As a user of alpaka you will not compile your program with all of those combinations. A specific kernel is only compiled with one set of template parameters, so compilation time should not be an issue there. The unit tests are not real-world use cases.
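The arithmetic behind that estimate can be checked with a small sketch. The constant names are made up for illustration and the values are the rough estimates quoted above, not measured numbers.

```cpp
#include <cstddef>
#include <initializer_list>

// Rough axis sizes from the estimate above (hypothetical constant names).
constexpr std::size_t kAccelerators = 8;
constexpr std::size_t kDimensionalities = 4;
constexpr std::size_t kIndexTypes = 8;

// Number of instantiations of one test case: the product of all axis sizes.
constexpr std::size_t instantiation_count(std::initializer_list<std::size_t> axis_sizes)
{
    std::size_t count = 1;
    for (std::size_t n : axis_sizes)
        count *= n;
    return count;
}

static_assert(instantiation_count({kAccelerators, kDimensionalities, kIndexTypes}) == 256,
              "8 accelerators x 4 dimensionalities x 8 index types");
```

Under the same assumptions, a build with a single accelerator enabled would instantiate each test case 1 x 4 x 8 = 32 times.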

@BenjaminW3 How about splitting these different configurations into different compilation units?

@HadrienG2 I think that with how the tests are currently written this is not easily possible, as everything is templated. I am not saying it is impossible in principle, though.

@sbastrakov One way I can think of (not necessarily the best way) is to have a "define-based test instance" cpp file which instantiates the generic test with certain template parameters based on a set of compiler defines.

Then you instruct CMake to make multiple executables out of this file and add them as tests, each using one set of defines per desired combination of template parameters.

That's a little bit ugly (like everything which involves CMake scripting), but as long as you don't need multiple template instantiations at the same time it should work.
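A minimal sketch of what such a define-based instance file could look like. The TEST_DIM and TEST_IDX macros and the element_count stand-in are hypothetical and not part of alpaka; the defaults only exist so that the file also compiles without any defines.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical selection macros, set per executable by the build system,
// e.g. -DTEST_DIM=2 -DTEST_IDX=std::uint32_t.
#ifndef TEST_DIM
#define TEST_DIM 1
#endif
#ifndef TEST_IDX
#define TEST_IDX std::size_t
#endif

// Stand-in for a generic test body templated on dimension and index type.
template <std::size_t Dim, typename Idx>
constexpr Idx element_count(Idx extent_per_dim)
{
    Idx total = 1;
    for (std::size_t d = 0; d < Dim; ++d)
        total *= extent_per_dim;
    return total;
}

// The only combination this translation unit instantiates for the test run.
constexpr TEST_IDX run_selected_test()
{
    return element_count<TEST_DIM, TEST_IDX>(4);
}

// Another combination, shown only to illustrate what a different set of defines selects.
static_assert(element_count<3, std::uint32_t>(4) == 64, "4^3 elements");
```

On the CMake side, one way to do this would be one executable per combination, each with its own target_compile_definitions call, so that every compiler process only sees a single combination.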

Yes, we could do that. Perhaps it would be even easier to just enable one accelerator for the whole test suite. By the way, one could do it manually as well (but it's a little wordy, as each accelerator has to be enabled/disabled separately). Ideally, of course, there would also be a script that iterates over all of them.

It should be possible to split some of those test cpp files into more compilation units until we only have one test case per file.

That's also true, and probably makes sense even just for easier navigation, @BenjaminW3. I think we don't really care if there are many files for unit tests, as they are all grouped by folders anyway.
