Alpaka: number of threads per block

Created on 23 May 2018 · 7 Comments · Source: alpaka-group/alpaka

I am still playing with the vectorAdd example. When I execute the program on different accelerators (Acc), the number of threads per block seems to be fixed according to the Acc I choose. For example:

AccCpuFibers: blockThreadExtent: (4) [I understand that blockThreadExtent is number of threads per block]
AccCpuThreads: blockThreadExtent: (256)
AccCpuOmp2Threads: blockThreadExtent: (32)

So the numbers of threads are 4, 256 and 32, respectively. Where does alpaka set the number of threads for each Acc?

Question

jiradaherbst

👍1

Most helpful comment

The number of threads per block can be chosen freely. It is part of the work division (index domain subdivision) as well as the number of blocks per grid and the number of elements per thread.
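To make the three parts of a work division concrete, here is a minimal one-dimensional sketch. `WorkDiv1D`, `coveredElements` and `covers` are hypothetical names invented for this illustration; they are not alpaka types or functions. The sketch only shows the arithmetic: the product of the three extents has to be at least as large as the problem size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 1D work division. The field names are illustrative only
// and are NOT alpaka's actual types.
struct WorkDiv1D
{
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Total number of elements this work division can address.
inline std::size_t coveredElements(WorkDiv1D const& wd)
{
    return wd.blocksPerGrid * wd.threadsPerBlock * wd.elementsPerThread;
}

// A work division is usable for a problem if it covers every element.
inline bool covers(WorkDiv1D const& wd, std::size_t problemSize)
{
    // Assumption of this sketch: every extent must be at least 1.
    assert(wd.blocksPerGrid > 0 && wd.threadsPerBlock > 0 && wd.elementsPerThread > 0);
    return coveredElements(wd) >= problemSize;
}
```

For example, 4 blocks of 256 threads with 1 element per thread cover 1024 elements, which is enough for a vector of length 1000, while 4 blocks of 4 threads are not.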

However, some accelerators are very limited in what they support. In particular, the number of threads per block is often limited by the hardware, because the threads of a block really have to be executed in parallel to enable thread synchronization. The limits allowed by a given accelerator type on a given device can be read out via alpaka::acc::getAccDevProps<Acc>(dev). This function returns an AccDevProps structure with all the limits.

A valid work division for a given problem size (index domain) depends on the accelerator and device in use. To make it easier to switch between different accelerators, alpaka provides an alpaka::workdiv::getValidWorkDiv helper function, which takes the given problem size, the accelerator, the device and some additional constraints, and calculates a valid work division for this accelerator.
This getValidWorkDiv helper function is used by the vectorAdd example. However, using it is not required (all the other examples simply hard-code the work division for the hard-coded accelerator).
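As a rough illustration of what such a helper has to do, the sketch below clamps a requested block size to a device limit and then rounds the number of blocks up so the whole problem is covered. It does not use alpaka and is not how getValidWorkDiv is implemented: `DeviceLimits`, `WorkDiv1D` and `makeValidWorkDiv` are hypothetical stand-ins, and the limit values in the usage note are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the per-device limits (NOT alpaka's AccDevProps).
struct DeviceLimits
{
    std::size_t maxThreadsPerBlock;
};

// Hypothetical 1D work division (NOT an alpaka type).
struct WorkDiv1D
{
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Clamp the requested threads per block to the device limit, then choose
// enough blocks to cover problemSize (ceiling division).
inline WorkDiv1D makeValidWorkDiv(
    std::size_t problemSize,
    DeviceLimits const& limits,
    std::size_t requestedThreadsPerBlock,
    std::size_t elementsPerThread = 1)
{
    assert(limits.maxThreadsPerBlock > 0);
    assert(requestedThreadsPerBlock > 0 && elementsPerThread > 0);

    std::size_t const threads = std::min(requestedThreadsPerBlock, limits.maxThreadsPerBlock);
    std::size_t const elementsPerBlock = threads * elementsPerThread;
    std::size_t const blocks = (problemSize + elementsPerBlock - 1) / elementsPerBlock;

    return WorkDiv1D{blocks, threads, elementsPerThread};
}
```

With a problem size of 1000 and a request for 256 threads per block, an assumed limit of 4 threads per block gives 250 blocks of 4 threads, while an assumed limit of 256 gives 4 blocks of 256 threads. The same call therefore produces different block sizes on different accelerators, which is the effect described in the question.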

BenjaminW3 on 26 May 2018

👍2

All 7 comments

@ax3l, thanks for your answer!

Exactly, a backend already chooses an "optimal" block size (number of
threads per block) depending on the target. One can still override it
with a C++ trait or derive a backend with different work-splitting.

Currently, the optimal sizes are calculated in

https://github.com/ComputationalRadiationPhysics/alpaka/blob/master/include/alpaka/workdiv/WorkDivHelpers.hpp

from device properties.

jiradaherbst on 23 May 2018

Thanks for documenting the question & answer! :)

ax3l on 23 May 2018

I do not see any open question anymore. @ax3l Why have you reopened it?

BenjaminW3 on 23 May 2018

I was unsure whether this is something we want to add to the manual, e.g. in a FAQ section.

ax3l on 23 May 2018

@ax3l @psychocoderHPC
I am still thinking about renaming "work division" (workdiv, workDiv) to "subdivision" (subdiv, subDiv) and moving it into the idx namespace, because I think "index (domain) subdivision" fits better than "work division".
The current alpaka::workdiv namespace could be merged into the alpaka::idx namespace.
This would result in the following renamings:

alpaka::workdiv::getWorkDiv -> alpaka::idx::getSubDiv
alpaka::workdiv::getValidWorkDiv -> alpaka::idx::calcValidSubDiv

BenjaminW3 on 26 May 2018

From a naming perspective that sounds reasonable, but why would you like to merge the namespaces into idx? Could this cause confusion, and might the two be easier to grasp if kept separate?

ax3l on 26 May 2018
