Alpaka: Implementation of user-defined scheduling for OpenMP should be changed

Created on 10 Feb 2021 · 17 Comments · Source: alpaka-group/alpaka

The current implementation has several flaws:

It doesn't work if you are already inside a forked region. E.g. QueueCpuOmp2CollectiveImpl won't work as expected. Some OpenMP implementations might work nevertheless, but that behavior is not backed by the standard.
It causes performance issues:
- There is always the need to request the runtime schedule. This requires some loads, tests and branches, which esp. with a lot of small loops is undesirable.
- Especially with static scheduling a lot of compile-time optimization potential is wasted.

The scheduler should be statically dispatched. That is, there have to be several OpenMP for loops in alpaka (one for each schedule kind and one without any schedule clause), and some template magic then decides which loop is compiled. After all, the schedule kind of a loop is almost always a static property (TBH I can't think of any reasonable special case).
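A minimal sketch of the static dispatch described above, not alpaka's actual code: ScheduleKind, KindTag, forEachBlock and parallelFor are hypothetical names chosen for illustration. Each overload carries its own hard-coded schedule clause, and overload resolution selects one at compile time, so no runtime schedule has to be queried.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical schedule kinds; NoSchedule means "emit a plain omp for".
enum class ScheduleKind { NoSchedule, Static, Dynamic, Guided };

template<ScheduleKind K>
using KindTag = std::integral_constant<ScheduleKind, K>;

// One overload per schedule kind, each with its own hard-coded pragma.
// The loops are orphaned worksharing constructs: they bind to the
// parallel region of the caller.
template<typename F>
void forEachBlock(KindTag<ScheduleKind::NoSchedule>, int n, F const& f)
{
#pragma omp for nowait
    for(int i = 0; i < n; ++i)
        f(i);
}

template<typename F>
void forEachBlock(KindTag<ScheduleKind::Static>, int n, F const& f)
{
#pragma omp for nowait schedule(static)
    for(int i = 0; i < n; ++i)
        f(i);
}

template<typename F>
void forEachBlock(KindTag<ScheduleKind::Dynamic>, int n, F const& f)
{
#pragma omp for nowait schedule(dynamic)
    for(int i = 0; i < n; ++i)
        f(i);
}

template<typename F>
void forEachBlock(KindTag<ScheduleKind::Guided>, int n, F const& f)
{
#pragma omp for nowait schedule(guided)
    for(int i = 0; i < n; ++i)
        f(i);
}

// Entry point: forks the parallel region and picks the loop at compile time.
template<ScheduleKind K, typename F>
void parallelFor(int n, F const& f)
{
    assert(n >= 0);
#pragma omp parallel
    {
        forEachBlock(KindTag<K>{}, n, f);
    }
}
```

A call such as `parallelFor<ScheduleKind::Static>(n, f)` would then compile only the static loop. Without OpenMP enabled, the pragmas are ignored and the loops simply run sequentially.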

OpenMP Bug Enhancement

krzikalla

Most helpful comment

What if we separate the schedule from the chunk size, so we can make them constexpr independently?

struct MyKernel1
{
  static constexpr ScheduleKind ompScheduleKind = ScheduleKind::Static; // compile-time
  int chunkSize = N; // dynamic
};
struct MyKernel2
{
  static constexpr ScheduleKind ompScheduleKind = ScheduleKind::Static; // compile-time
  static constexpr int chunkSize = 10; // compile-time
};
struct MyKernel3
{
  ScheduleKind ompScheduleKind; // runtime
  int chunkSize; // runtime
};
struct MyKernel4
{
  static constexpr ScheduleKind ompScheduleKind = ScheduleKind::Dynamic; // compile-time
  // no chunkSize
};

When we invoke the kernel, we can test whether the members are constexpr and perform compile-time dispatch on them. If they are runtime, we can either dispatch at runtime, or just pass the values on to OpenMP.

Here is a C++14 snippet that can detect whether the chunkSize is constexpr or not: https://godbolt.org/z/9T87fd

bernhardmgruber on 16 Feb 2021

👍3

All 17 comments

Hello @krzikalla, thank you for the feedback.

We discussed this in a VC with @psychocoderHPC; I will try to make the changes for compile-time dispatching.

sbastrakov on 10 Feb 2021

What we currently have in mind will be based on https://github.com/alpaka-group/alpaka/blob/72785f6f413570d5179f9ac505b615d25f123cf4/example/openMPSchedule/src/openMPSchedule.cpp#L55-L61
You need to set a constexpr member for the kernel. The schedule will then be used to select the correct OMP for loop implementation at compile time.

All other interfaces should still follow the current implementation; this means you cannot change the scheduling strategy within a parallel region. The behavior within the parallel region cannot be influenced by alpaka because this is a restriction imposed by the standard.

psychocoderHPC on 10 Feb 2021

Upon reading, I think that with an explicitly set schedule and compile-time dispatch as suggested, it's actually no problem to set it inside the parallel region as well. The problem is the runtime schedule (as is currently always used): in that case nothing can be done inside the parallel region.
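To illustrate the distinction, here is a small sketch with hypothetical function names (runWithRuntimeSchedule, loopWithStaticSchedule); it is not taken from alpaka. The runtime variant sets the schedule with omp_set_schedule before forking, as the discussion above requires, while the hard-coded variant is an orphaned worksharing loop that needs no such setup and can be called from inside an existing parallel region.

```cpp
#include <cassert>
#ifdef _OPENMP
#    include <omp.h>
#endif

// Runtime variant: the loop uses schedule(runtime), so the schedule is
// configured before the parallel region is forked.
template<typename F>
void runWithRuntimeSchedule(int n, int chunk, F const& f)
{
    assert(chunk > 0);
#ifdef _OPENMP
    omp_set_schedule(omp_sched_static, chunk); // set outside the parallel region
#endif
#pragma omp parallel
    {
#pragma omp for nowait schedule(runtime)
        for(int i = 0; i < n; ++i)
            f(i);
    }
}

// Hard-coded variant: the schedule is part of the pragma itself, so this
// orphaned loop can be called from within an already forked region.
template<typename F>
void loopWithStaticSchedule(int n, F const& f)
{
#pragma omp for nowait schedule(static, 2)
    for(int i = 0; i < n; ++i)
        f(i);
}
```

When loopWithStaticSchedule is called outside of any parallel region, the loop is executed by the single calling thread.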

sbastrakov on 10 Feb 2021

I wrote a small dummy implementation.

#run with scheduler defined by the kernel
g++ -fopenmp main.cpp -DRUNTIME=0
./a.out
call schedule static,2
Hello World... from thread = 0
Hello World... from thread = 1
Hello World... from thread = 2
Hello World... from thread = 3
Hello World... from thread = 4

#run with runtime scheduling
g++ -fopenmp main.cpp -DRUNTIME=1
./a.out
call schedule runtime,
Hello World... from thread = 0
Hello World... from thread = 2
Hello World... from thread = 1
Hello World... from thread = 3
Hello World... from thread = 4

main.txt

[updated main.txt with an example for wrapping kernel to specialize the scheduling policy]

psychocoderHPC on 10 Feb 2021

I updated my example above and added a mini wrapper to wrap kernels or lambda functions.
Note that if we provide such a wrapper, the trait for dynamic shared memory cannot be used, because the user kernel signature is wrapped inside a helper.

psychocoderHPC on 10 Feb 2021

I am unsure whether the alpaka::omp::Schedule struct will serve all possible needs. The problem here is that schedule.kind is always constexpr, while schedule.chunk_size may or may not be constexpr. For those kinds that support chunk sizes, three different compile-time versions are needed: one with the default chunk size (in contrast to the omp_set_schedule call, there is no magic number for the default chunk size in the schedule clause), one with a compile-time constant (for optimization purposes, esp. for static schedules), and one with a run-time variable. Therefore I think you have to separate schedule.kind and chunk size.

krzikalla on 11 Feb 2021

~~There is a magic number for default chunk size, 0. I mean in both OpenMP standard and alpaka::omp::Schedule~~

sbastrakov on 11 Feb 2021

Ah, my message was only right for how it's done currently, not for the proposed hard-coded way.

sbastrakov on 11 Feb 2021

Thanks for the clarifications, @krzikalla. I am not sure which of the cases you described forces splitting the struct into separate variables.

However, maybe that's because I am not sure I understand the compile-time chunk size case well. I think the best alpaka can do (without relying on macros) is to provide the chunk size as a constexpr variable, giving a pattern like #pragma omp for schedule(hard_coded, constexpr_variable). Is that what you mean?

sbastrakov on 11 Feb 2021

I guess, you need something like this:

struct DefaultChunkSize {};
template<int cs> struct ConstexprChunkSize { static constexpr int chunkSize = cs; };
struct VariableChunkSize { int chunkSize; };

template<ScheduleKind, class ChunkSizeTag = DefaultChunkSize>
struct Schedule : ChunkSizeTag {};

and then starting at TaskKernelCpuOmp2Blocks.hpp:198 you have to statically dispatch to functions like this:

void executeLoop(Schedule<Static, DefaultChunkSize>)
{
#        pragma omp for nowait schedule(static)
            for(TIdx i = 0; i < numBlocksInGrid; ++i)
            {
                auto const index = Vec<DimInt<1u>, TIdx>(i); // for issue #840
                acc.m_gridBlockIdx = mapIdx<TDim::value>(index, gridBlockExtent);
                boundKernelFnObj(acc);
                freeSharedVars(acc);
            }
}

template<int cs>
void executeLoop(Schedule<Static, ConstexprChunkSize<cs>>)
{
#        pragma omp for nowait schedule(static, cs)
            for(TIdx i = 0; i < numBlocksInGrid; ++i)
            {
                auto const index = Vec<DimInt<1u>, TIdx>(i); // for issue #840
                acc.m_gridBlockIdx = mapIdx<TDim::value>(index, gridBlockExtent);
                boundKernelFnObj(acc);
                freeSharedVars(acc);
            }
}

And so on, one function for each combination (should be 11: 3 * { static, dynamic, guided } + { auto, runtime }).
Now you only need to retrieve the schedule type at TaskKernelCpuOmp2Blocks.hpp:198.
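Retrieving the schedule type could be done with a small trait. The following is only a sketch under assumptions: the nested type name OmpSchedule, the GetSchedule trait, the example kernels, and the choice of Runtime as the fallback are all hypothetical and not part of alpaka.

```cpp
#include <type_traits>

enum class ScheduleKind { Static, Dynamic, Guided, Auto, Runtime };

struct DefaultChunkSize
{
};
template<int cs>
struct ConstexprChunkSize
{
    static constexpr int chunkSize = cs;
};

template<ScheduleKind, class ChunkSizeTag = DefaultChunkSize>
struct Schedule : ChunkSizeTag
{
};

// C++14-compatible helper that maps any valid type list to void.
template<typename...>
struct MakeVoid
{
    using type = void;
};

// Fallback (assumption): kernels without a nested OmpSchedule type keep
// the runtime schedule.
template<typename TKernel, typename = void>
struct GetSchedule
{
    using type = Schedule<ScheduleKind::Runtime>;
};

// Kernels that declare a nested OmpSchedule type provide their own schedule.
template<typename TKernel>
struct GetSchedule<TKernel, typename MakeVoid<typename TKernel::OmpSchedule>::type>
{
    using type = typename TKernel::OmpSchedule;
};

// Hypothetical kernels for illustration.
struct PlainKernel
{
};
struct TunedKernel
{
    using OmpSchedule = Schedule<ScheduleKind::Static, ConstexprChunkSize<4>>;
};

static_assert(
    std::is_same<GetSchedule<PlainKernel>::type, Schedule<ScheduleKind::Runtime>>::value,
    "fallback schedule");
static_assert(
    std::is_same<GetSchedule<TunedKernel>::type, Schedule<ScheduleKind::Static, ConstexprChunkSize<4>>>::value,
    "kernel-defined schedule");
```

At the dispatch point, an object of type `typename GetSchedule<TKernel>::type` could then be passed to the executeLoop overloads shown above.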

krzikalla on 11 Feb 2021

👍1

I updated my example code to support compile-time scheduler selection with a compile-time chunk size. If required, a kernel member dynamicChunkSize (do not slap me for the name, it is only an example) can be declared and used to change the chunk size at runtime.

This way we would be compatible with 0.6.0, except that the member Schedule in a kernel should be constexpr.
In general, the compiler should be able to fully optimize.

Schedule must be set via an environment variable or before the parallel region:
struct MyKernel { };

Schedule and chunk size are selected at compile time:
struct MyKernel { static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5}; };

Schedule is compile-time and chunk size is runtime (by default set to the value given at compile time):

struct MyKernel
{
  static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5};
  int dynamicChunkSize = ompSchedule.chunkSize;
};

auto kernel = MyKernel{};
kernel.dynamicChunkSize = 22;

I wrote the prototype as a base for discussions next week:
main.txt
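Since the attached prototype is not reproduced here, the following is an independent sketch of how the optional dynamicChunkSize member could be picked up; it is not the content of main.txt. The Schedule struct layout, HasDynamicChunkSize, effectiveChunkSize and the two example kernels are assumptions. The sketch targets C++17, where static constexpr data members are implicitly inline.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical literal type mirroring the Schedule{kind, chunkSize} usage above.
struct Schedule
{
    enum Kind { NoSchedule, Static, Dynamic, Guided };
    Kind kind;
    int chunkSize;
};

template<typename...>
struct MakeVoid
{
    using type = void;
};

// Detects whether a kernel declares a dynamicChunkSize member.
template<typename T, typename = void>
struct HasDynamicChunkSize : std::false_type
{
};
template<typename T>
struct HasDynamicChunkSize<
    T,
    typename MakeVoid<decltype(std::declval<T const&>().dynamicChunkSize)>::type> : std::true_type
{
};

// Runtime override present: read it from the kernel instance.
template<typename TKernel>
int chunkSizeImpl(TKernel const& kernel, std::true_type)
{
    return kernel.dynamicChunkSize;
}

// No override: fall back to the compile-time chunk size.
template<typename TKernel>
int chunkSizeImpl(TKernel const&, std::false_type)
{
    return TKernel::ompSchedule.chunkSize;
}

template<typename TKernel>
int effectiveChunkSize(TKernel const& kernel)
{
    int const chunk = chunkSizeImpl(kernel, HasDynamicChunkSize<TKernel>{});
    assert(chunk >= 0);
    return chunk;
}

struct FixedKernel
{
    static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5};
};
struct TunableKernel
{
    static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5};
    int dynamicChunkSize = ompSchedule.chunkSize;
};
```

With this, the schedule kind stays a compile-time property of the kernel type, while two TunableKernel instances can carry different chunk sizes.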

psychocoderHPC on 11 Feb 2021

👍1

> int chunkSize = N; // dynamic

should be static, just no constexpr?

sbastrakov on 16 Feb 2021

> int chunkSize = N; // dynamic
> should be static, just no constexpr?

It can be both, constexpr and non-constexpr.

psychocoderHPC on 16 Feb 2021

> int chunkSize = N; // dynamic
> should be static, just no constexpr?

How do I enqueue the same kernel type from 2 threads at the same time with 2 different chunk sizes?

bernhardmgruber on 16 Feb 2021

> > int chunkSize = N; // dynamic
> > should be static, just no constexpr?
>
> How do I enqueue the same kernel type from 2 threads at the same time with 2 different chunk sizes?

If the chunk size is not compile-time, then each kernel can be a separate instance on which you set the chunk size independently.

psychocoderHPC on 16 Feb 2021

Yes, but indeed there is no need for static if it is not constexpr. I was mistakenly thinking of the code used internally to check it, but such code does not have to rely on static.

sbastrakov on 16 Feb 2021
