The current implementation has several flaws:
The scheduler should be statically dispatched. That is, alpaka needs several OpenMP for loops (one for each schedule kind and one without a schedule clause), and template metaprogramming then decides which loop is compiled. After all, the schedule kind of a loop is almost always a static property (TBH I can't think of any reasonable special case).
Hello @krzikalla , thank you for the feedback.
We discussed this in a VC with @psychocoderHPC , I will try to make the changes for compile-time dispatching.
What we currently have in mind will be based on https://github.com/alpaka-group/alpaka/blob/72785f6f413570d5179f9ac505b615d25f123cf4/example/openMPSchedule/src/openMPSchedule.cpp#L55-L61
You need to set a constexpr member for the kernel. The schedule will then be used to select the correct OMP for loop implementation at compile time.
All other interfaces should still follow the current implementation; this means you cannot change the scheduling strategy within a parallel region. alpaka cannot influence the behavior within the parallel region because this is a restriction imposed by the standard.
Upon reading, I think that with an explicitly set schedule and compile-time dispatch as suggested, it's actually no problem to set it inside the parallel region as well. The problem only arises with the runtime schedule (which is currently always used): in that case nothing can be done inside the parallel region.
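To make the compile-time dispatch idea concrete, here is a minimal sketch, not alpaka code. The names ScheduleKind, KindTag, runLoop, launch and the member ompScheduleKind are assumptions made up for illustration. The kernel exposes its schedule kind as a constexpr member, and the launcher turns that value into a tag type so that only the matching OpenMP loop gets compiled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical enum; alpaka's actual schedule type may look different.
enum class ScheduleKind { NoSchedule, Static, Dynamic, Guided };

// Tag type that turns a schedule kind into a distinct type for overload resolution.
template<ScheduleKind TKind>
struct KindTag {};

// One loop per schedule kind. Only the overload selected by the tag is instantiated.
template<typename TFn>
void runLoop(KindTag<ScheduleKind::NoSchedule>, int n, TFn&& fn)
{
#pragma omp parallel for
    for(int i = 0; i < n; ++i)
        fn(i);
}

template<typename TFn>
void runLoop(KindTag<ScheduleKind::Static>, int n, TFn&& fn)
{
#pragma omp parallel for schedule(static)
    for(int i = 0; i < n; ++i)
        fn(i);
}

template<typename TFn>
void runLoop(KindTag<ScheduleKind::Dynamic>, int n, TFn&& fn)
{
#pragma omp parallel for schedule(dynamic)
    for(int i = 0; i < n; ++i)
        fn(i);
}

template<typename TFn>
void runLoop(KindTag<ScheduleKind::Guided>, int n, TFn&& fn)
{
#pragma omp parallel for schedule(guided)
    for(int i = 0; i < n; ++i)
        fn(i);
}

// Example kernel (hypothetical): the schedule kind is a compile-time property.
struct SquareKernel
{
    static constexpr ScheduleKind ompScheduleKind = ScheduleKind::Dynamic;

    void operator()(int i, std::vector<int>& out) const
    {
        out[static_cast<std::size_t>(i)] = i * i;
    }
};

// The launcher reads the constexpr member and picks the loop at compile time.
template<typename TKernel>
std::vector<int> launch(TKernel const& kernel, int n)
{
    std::vector<int> out(static_cast<std::size_t>(n), 0);
    runLoop(KindTag<TKernel::ompScheduleKind>{}, n, [&](int i) { kernel(i, out); });
    return out;
}
```

Calling launch(SquareKernel{}, 8) fills the vector with the squares of 0..7 using the dynamic loop. Without -fopenmp the pragmas are ignored and the loops simply run sequentially.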
I wrote a small dummy implementation.
#run with scheduler defined by the kernel
g++ -fopenmp main.cpp -DRUNTIME=0
./a.out
call schedule static,2
Hello World... from thread = 0
Hello World... from thread = 1
Hello World... from thread = 2
Hello World... from thread = 3
Hello World... from thread = 4
#run with runtime scheduling
g++ -fopenmp main.cpp -DRUNTIME=1
./a.out
call schedule runtime,
Hello World... from thread = 0
Hello World... from thread = 2
Hello World... from thread = 1
Hello World... from thread = 3
Hello World... from thread = 4
[updated main.txt with an example for wrapping kernel to specialize the scheduling policy]
I updated my example above and added a mini wrapper to wrap kernel or lambda functions.
Note: if we provide such a wrapper, the trait for dynamic shared memory cannot be used, because the user kernel signature is wrapped inside a helper.
I am unsure, if the alpaka::omp::Schedule struct will serve all possible needs. The problem here is, that schedule.kind is always a constexpr, while schedule.chunk_size may or may not be a constexpr. For those kinds, which support chunk sizes, three different compile-time versions are needed: one with the default chunk size (in contrast to the omp_set_schedule call there is no magic number for the default chunk size in the schedule declarator), one with a compile-time known constant (for optimization purposes esp. for static schedules) and one with a run-time variable. Therefore I think you have to separate schedule.kind and chunk size.
There is a magic number for default chunk size, 0. I mean in both OpenMP standard and alpaka::omp::Schedule
Ah, my message was only right for how it's done currently. Not for the hard-coded proposed way
Thanks for the clarifications @krzikalla . I am not sure which of the cases you described forces splitting the struct into separate variables.
However, maybe that's because I am not sure I understand the compile-time chunk size case well. I think the best alpaka can do (without relying on macros) is to provide the chunk size as a constexpr variable, so as to have a pattern like #pragma omp for schedule(hard_coded, constexpr_variable). Is that what you mean?
I guess, you need something like this:
struct DefaultChunkSize {};
template<int cs> struct ConstexprChunkSize { static constexpr int chunkSize = cs; };
struct VariableChunkSize { int chunkSize; };
template<ScheduleKind, class ChunkSizeTag = DefaultChunkSize>
struct Schedule : ChunkSizeTag {};
and then starting at TaskKernelCpuOmp2Blocks.hpp:198 you have to statically dispatch to functions like this:
void executeLoop(Schedule<Static, DefaultChunkSize>)
{
# pragma omp for nowait schedule(static)
for(TIdx i = 0; i < numBlocksInGrid; ++i)
{
auto const index = Vec<DimInt<1u>, TIdx>(i); // for issue #840
acc.m_gridBlockIdx = mapIdx<TDim::value>(index, gridBlockExtent);
boundKernelFnObj(acc);
freeSharedVars(acc);
}
}
template<int cs>
void executeLoop(Schedule<Static, ConstexprChunkSize<cs>>)
{
# pragma omp for nowait schedule(static, cs)
for(TIdx i = 0; i < numBlocksInGrid; ++i)
{
auto const index = Vec<DimInt<1u>, TIdx>(i); // for issue #840
acc.m_gridBlockIdx = mapIdx<TDim::value>(index, gridBlockExtent);
boundKernelFnObj(acc);
freeSharedVars(acc);
}
}
And so on, one function for each combination (should be 11: 3 * { static, dynamic, guided } + { auto, runtime }).
Now you only need to retrieve the schedule type at TaskKernelCpuOmp2Blocks.hpp:198.
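Retrieving the schedule type could be done with a small trait. The following is only a sketch under assumptions: the nested typedef name OmpSchedule, the trait GetSchedule and the NoSchedule enumerator are hypothetical and not part of alpaka. The trait falls back to a default schedule when the kernel does not declare one.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical schedule description, following the tag-based design above.
enum class ScheduleKind { NoSchedule, Static, Dynamic, Guided, Auto, Runtime };

struct DefaultChunkSize {};
template<int cs> struct ConstexprChunkSize { static constexpr int chunkSize = cs; };
struct VariableChunkSize { int chunkSize; };

template<ScheduleKind TKind, class ChunkSizeTag = DefaultChunkSize>
struct Schedule : ChunkSizeTag {};

// C++14-compatible void_t replacement, used for member detection.
template<typename... Ts> struct MakeVoid { using type = void; };
template<typename... Ts> using VoidT = typename MakeVoid<Ts...>::type;

// Primary template: the kernel declares no schedule, so no schedule clause is requested.
template<typename TKernel, typename = void>
struct GetSchedule
{
    using type = Schedule<ScheduleKind::NoSchedule>;
};

// Specialization: chosen when the kernel has a nested OmpSchedule type (assumed name).
template<typename TKernel>
struct GetSchedule<TKernel, VoidT<typename TKernel::OmpSchedule>>
{
    using type = typename TKernel::OmpSchedule;
};

// Two example kernels (hypothetical).
struct PlainKernel {};

struct TunedKernel
{
    using OmpSchedule = Schedule<ScheduleKind::Static, ConstexprChunkSize<4>>;
};
```

An object of type GetSchedule<TKernel>::type can then be passed to the matching executeLoop overload, so the choice of loop is made by ordinary overload resolution.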
I updated my example code to support compile-time scheduler selection with a compile-time chunk size. If required, a kernel member dynamicChunkSize (do not slap me for the name, it is only an example) can be declared and used to change the chunk size at runtime.
This way we would be compatible with 0.6.0, except that the member Schedule in a kernel should be constexpr.
In general, the compiler should be able to fully optimize.
struct MyKernel
{
};
struct MyKernel
{
static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5};
};
Schedule is compile-time and chunk size runtime (by default set to value given at compile time)
struct MyKernel
{
static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5};
int dynamicChunkSize = ompSchedule.chunkSize;
};
auto kernel = MyKernel{};
kernel.dynamicChunkSize = 22;
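A rough sketch of how a launcher might pick the chunk size for these kernels is shown below. It is not the attached prototype: the Schedule struct layout and the helpers HasDynamicChunkSize and effectiveChunkSize are assumptions for illustration only. The idea is to use dynamicChunkSize when the kernel declares it and otherwise fall back to the compile-time value in ompSchedule.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Assumed layout of the schedule struct used in the kernels above.
struct Schedule
{
    enum Kind { NoSchedule, Static, Dynamic, Guided };
    Kind kind;
    int chunkSize;
};

// C++14-compatible void_t replacement, used for member detection.
template<typename... Ts> struct MakeVoid { using type = void; };
template<typename... Ts> using VoidT = typename MakeVoid<Ts...>::type;

// Detects whether a kernel has a member named dynamicChunkSize.
template<typename TKernel, typename = void>
struct HasDynamicChunkSize : std::false_type {};

template<typename TKernel>
struct HasDynamicChunkSize<TKernel, VoidT<decltype(std::declval<TKernel const&>().dynamicChunkSize)>>
    : std::true_type {};

// Runtime chunk size: read the value stored in the kernel instance.
template<typename TKernel>
int chunkSizeImpl(TKernel const& kernel, std::true_type)
{
    return kernel.dynamicChunkSize;
}

// Compile-time chunk size: taken from the constexpr schedule member.
template<typename TKernel>
int chunkSizeImpl(TKernel const&, std::false_type)
{
    constexpr int cs = TKernel::ompSchedule.chunkSize;
    return cs;
}

template<typename TKernel>
int effectiveChunkSize(TKernel const& kernel)
{
    return chunkSizeImpl(kernel, HasDynamicChunkSize<TKernel>{});
}

// Example kernels mirroring the ones above (names are hypothetical).
struct FixedKernel
{
    static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5};
};

struct TunableKernel
{
    static constexpr Schedule ompSchedule = Schedule{Schedule::Static, 5};
    int dynamicChunkSize = ompSchedule.chunkSize;
};
```

With this, a TunableKernel instance starts at chunk size 5 and reports 22 after kernel.dynamicChunkSize = 22, while FixedKernel always yields 5.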
I wrote the prototype as a base for discussions next week:
main.txt
What if we separate the schedule from the chunk size, so we can make them constexpr independently?
struct MyKernel1
{
static constexpr ScheduleKind ompScheduleKind = ScheduleKind::Static; // compile-time
int chunkSize = N; // dynamic
};
struct MyKernel2
{
static constexpr ScheduleKind ompScheduleKind = ScheduleKind::Static; // compile-time
static constexpr int chunkSize = 10; // compile-time
};
struct MyKernel3
{
ScheduleKind ompScheduleKind; // runtime
int chunkSize; // runtime
};
struct MyKernel4
{
static constexpr ScheduleKind ompScheduleKind = ScheduleKind::Dynamic; // compile-time
// no chunkSize
};
When we invoke the kernel, we can test whether the members are constexpr and perform compile-time dispatch on them. If they are runtime, we can either dispatch at runtime, or just pass the values on to OpenMP.
Here is a C++14 snippet that can detect whether the chunkSize is constexpr or not: https://godbolt.org/z/9T87fd
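For readers who cannot open the link, one possible way to do such a check in C++14 is sketched here. This is not the linked snippet, and all helper names are made up. It assumes the kernel has a member called chunkSize: a non-static member is treated as a runtime value, while a static member counts as compile-time if it can be used as a template argument.

```cpp
#include <cassert>
#include <type_traits>

// Chosen when T::chunkSize is usable as a template argument, i.e. is a constant expression.
template<typename T>
constexpr auto staticChunkIsConstexpr(int)
    -> decltype(std::integral_constant<int, T::chunkSize>{}, true)
{
    return true;
}

// Fallback when the substitution above fails.
template<typename T>
constexpr bool staticChunkIsConstexpr(long)
{
    return false;
}

// Non-static data member: the value lives in the kernel instance, so it is a runtime value.
template<typename T>
constexpr bool hasConstexprChunkSizeImpl(std::true_type)
{
    return false;
}

// Static data member: check whether it can be used in a constant expression.
template<typename T>
constexpr bool hasConstexprChunkSizeImpl(std::false_type)
{
    return staticChunkIsConstexpr<T>(0);
}

// Entry point. Assumes T has a member named chunkSize;
// kernels without one would need an extra detection step.
template<typename T>
constexpr bool hasConstexprChunkSize()
{
    return hasConstexprChunkSizeImpl<T>(
        std::is_member_object_pointer<decltype(&T::chunkSize)>{});
}

// Example kernels (hypothetical).
struct RuntimeChunkKernel
{
    int chunkSize = 4; // runtime, per instance
};

struct ConstexprChunkKernel
{
    static constexpr int chunkSize = 10; // compile-time
};
```

The result is itself a constant expression, so it can feed a tag dispatch that selects either the loop with a hard-coded chunk size or the one that reads the chunk size from the kernel instance.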
int chunkSize = N; // dynamic
should be static, just no constexpr?
It can be both, constexpr and non-constexpr.
How do I enqueue the same kernel type from 2 threads at the same time with 2 different chunk sizes?
If the chunk size is not a compile-time constant, then each kernel can be an instance on which you set the chunk size independently.
Yes, but indeed there is no need for static if it is not constexpr. I was mistakenly thinking of the code used internally to check it, but such code does not have to rely on static.