The idea is to avoid spawning OpenMP parallel regions when alpaka is executed inside a region that is already parallelized.
One current solution on the user side could be to let the outer threads sleep and to use nested parallelism (I do not have the details at the moment). However, the DLR team observed a heavy performance impact with that approach, so it is not a viable solution.
Another idea would be to reuse the existing parallel region instead of opening another one.
For this, alpaka first has to check whether it is already inside a parallel region by calling omp_in_parallel().
The following patch is just a proof of concept from the DLR dev team.
--- a/thirdparty/alpaka-0.3.0/include/alpaka/exec/ExecCpuOmp2Blocks.hpp
+++ b/thirdparty/alpaka-0.3.0/include/alpaka/exec/ExecCpuOmp2Blocks.hpp
@@ -154,8 +154,28 @@ namespace alpaka
// Execute the blocks in parallel.
// NOTE: Setting num_threads(number_of_cores) instead of the default thread number does not improve performance.
- #pragma omp parallel
+ int const inParallel(omp_in_parallel());
+ if (inParallel)
{
+ parallel(boundKernelFnObj, blockSharedMemDynSizeBytes, numBlocksInGrid, gridBlockExtent);
+ }
+ else
+ {
+ #pragma omp parallel
+ parallel(boundKernelFnObj, blockSharedMemDynSizeBytes, numBlocksInGrid, gridBlockExtent);
+ }
+
+ // Reset the dynamic thread number setting.
+ ::omp_set_dynamic(ompIsDynamic);
+ }
+
+ TKernelFnObj m_kernelFnObj;
+ std::tuple<TArgs...> m_args;
+private:
+ template<typename FnT, typename BsT, typename GBT>
+ ALPAKA_FN_HOST auto parallel(const FnT& boundKernelFnObj, const BsT& blockSharedMemDynSizeBytes, const TSize& numBlocksInGrid, const GBT& gridBlockExtent) const
+ -> void
+ {
#if ALPAKA_DEBUG >= ALPAKA_DEBUG_MINIMAL
// The first thread does some debug logging.
if(::omp_get_thread_num() == 0)
@@ -194,14 +214,7 @@ namespace alpaka
// After a block has been processed, the shared memory has to be deleted.
block::shared::st::freeMem(acc);
}
- }
-
- // Reset the dynamic thread number setting.
- ::omp_set_dynamic(ompIsDynamic);
}
-
- TKernelFnObj m_kernelFnObj;
- std::tuple<TArgs...> m_args;
};
}
The changes have been shown to be as performant as the baseline, and they reuse the existing threads. If I understood correctly, baseline means code that runs standalone using unchanged alpaka. With the changed alpaka, that code runs with the same performance within a multi-threaded context.
omp_in_parallel() could help with automatic detection, but maybe the interface should also offer a compile-time option?
From my side this looks good and we can most probably integrate this.
We have to think about other accelerators. This is only a special case for the OpenMP 2 block accelerator where the number of threads is unrestricted. This would not work for the OpenMP 2 thread accelerator because there the number of OpenMP threads has to match the block size. I have no problem with integrating this special case here as long as we create a test in CI for this.
We might want to add assert(!omp_in_parallel()) to the other OpenMP accelerators for now.
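A minimal sketch of such a guard, assuming a backend whose OpenMP team size has to match the block size. The function name launchWithFixedTeam and its return value are hypothetical and only serve the illustration; they are not part of alpaka. The serial fallbacks are there so the sketch also builds without OpenMP support.

```cpp
#include <cassert>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP support.
inline int omp_in_parallel() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical launcher for a backend that needs one OpenMP thread per block thread.
// Returns the team size that was actually granted.
inline int launchWithFixedTeam(int threadsPerBlock)
{
    // Reusing an outer team is not possible here, so reject nested calls for now.
    assert(!omp_in_parallel() && "must not be called from inside a parallel region");

    int teamSize = 1;
    #pragma omp parallel num_threads(threadsPerBlock)
    {
        // One thread records the team size; single ends with an implicit barrier.
        #pragma omp single
        teamSize = omp_get_num_threads();
    }
    return teamSize;
}
```

The assertion only fires in debug builds, which matches the "for now" intent: it documents the unsupported case without adding cost to release builds.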
I don't think the following is a real issue, but here is a potential difference in behaviour due to the proposed change.
Suppose a user is sure the OpenMP backend is used and calls OpenMP routines (e.g., omp_set_num_threads() or omp_set_dynamic()) to control the environment before a kernel call. Originally these calls affected how the kernel was executed; in the new version they do not, because no parallel region is created.
Once again, such hypothetical usage is bad practice to begin with and probably not that important anyway, but it shows that the change is not completely equivalent.
The omp_set_dynamic and omp_set_num_threads calls can probably be removed from the OpenMP 2 block accelerator completely or at least moved into the if (!inParallel).
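A rough sketch of that arrangement, reduced to a standalone function. The names runBlocks and processBlocks are hypothetical stand-ins for the executor and its per-block loop, and the sketch assumes that, in the reuse case, every thread of the outer team calls runBlocks, since the orphaned work-sharing loop binds to that team.

```cpp
#include <cassert>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP support.
inline int omp_in_parallel() { return 0; }
inline int omp_get_dynamic() { return 0; }
inline void omp_set_dynamic(int) {}
#endif

// Hypothetical per-block loop. The orphaned "omp for" binds to the innermost
// enclosing parallel region, whichever function opened it.
template<typename TBlockFn>
void processBlocks(int numBlocks, TBlockFn const& blockFn)
{
    #pragma omp for
    for(int b = 0; b < numBlocks; ++b)
    {
        blockFn(b);
    }
}

// Hypothetical executor entry point.
template<typename TBlockFn>
void runBlocks(int numBlocks, TBlockFn const& blockFn)
{
    assert(numBlocks >= 0);
    if(omp_in_parallel())
    {
        // Reuse the caller's team and leave its OpenMP settings untouched.
        processBlocks(numBlocks, blockFn);
    }
    else
    {
        // Only this branch changes the dynamic thread setting and restores it.
        int const wasDynamic = omp_get_dynamic();
        omp_set_dynamic(0);
        #pragma omp parallel
        {
            processBlocks(numBlocks, blockFn);
        }
        omp_set_dynamic(wasDynamic);
    }
}
```

Called from serial code, runBlocks opens its own region; called by all threads of an existing team, it distributes the blocks over that team. In both cases each block index should be processed exactly once.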
I probably expressed my idea unclearly and, after further consideration, I now think it is actually a serious issue (or maybe I just misunderstand how it all works).
Consider the code fragment given by @tdd11235813. If a kernel is called inside an existing parallel region, no new parallel region is created, which means the thread ids remain those of the outer parallel region. So, e.g., if(::omp_get_thread_num() == 0) will check the thread id from the outer parallel region, which is clearly not the intent. All similar calls inside a kernel (although that would not be advisable anyway) would also produce unexpected results. Btw, ::omp_set_dynamic(ompIsDynamic); would also affect the outer parallel region, but this is easy to fix by putting it inside the correct branch.
Also, in the described case any #pragma omp barrier (e.g. inside alpaka block synchronization) would be a deadlock waiting to happen.
@sbastrakov We are talking only about the OpenMP2 Blocks accelerator (atm). Alpaka supports only synchronization inside a block. However for parallelization over blocks with OpenMP2 only one thread per block is allowed and the alpaka sync threads function does nothing at all as all threads inside the block are implicitly synchronized. :wink:
Just a silly OpenMP-specific corner case for the record:
#pragma omp parallel
{
// ...
#pragma omp sections
{
#pragma omp section
alpaka_stuff_with_openmp();
}
}
I am not sure whether omp_in_parallel() returns true or false here.
In both cases it still might use only one thread, which actually is the expected behavior (the other sections are executed in parallel).
The same holds for #pragma omp single, where the other threads just skip that part.
I am also not sure what happens when a #pragma omp parallel is nested within a #pragma omp section.
The original question was: are there combinations of OpenMP regions where manual control of the OpenMP awareness would be required?
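One way to settle the first open point is to measure it. The following is a small probe, not a statement about alpaka: the struct SectionProbe and the function probeInsideSection are hypothetical names. My reading of the OpenMP specification is that omp_in_parallel() reports true whenever an enclosing parallel region is active (executed by more than one thread), which would include code inside a section, but the sketch only records what the runtime answers.

```cpp
#include <cassert>

#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also builds without OpenMP support.
inline int omp_in_parallel() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical result type: what the runtime reports from inside a section.
struct SectionProbe
{
    bool inParallel;
    int teamSize;
};

inline SectionProbe probeInsideSection()
{
    SectionProbe result{false, 1};
    #pragma omp parallel
    {
        #pragma omp sections
        {
            // There is only one section, so exactly one thread writes the result.
            #pragma omp section
            {
                result.inParallel = (omp_in_parallel() != 0);
                result.teamSize = omp_get_num_threads();
            }
        }
    }
    return result;
}
```

If the probe reports true, an executor that reuses the region on omp_in_parallel() alone would run its work-sharing loop from a single thread of the team, which is the kind of combination where a manual switch could be needed.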
@theZiz thanks for explaining, you are right about my last argument not being valid.
For me it still feels very dangerous to not create a parallel section, which provides kinda embedding with proper thread ids and other stuff, but ofc that requires an actual counter-example and not just a feeling.
You may be right, but by default a parallel region is still created, so our workflows will not be affected anyway. However, the DLR people have a real-world problem with our approach and an easy, fast and working solution. :smile:
Sure, my concern is that it works for their case but not generally and, worse, that it may be unclear under which conditions it definitely works. E.g. with @tdd11235813's edits applied mechanically it would definitely be wrong, as there is a #pragma omp for a little below his code sample which would now (if no additional parallel region is created) distribute work between all threads of the outer parallel region. Of course, this one is easy enough to fix.
Unrelated, but potentially performance-critical: why is the guided schedule used by default? In my experience it is generally horrible for 20+ threads (horrible as in several times worse than static if the loop iterations are light). Edit: the several-times difference might have been on Xeon Phi, so not just with 20 threads.
After an offline discussion with @psychocoderHPC and @theZiz I finally see the point. Sorry for initial misunderstanding.
The above idea has been implemented. Can this be closed now?
Ok, thanks for the ping. It has been reported to the DLR devs; I have not tested it myself. I am closing this for now, and if there are further questions or issues, we can reopen it.