Kokkos: Feature Request: TeamVectorRange

Created on 29 Mar 2017 · 5 Comments · Source: kokkos/kokkos

There are several instances in which I would like to flatten the TeamThread and ThreadVector levels of parallelism. To keep that portable, I currently do:

parallel_for (TeamThreadRange(team, N), [&] (const int i)
{
  single (PerThread(team), [&] ()
  {
    // loop body
    a(i) = b(i) + x*(c(i) + x*d(i));
  });
});

That is portable, but not necessarily performant. The optimization report shows that the compiler assumes a dependence: remark #15346: vector dependence: assumed FLOW dependence between a.m_map.m_handle[i] (1242:8) and b.m_map.m_handle[i] (1242:8).

However, adding #pragma ivdep by hand yields a vectorized loop:

#pragma ivdep
parallel_for (TeamThreadRange(team, N), [&] (const int i)
{
  single (PerThread(team), [&] ()
  {
    // loop body
    a(i) = b(i) + x*(c(i) + x*d(i));
  });
});

From a conversation with @crtrott, this is something that has been discussed before, and after looking at the optimization report, it's clear I could greatly benefit from this capability.

Is there already a workaround commonly used to fix this, or is a TeamVectorRange what is needed here?

Feature Request InDevelop

dholladay00

Most helpful comment

I too would like to have this feature. I can think of plenty of instances where three-level hierarchical parallelism is not well matched to a variable problem size, yet I would still like IVDEP to be inserted automatically rather than by hand.

womeld on 18 Mar 2019

👍3

All 5 comments

Your workaround wouldn't actually give you what you want for GPUs. And yes, a TeamVectorRange is the right fix.

crtrott on 30 Mar 2017

By the way, this shouldn't be too hard to implement: on the host side we just add the pragmas to the loops (i.e. otherwise it's the same as the TeamThreadRange), and on Cuda we do some index calculations to split the loop over both team and vector indices.

crtrott on 30 Mar 2017
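
The index calculation described above can be illustrated with a minimal plain C++ sketch. It is not Kokkos code and not necessarily how Kokkos implements it: the helper name flattened_visit_counts and the mapping (start at thread * vector_length + lane, stride by team_size * vector_length) are assumptions chosen for illustration. The sketch simulates every (thread, lane) pair on the host and counts how often each index of a flat range is visited.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the Kokkos API): counts how many times
// each index in [0, n) is visited when one flat range is split over
// team_size threads with vector_length lanes each.
std::vector<int> flattened_visit_counts(std::size_t n, std::size_t team_size,
                                        std::size_t vector_length) {
  assert(team_size > 0 && vector_length > 0);
  std::vector<int> visits(n, 0);
  // Total number of (thread, lane) workers sharing the range.
  const std::size_t stride = team_size * vector_length;
  // On a GPU the two outer loops would not exist: each (thread, lane) pair
  // would be its own hardware thread running only the innermost loop.
  for (std::size_t thread = 0; thread < team_size; ++thread) {
    for (std::size_t lane = 0; lane < vector_length; ++lane) {
      // Assumed mapping: a unique flat id per worker, then a strided walk.
      const std::size_t first = thread * vector_length + lane;
      for (std::size_t i = first; i < n; i += stride) {
        ++visits[i];  // stands in for the loop body
      }
    }
  }
  return visits;
}
```

With this mapping every index is visited exactly once regardless of how n compares with team_size * vector_length, which is the property a flattened range needs. On the host side, per the comment above, no such splitting is required: the range behaves like a TeamThreadRange with vectorization pragmas added to the loop.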

Yep, I had thought of the workaround in which I had a TeamThreadRange of 1 and ThreadVectorRange of N, which would work for many CPU cases where team size of 1 works well, but would probably tank on GPUs.

dholladay00 on 30 Mar 2017

I would also like to have this TeamVectorRange feature.