Kokkos: What is the portable view layout for nested parallel loops?

Created on 14 Mar 2019 · 3 Comments · Source: kokkos/kokkos

Say I have a View<int[R][X][Y][Z],E> a that maps x, y, z grid indices to elements of another view, View<double[R][I][3]> b. I have a team policy loop in which the threads iterate over the R dimension of a, and each vector lane runs its own loop over some range of X, Y and Z (for example, the 27 neighbors of a given grid point), reads the index stored there in a, and then uses it to access b:

...
using gridPoint = Kokkos::Array<int,3>;
// Threads of the team iterate over the R dimension.
parallel_for(TeamThreadRange(team,R),[&](const int& iR) {
  // Each vector lane works on its own grid point.
  parallel_for(ThreadVectorRange(team,T),[&](const int& iT) {
     const gridPoint iTGrid = assignGridPoint(iT); // lane-specific grid point assignment
     forNeighbors(iTGrid, [&](const gridPoint& nGrid) {
        const auto& id = a(iR,nGrid[0],nGrid[1],nGrid[2]); // index into the second dimension of b
        for(int d = 0; d < 3; ++d) { doSomething(b(iR,id,d)); }
     });
  });
});
...

I have tested LayoutRight, LayoutLeft and the default layout and compared them on CPU and GPU, but I'm not able to get consistent results that convince me any one of them is portable. Is there any suggestion on how to lay out the views or rearrange the dimensions to get portable performance for both a and b?
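
To make the comparison concrete, here is a minimal plain C++ sketch (it does not use Kokkos) of how a rank-4 index such as a(iR,x,y,z) maps to a linear memory offset under the two layouts. The names Idx, offsetRight and offsetLeft are hypothetical helpers for illustration, and the sketch assumes densely packed storage with no padding, which an actual Kokkos View may not guarantee.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper type: four extents or four indices of a rank-4 array.
using Idx = std::array<std::size_t, 4>;

// LayoutRight-style mapping: the rightmost index has stride 1 (row-major).
// Assumes no padding between rows.
inline std::size_t offsetRight(const Idx& extents, const Idx& i) {
  std::size_t off = 0;
  for (std::size_t d = 0; d < 4; ++d) {
    off = off * extents[d] + i[d];
  }
  return off;
}

// LayoutLeft-style mapping: the leftmost index has stride 1 (column-major).
// Assumes no padding between columns.
inline std::size_t offsetLeft(const Idx& extents, const Idx& i) {
  std::size_t off = 0;
  for (std::size_t d = 4; d-- > 0;) {
    off = off * extents[d] + i[d];
  }
  return off;
}
```

Under these assumptions, with the right-style mapping two neighbors that differ only in z sit next to each other in memory, while two entries that differ only in iR are X*Y*Z elements apart. With the left-style mapping it is the other way around: consecutive iR values are adjacent, and a step in z jumps R*X*Y elements. Which of the two patterns the hardware prefers depends on whether adjacent threads or adjacent vector lanes perform the accesses at the same time.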

