Say I have a View<int[R][X][Y][Z],E> a that maps x,y,z grid indices to elements of another view, View<double[R][I][3]> b. I have a team policy loop in which the threads iterate over the R dimension of a, and each vector lane runs its own loop over some range of X, Y and Z (for example, the 27 neighbors of a given grid point), reads from a the index into b, and then accesses b:
...
using gridPoint = Kokkos::Array<int,3>;
parallel_for(TeamThreadRange(team,R),[&](const int& iR) {
  parallel_for(ThreadVectorRange(team,T),[&](const int& iT) {
    const gridPoint iTGrid = assignGridPoint(iT); //lane-specific grid points assignment
    forNeighbors(iTGrid, [&](const gridPoint& nGrid) {
      const auto& id = a(iR,nGrid[0],nGrid[1],nGrid[2]);
      for(int d = 0; d < 3; ++d) { doSomething(b(iR,id,d)); }
    });
  });
});
...
I have tested LayoutRight, LayoutLeft and the default layout and compared them on CPU and GPU, but the results are not consistent enough to convince me that any one of them is portable. Is there any suggestion for how to lay out the views or rearrange the dimensions to get portable performance for both a and b?
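For reference, here is a minimal plain C++ sketch (no Kokkos calls) of how the two explicit layouts place b(iR,id,d) in memory, assuming an unpadded allocation. The struct Extents, the helpers offsetRight and offsetLeft, and the extent values 4, 10, 3 are hypothetical names and placeholders for illustration, not part of the code above or of the Kokkos API.

```cpp
#include <cstddef>

// Hypothetical extents of b[R][I][3]; the numbers used below are placeholders.
struct Extents { std::size_t R, I, D; };

// LayoutRight (row-major): the rightmost index d has stride 1.
constexpr std::size_t offsetRight(Extents e, std::size_t iR, std::size_t id, std::size_t d) {
  return (iR * e.I + id) * e.D + d;
}

// LayoutLeft (column-major): the leftmost index iR has stride 1.
constexpr std::size_t offsetLeft(Extents e, std::size_t iR, std::size_t id, std::size_t d) {
  return iR + e.R * (id + e.I * d);
}

constexpr Extents ext{4, 10, 3};

// Stepping d (the innermost loop in the question) is contiguous only for LayoutRight.
static_assert(offsetRight(ext, 1, 5, 1) - offsetRight(ext, 1, 5, 0) == 1,
              "LayoutRight: d has stride 1");
static_assert(offsetLeft(ext, 1, 5, 1) - offsetLeft(ext, 1, 5, 0) == 4 * 10,
              "LayoutLeft: d has stride R*I");

// Stepping iR (the TeamThreadRange index) is contiguous only for LayoutLeft.
static_assert(offsetLeft(ext, 2, 5, 0) - offsetLeft(ext, 1, 5, 0) == 1,
              "LayoutLeft: iR has stride 1");
static_assert(offsetRight(ext, 2, 5, 0) - offsetRight(ext, 1, 5, 0) == 10 * 3,
              "LayoutRight: iR has stride I*3");
```

With these formulas, the three components read in the d loop sit next to each other under LayoutRight and R*I elements apart under LayoutLeft. Since id is looked up through a, the stride between the ids touched by neighboring lanes depends on the data rather than on the layout alone.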
Wait, can functors for team parallel fors even take an int as their input? They should take one of those team member thingies instead.
Actually, it was correct in the original post.
(Also, there's no value to passing int by const reference. Just pass it by (const) value.)