A on device 0:: 5 -3 1 14 2 3 3 15 3 2 -4 3
A on device 1:: 5 -3 1 14 0 4.2 2.6 9.4 3 2 -4 3
A on device 2:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on host :: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
Of course each thread holds different values. My question is: can I assume that at the end of the kernel execution the shared memory on the device holds the values from the most recently updated thread? If not, what is the memory copy mechanism?
I do not fully understand what you are trying to do.
After the kernel execution has finished (all GPU threads have finished) you can safely copy back the memory to the host or read it from the host in case of mapped-memory.
You cannot rely on the order in which threads are executed.
Could you please describe your problem a bit more? Else I will close this ticket.
Using alpaka::acc::AccCpuSerial the array A on each thread/device held different values (except thread #2 had the same values as the host).
Changing to other acc types (AccCpuFibers, AccCpuOmp2Threads, and AccCpuThreads), the values of A seemed to be the same on every device and host.
A on device 0:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on device 1:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on device 2:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on host :: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
So I am not sure about the memory hierarchy of alpaka's execution.
I still do not know what you are trying to do. In particular, the phrase "the array A on each thread/device" is confusing. Do you have multiple threads on the host? Do you have multiple devices and/or multiple real GPUs? Where does your memory come from? How do you copy the memory onto the device? How do you copy the memory back from the device after executing the kernel and before printing it out?
There is no implicit memory copy at the end of a kernel execution (if you are not using mapped memory). You have to explicitly copy the memory back to the host.
The program is based on the vectorAdd kernel example, with a few changes.
The program runs only on the CPU. I am still trying to get it to run on the GPU.
I just learned that we cannot print (std::cout) from a kernel when executing on the GPU.
Yes, but on GPU you can still use printf().
Just be aware that it has a limited buffer size if you print a lot, which can lead to surprises when things get truncated. Also, you need to actively synchronize the GPU to pull all printf's.
Thanks, ax3l.
@jiradaherbst Is it possible that you link your kernel or post it to this issue?
Here is my kernel. All advice and suggestions will be greatly appreciated.
class GaussKernel
{
public:
    ALPAKA_NO_HOST_ACC_WARNING
    template<
        typename TAcc,
        typename TElem,
        typename TIdx>
    ALPAKA_FN_ACC auto operator()(
        TAcc const & acc,
        TElem * const A,
        TElem * const C,
        TIdx const & numElements ) const
    // let n = number of equations
    // numElements = n*(n+1)
    -> void
    {
        static_assert(
            alpaka::dim::Dim<TAcc>::value == 1,
            "The VectorAddKernel expects 1-dimensional indices!");
        auto const gridThreadIdx(alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u]);
        auto const threadElemExtent(alpaka::workdiv::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u]);
        // threadElemExtent = n+1
        // A represents the augmented matrix as a 1D array
        for (TIdx i(0); i<threadElemExtent-2; i++)
        { // loop to perform the Gauss elimination
            if (gridThreadIdx >= i+1 && gridThreadIdx<(threadElemExtent-1))
            {
                double t=A[gridThreadIdx*threadElemExtent+i]/A[i*(threadElemExtent)+i];
                for (TIdx j(0); j<threadElemExtent; j++)
                {
                    // make the elements below the pivot elements equal to zero, i.e. eliminate the variables
                    A[gridThreadIdx*threadElemExtent+j]=A[gridThreadIdx*threadElemExtent+j]-t*A[i*(threadElemExtent)+j];
                }
            }
            alpaka::block::sync::syncBlockThreads(acc);
        }
        for (TIdx i(threadElemExtent-1); i>0 ;i--)
        {
            // back-substitution
            // C is an array whose values correspond to the values of x,y,z..
            TIdx const ii(i-1);
            C[ii]=A[ii*threadElemExtent+threadElemExtent-1]; // make the variable to be calculated equal to the rhs of the last equation
            for (TIdx j(ii+1);j<threadElemExtent-1;j++)
                if (j!=ii) // then subtract all the lhs values except the coefficient of the variable whose value is being calculated
                    C[ii]=C[ii]-A[ii*threadElemExtent+j]*C[j];
            C[ii]=C[ii]/A[ii*threadElemExtent+ii];
            // now finally divide the rhs by the coefficient of the variable to be calculated
        }
    }
};
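For checking the kernel's output, a plain sequential version of the same two phases (forward elimination without pivoting, then back substitution) can serve as a reference. The following is only a sketch in standard C++, not alpaka code; the name gaussReference and its signature are made up for illustration, and it assumes, like the kernel, that no pivot element is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of alpaka.
// a holds the augmented matrix in row-major order: n rows, n+1 columns.
// Forward elimination is done in place; the solution vector is returned.
std::vector<double> gaussReference(std::vector<double>& a, std::size_t n)
{
    std::size_t const cols = n + 1;
    assert(a.size() == n * cols);

    // Forward elimination: zero the entries below each pivot (no pivoting).
    for (std::size_t i = 0; i + 1 < n; ++i)
    {
        for (std::size_t row = i + 1; row < n; ++row)
        {
            double const t = a[row * cols + i] / a[i * cols + i];
            for (std::size_t j = 0; j < cols; ++j)
                a[row * cols + j] -= t * a[i * cols + j];
        }
    }

    // Back substitution, from the last row upwards.
    std::vector<double> x(n);
    for (std::size_t ii = n; ii-- > 0;)
    {
        double s = a[ii * cols + n]; // right-hand side of row ii
        for (std::size_t j = ii + 1; j < n; ++j)
            s -= a[ii * cols + j] * x[j];
        x[ii] = s / a[ii * cols + ii];
    }
    return x;
}
```

Called with the 3x4 matrix from this thread (5 -3 1 14 / 2 3 3 15 / 3 2 -4 3) and n = 3, the last row of a after the call should match the host output above (0 0 -6.95238 -13.9048), and the returned solution should be approximately x = 3, y = 1, z = 2.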
If I understand you correctly, you have printed A from within the kernel (your step 3). At which place have you done the tracing? Have you synchronized the threads before doing this? (Add an alpaka::block::sync::syncBlockThreads(acc); directly before tracing.)
I print out A on device between the two for-loops (after forward elimination and before back substitution), so the thread synchronization is done before (as it is done in the first for-loop). The value in A now is the result of forward elimination and is used in the next step.
I have noticed that alpaka chooses a different execution environment depending on which Acc the program runs on.
GaussKernelTester(
numElements:12
, accelerator: AccCpuSerial<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (3), blockThreadExtent: (1), threadElemExtent: (4)}
)
GaussKernelTester(
numElements:12
, accelerator: AccCpuThreads<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (1), blockThreadExtent: (3), threadElemExtent: (4)}
)
GaussKernelTester(
numElements:12
, accelerator: AccCpuOmp2Threads<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (1), blockThreadExtent: (3), threadElemExtent: (4)}
)
GaussKernelTester(
numElements:12
, accelerator: AccCpuFibers<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (1), blockThreadExtent: (3), threadElemExtent: (4)}
)
I set the threadElemExtent equal to the number of columns of the input augmented matrix (in this case 4). Alpaka created the same execution environment for AccCpuThreads, AccCpuOmp2Threads and AccCpuFibers, i.e., 1 gridBlock and 3 threads per block (did I understand correctly?!? :). For AccCpuSerial, Alpaka created 3 gridBlocks and 1 thread per block. Could this cause the values of A reported above? I am wondering why (on AccCpuSerial) A on device 2 is the one that is used in the next step and copied back to the host.
Yes, you are correctly seeing different work divisions being used by different accelerators.
The AccCpuSerial cannot execute anything in parallel. Therefore it has to execute 3 blocks of 1 thread each, whereas the other accelerators, which support parallel execution, need only 1 block in which 3 threads are executed in parallel.
However, in the end there will always be 3 threads. The only difference is whether they are executed in parallel (1 block; 3 threads) or sequentially (3 blocks; 1 thread). So the kernel will always be called 3 times.
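To see how the two work divisions lead to the printed values of A, the forward-elimination loop can be replayed on the host. The sketch below is plain C++ and not alpaka code; traceForwardElimination is a hypothetical helper. It assumes that blocks run one after another in index order (which matches the AccCpuSerial output posted above) and that the barrier at the end of each pivot step only covers the threads of one block, so a block of 1 thread never waits for the other blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;

// Hypothetical simulation, not alpaka code. Blocks run one after another;
// the threads of one block advance in lockstep, with the end of each pivot
// step acting as the block-wide barrier. Returns what each thread would see
// in A right after its forward-elimination loop.
std::vector<Matrix> traceForwardElimination(
    Matrix a, std::size_t n,
    std::size_t gridBlockExtent, std::size_t blockThreadExtent)
{
    std::size_t const cols = n + 1;
    assert(a.size() == n * cols);
    std::vector<Matrix> seenByThread;

    for (std::size_t block = 0; block < gridBlockExtent; ++block)
    {
        for (std::size_t i = 0; i + 1 < n; ++i) // pivot steps
        {
            for (std::size_t t = 0; t < blockThreadExtent; ++t)
            {
                std::size_t const gridThreadIdx = block * blockThreadExtent + t;
                if (gridThreadIdx >= i + 1 && gridThreadIdx < n)
                {
                    double const f = a[gridThreadIdx * cols + i] / a[i * cols + i];
                    for (std::size_t j = 0; j < cols; ++j)
                        a[gridThreadIdx * cols + j] -= f * a[i * cols + j];
                }
            }
            // Barrier: reached once all threads of THIS block finished step i.
        }
        // Each thread of the block would print A at this point.
        for (std::size_t t = 0; t < blockThreadExtent; ++t)
            seenByThread.push_back(a);
    }
    return seenByThread;
}
```

With the matrix from this thread and n = 3, calling the helper with gridBlockExtent = 3 and blockThreadExtent = 1 yields three different snapshots (the untouched matrix, then row 1 eliminated, then the final result), in line with the AccCpuSerial trace. Calling it with gridBlockExtent = 1 and blockThreadExtent = 3 yields the final matrix three times, in line with the other accelerators. Under these assumptions the last thread simply runs last on AccCpuSerial, which is why its view of A equals what ends up on the host.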
Has this issue been resolved or are there more questions?
No more questions for now. :)