Alpaka: Memory copy from device to host

Created on 9 May 2018 · 15 Comments · Source: alpaka-group/alpaka

My last question is about memory. I designed my program so that each thread handles a different row in the forward elimination step, and the thread with ID 2 performs the last iteration. I printed the values of the array A from each thread and from the host:

A on device 0:: 5 -3 1 14 2 3 3 15 3 2 -4 3
A on device 1:: 5 -3 1 14 0 4.2 2.6 9.4 3 2 -4 3
A on device 2:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on host :: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048

Each thread sees different values. Can I assume that, at the end of the kernel execution, the memory on the device holds the values written by the most recently updated thread? Or what is the mechanism of the memory copy?

jiradaherbst

All 15 comments

I do not fully understand what you are trying to do.
After the kernel execution has finished (all GPU threads have finished) you can safely copy back the memory to the host or read it from the host in case of mapped-memory.
You cannot rely on the order in which threads are executed.

BenjaminW3 on 11 May 2018

Could you please describe your problem in more detail? Otherwise I will close this ticket.

BenjaminW3 on 15 May 2018

Using alpaka::acc::AccCpuSerial the array A on each thread/device held different values (except thread #2 had the same values as the host).

Changing to other acc types (AccCpuFibers, AccCpuOmp2Threads, and AccCpuThreads), the values of A seemed to be the same on every device and host.
A on device 0:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on device 1:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on device 2:: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048
A on host :: 5 -3 1 14 0 4.2 2.6 9.4 0 0 -6.95238 -13.9048

So I am not sure about the memory hierarchy of the alpaka execution.

jiradaherbst on 15 May 2018

I still do not know what you are trying to do. In particular, the phrase "the array A on each thread/device" is confusing. Do you have multiple threads on the host? Do you have multiple devices and/or multiple real GPUs? Where does your memory come from? How do you copy the memory onto the device? How do you copy the memory back from the device after executing the kernel and before printing it out?
There is no implicit memory copy at the end of a kernel execution (if you are not using mapped memory). You have to explicitly copy the memory back to the host.

BenjaminW3 on 15 May 2018

The program is a modified version of the vectorAdd kernel example, with a few changes:

1. I read some input into A (instead of generating random numbers) and removed array B.
2. Array A is copied from the host to the acc using the same command as in vectorAdd.
3. The kernel is modified to perform Gauss elimination (forward elimination and back substitution).
   "A on device .. ::" is printed here from each thread.
   Note: sorry for the confusion of terms (device/thread/acc/etc.).
4. After kernel execution, A is copied back to the host using the same command as in vectorAdd.
   "A on host ::" is printed after it has been copied back.

So far the program runs only on the CPU. I am still trying to get it to run on a GPU.

jiradaherbst on 15 May 2018

I just learned that we cannot print with std::cout from a kernel when executing on a GPU.

jiradaherbst on 16 May 2018

Yes, but on GPU you can still use printf().
Just be aware that it has a limited buffer size, so if you print a lot, output can be truncated, which can lead to surprises. Also, you need to explicitly synchronize with the GPU to flush all printf output.

ax3l on 16 May 2018

Thanks, ax3l.

jiradaherbst on 16 May 2018

@jiradaherbst Could you link your kernel or post it in this issue?

psychocoderHPC on 17 May 2018

Here is my kernel. Any advice or suggestions would be greatly appreciated.

class GaussKernel
{
public:
    ALPAKA_NO_HOST_ACC_WARNING
    template<
        typename TAcc,
        typename TElem,
        typename TIdx>
    ALPAKA_FN_ACC auto operator()(
        TAcc const & acc,
        TElem * const A,
        TElem * const C,
        TIdx const & numElements ) const
        // let n = number of equations
        // numElements = n*(n+1)
    -> void
    {
        static_assert(
            alpaka::dim::Dim<TAcc>::value == 1,
            "The GaussKernel expects 1-dimensional indices!");

        auto const gridThreadIdx(alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u]);

        auto const threadElemExtent(alpaka::workdiv::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u]);
        // threadElemExtent = n+1

        // A represents the augmented matrix stored as a 1D array
        for (TIdx i(0); i<threadElemExtent-2; i++)
        {
            // loop to perform the Gauss elimination
            if (gridThreadIdx >= i+1 && gridThreadIdx<(threadElemExtent-1))
            {
                double t=A[gridThreadIdx*threadElemExtent+i]/A[i*(threadElemExtent)+i];

                for (TIdx j(0); j<threadElemExtent; j++)
                {
                    // make the elements below the pivot element equal to zero, i.e. eliminate the variables
                    A[gridThreadIdx*threadElemExtent+j]=A[gridThreadIdx*threadElemExtent+j]-t*A[i*(threadElemExtent)+j];
                }
            }
            alpaka::block::sync::syncBlockThreads(acc);
        }


        // back substitution
        // C is an array whose values correspond to the values of x, y, z, ...
        for (TIdx i(threadElemExtent-1); i>0; i--)
        {
            TIdx const ii(i-1);
            // set the variable to be calculated to the rhs of its equation
            C[ii]=A[ii*threadElemExtent+threadElemExtent-1];

            // then subtract all the lhs terms except the coefficient of the variable being calculated
            for (TIdx j(ii+1); j<threadElemExtent-1; j++)
                if (j!=ii)
                    C[ii]=C[ii]-A[ii*threadElemExtent+j]*C[j];

            // finally divide by the coefficient of the variable being calculated
            C[ii]=C[ii]/A[ii*threadElemExtent+ii];
        }
    }
};

jiradaherbst on 17 May 2018
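
For checking the kernel's output on the host, the back-substitution loop above can be mirrored in plain C++. This is only a sketch without any alpaka dependency: `backSubstitute` is a hypothetical helper name, and it assumes the augmented matrix is already upper triangular and stored row-major with `cols` = n + 1 columns, as in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the kernel's second loop (hypothetical helper).
// A is an upper-triangular augmented matrix with n rows and cols = n + 1 columns.
std::vector<double> backSubstitute(std::vector<double> const& A, std::size_t cols)
{
    assert(cols >= 2 && A.size() == (cols - 1) * cols);
    std::size_t const n = cols - 1;
    std::vector<double> C(n);
    for (std::size_t i = n; i > 0; --i)
    {
        std::size_t const ii = i - 1;
        double sum = A[ii * cols + n];       // right-hand side of equation ii
        for (std::size_t j = ii + 1; j < n; ++j)
            sum -= A[ii * cols + j] * C[j];  // subtract the already solved unknowns
        C[ii] = sum / A[ii * cols + ii];     // divide by the pivot
    }
    return C;
}
```

Applied to the upper-triangular values printed earlier in this thread (5 -3 1 14 / 0 4.2 2.6 9.4 / 0 0 -6.95238 -13.9048), this gives approximately x = 3, y = 1, z = 2, which also satisfies the original three equations.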

If I understand you correctly, you have printed A from within the kernel (your step 3). Where exactly have you done the tracing? Have you synchronized the threads before doing this? (Add an alpaka::block::sync::syncBlockThreads(acc); directly before tracing.)

BenjaminW3 on 22 May 2018

I print A on the device between the two for-loops (after forward elimination and before back substitution), so the threads have already been synchronized at that point (by the sync at the end of the first for-loop). The values in A at that point are the result of forward elimination and are used in the next step.

I have noticed that alpaka chooses a different execution environment depending on the Acc the program runs on:

GaussKernelTester(
numElements:12
, accelerator: AccCpuSerial<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (3), blockThreadExtent: (1), threadElemExtent: (4)}
)

GaussKernelTester(
numElements:12
, accelerator: AccCpuThreads<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (1), blockThreadExtent: (3), threadElemExtent: (4)}
)

GaussKernelTester(
numElements:12
, accelerator: AccCpuOmp2Threads<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (1), blockThreadExtent: (3), threadElemExtent: (4)}
)

GaussKernelTester(
numElements:12
, accelerator: AccCpuFibers<1,m>
, kernel: 15VectorAddKernel
, workDiv: {gridBlockExtent: (1), blockThreadExtent: (3), threadElemExtent: (4)}
)

I set threadElemExtent equal to the number of columns of the input augmented matrix (4 in this case). Alpaka created the same execution environment for AccCpuThreads, AccCpuOmp2Threads and AccCpuFibers, i.e., 1 gridBlock and 3 threads per block (did I understand correctly? :). For AccCpuSerial, alpaka created 3 gridBlocks and 1 thread per block. Could this cause the values of A that I reported? I am wondering why (on AccCpuSerial) A on device 2 is the one chosen for the next step and copied back to the host.

jiradaherbst on 23 May 2018

Yes, you are correctly seeing different work divisions being used by different accelerators.
AccCpuSerial cannot execute anything in parallel. Therefore it has to execute 3 blocks with only 1 thread each, whereas the other accelerators, which support parallel execution, need only 1 block in which 3 threads are executed in parallel.
However, in the end there will always be 3 threads. The only difference is whether they are executed in parallel (1 block; 3 threads) or sequentially (3 blocks; 1 thread). So the kernel will always be called 3 times.

BenjaminW3 on 26 May 2018
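
The effect of the two work divisions on the printed values can be reproduced without alpaka. The following is a minimal sketch in plain C++, not alpaka code: `eliminationStep`, `runSequentialBlocks` and `runLockStepBlock` are hypothetical helper names, and the "threads" are ordinary loop iterations. It assumes that the serial back-end runs the three single-thread blocks one after another in index order, which matches the output reported above but is not something a kernel should rely on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;

// One iteration i of the kernel's first loop for the logical thread `row`.
// `cols` plays the role of threadElemExtent (n + 1).
void eliminationStep(Matrix& A, std::size_t cols, std::size_t row, std::size_t i)
{
    if (row >= i + 1 && row < cols - 1)
    {
        double const t = A[row * cols + i] / A[i * cols + i];
        for (std::size_t j = 0; j < cols; ++j)
            A[row * cols + j] -= t * A[i * cols + j];
    }
}

// Model of 3 blocks x 1 thread: each thread finishes its whole loop before
// the next one starts, and "prints" (records) A when it is done.
std::vector<Matrix> runSequentialBlocks(Matrix A, std::size_t cols)
{
    assert(cols >= 2);
    std::vector<Matrix> seen;
    for (std::size_t row = 0; row + 1 < cols; ++row)
    {
        for (std::size_t i = 0; i + 2 < cols; ++i)
            eliminationStep(A, cols, row, i);
        seen.push_back(A);
    }
    return seen;
}

// Model of 1 block x 3 threads: all threads finish step i (the block sync)
// before any thread starts step i + 1, so every thread records the same A.
std::vector<Matrix> runLockStepBlock(Matrix A, std::size_t cols)
{
    assert(cols >= 2);
    for (std::size_t i = 0; i + 2 < cols; ++i)
        for (std::size_t row = 0; row + 1 < cols; ++row)
            eliminationStep(A, cols, row, i);
    return std::vector<Matrix>(cols - 1, A);
}
```

With the 3x4 input from this thread, `runSequentialBlocks` records three different snapshots (the original matrix, then the matrix with row 1 eliminated, then the final matrix), while `runLockStepBlock` records the final matrix three times. In both cases there is only one array A, observed at different points in time, and its final contents are the same.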

Has this issue been resolved or are there more questions?

BenjaminW3 on 11 Jun 2018

No more questions for now. :)

jiradaherbst on 13 Jun 2018
