Alpaka: Slow memory copy

Created on 10 Jan 2020 · 7 Comments · Source: alpaka-group/alpaka

Hello,
We are trying to use alpaka and cupla to port a code to multiple architectures. Everything works, but we noticed that the same code written in alpaka is about two times slower than the one written in cupla. After a quick investigation we found that the problem seems to be in the copy operations done by alpaka.
Indeed, running the algorithm with cupla takes ~1800 us, of which ~100 us are spent in the copies; with alpaka the algorithm takes ~3100 us, of which ~1300 us are spent in the copies. (This is on GPU, but we see the same problem on CPU.)

The copies in alpaka are done in the following way, given two objects Input and Output:

  // Type aliases for a 1D CUDA accelerator with 64-bit indices
  using Dim = alpaka::dim::DimInt<1u>;
  using Idx = uint64_t;
  using Extent = uint64_t;
  using Acc = alpaka::acc::AccGpuCudaRt<Dim, Extent>;
  using DevHost = alpaka::dev::DevCpu;
  using DevAcc = alpaka::dev::Dev<Acc>;
  using PltfHost = alpaka::pltf::Pltf<DevHost>;
  using PltfAcc = alpaka::pltf::Pltf<DevAcc>;
  using Queue = alpaka::queue::QueueCudaRtNonBlocking;
  using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>;
  using Vec = alpaka::vec::Vec<Dim, Idx>;

  // Each buffer holds a single Input or Output object
  Idx const elements(1);
  Vec const extent(elements);

  // Host side: non-owning view over the existing input object
  using ViewInput = alpaka::mem::view::ViewPlainPtr<DevHost, const Input, Dim, Idx>;
  ViewInput input_hBuf(&input, devHost, extent);

  // Device side: buffer for the input object
  using BufDevInput = alpaka::mem::buf::Buf<DevAcc, Input, Dim, Idx>;
  BufDevInput input_dBuf(alpaka::mem::buf::alloc<Input, Idx>(devAcc, extent));

  // Host side: non-owning view over the existing output object
  using ViewOutput = alpaka::mem::view::ViewPlainPtr<DevHost, Output, Dim, Idx>;
  ViewOutput output_hBuf(&output, devHost, extent);

  // Device side: buffer for the output object
  using BufDevOutput = alpaka::mem::buf::Buf<DevAcc, Output, Dim, Idx>;
  BufDevOutput output_dBuf(alpaka::mem::buf::alloc<Output, Idx>(devAcc, extent));

  alpaka::mem::view::getPtrNative(output_hBuf)
      ->err.construct(pixelgpudetails::MAX_FED_WORDS, alpaka::mem::view::getPtrNative(output_dBuf)->err_d);

  // Host-to-device copies: the arguments are (queue, destination, source, extent)
  alpaka::mem::view::copy(queue, input_dBuf, input_hBuf, extent);
  alpaka::mem::view::copy(queue, output_dBuf, output_hBuf, extent);

Input and Output are two structs declared with alignas(128).
The code is compiled with the following flags:

# host compiler
CXX := g++
CXX_FLAGS := -O2 -std=c++14
CXX_DEBUG := -g

# CUDA compiler
NVCC := $(CUDA_BASE)/bin/nvcc -ccbin $(CXX)
NVCC_FLAGS := -O2 -std=c++14 --expt-relaxed-constexpr -w --generate-code arch=compute_35,code=sm_35 --generate-code arch=compute_50,code=sm_50 --generate-code arch=compute_60,code=sm_60 --generate-code arch=compute_70,code=sm_70
NVCC_DEBUG := -g -lineinfo

Are we missing something?

Thanks

CUDA Question

All 7 comments

Note: all cupla buffers created with cuplaMallocHost or cuplaMalloc are mapped to byte arrays. Alpaka buffers are type safe and use the type given to the buffer for the allocation. This could have an impact on performance.

If you use CUDA with cupla, we pin all buffers created with cuplaMallocHost: https://github.com/ComputationalRadiationPhysics/cupla/blob/0594a68a0d9bdbfc949391f83473d4734575a7f5/src/memory.cpp#L157-L160
This has a positive impact on performance. Pinning in alpaka is currently only possible if you use CUDA, and it requires an #ifdef in the user code :-(

I am not sure why you see this performance difference on the CPU as well, but maybe it has something to do with the cupla byte arrays.

@waredjeb thanks for your report. I believe that in alpaka one can ask for the host buffer to be pinned with this utility, which does nothing without CUDA and pins the buffer with CUDA.
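
To illustrate the "does nothing without CUDA, pins with CUDA" behaviour described above, here is a rough sketch written against the plain CUDA runtime rather than alpaka. It is not alpaka's implementation: the macro EXAMPLE_WITH_CUDA and the helpers pinIfCuda and unpinIfCuda are hypothetical names made up for this example. Only cudaHostRegister and cudaHostUnregister are real CUDA runtime calls.

```cpp
#include <cassert>
#include <cstddef>

// EXAMPLE_WITH_CUDA is a hypothetical macro for this sketch; define it only
// when building against the CUDA runtime.
#if defined(EXAMPLE_WITH_CUDA)
#include <cuda_runtime.h>
#endif

// Page-locks an existing host allocation when CUDA is available.
// Returns true if the memory was pinned, false if nothing was done.
inline bool pinIfCuda(void* ptr, std::size_t bytes) {
    assert(ptr != nullptr);
#if defined(EXAMPLE_WITH_CUDA)
    return cudaHostRegister(ptr, bytes, cudaHostRegisterDefault) == cudaSuccess;
#else
    (void)bytes;  // no CUDA: deliberately a no-op
    return false;
#endif
}

// Releases the page lock again; also a no-op without CUDA.
inline void unpinIfCuda(void* ptr) {
    assert(ptr != nullptr);
#if defined(EXAMPLE_WITH_CUDA)
    cudaHostUnregister(ptr);
#endif
}
```

Keeping the #ifdef inside one helper like this means the calling code stays the same on CPU-only and CUDA builds.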

@waredjeb I am not sure which version of alpaka you use, but since pull request #896 we align alpaka host buffers to 4 KiB. This means that if you have enabled a CUDA accelerator and a CPU accelerator but only use the CPU accelerator, all your host buffers will be aligned to 4 KiB. This can have a positive impact on the CPU code performance because each buffer starts on its own memory page.
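
For readers who want to see what a 4 KiB aligned host allocation looks like outside alpaka, here is a minimal sketch. It assumes C++17 (std::aligned_alloc) and a 4096-byte page size; the helper names roundUpToPage, allocPageAligned and isPageAligned are invented for this example and are not part of alpaka.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// 4 KiB, the alignment mentioned in the comment above (assumed page size).
constexpr std::size_t kPageAlign = 4096;

// std::aligned_alloc expects the size to be a multiple of the alignment,
// so round the request up to the next multiple of kPageAlign.
inline std::size_t roundUpToPage(std::size_t bytes) {
    return (bytes + kPageAlign - 1) / kPageAlign * kPageAlign;
}

// Allocates at least `bytes` bytes starting on a 4 KiB boundary.
// The caller releases the memory with std::free.
inline void* allocPageAligned(std::size_t bytes) {
    assert(bytes > 0);
    return std::aligned_alloc(kPageAlign, roundUpToPage(bytes));
}

// True if the pointer sits exactly on a 4 KiB boundary.
inline bool isPageAligned(void const* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kPageAlign == 0;
}
```

Note that the rounding means even a 128-byte struct occupies a full page, which is the price for each buffer starting on its own page.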

Thanks for the clarification. I'll try your suggestions and let you know!

OK, pinning the memory worked! Now the performance is the same for cupla and alpaka!
Now the copy is done in the following way:

  using Dim = alpaka::dim::DimInt<1u>;
  using Idx = uint64_t;
  using Extent = uint64_t;
  using Acc = alpaka::acc::AccGpuCudaRt<Dim, Extent>;
  using DevHost = alpaka::dev::DevCpu;
  using DevAcc = alpaka::dev::Dev<Acc>;
  using PltfHost = alpaka::pltf::Pltf<DevHost>;
  using PltfAcc = alpaka::pltf::Pltf<DevAcc>;
  using Queue = alpaka::queue::QueueCudaRtNonBlocking;
  using WorkDiv = alpaka::workdiv::WorkDivMembers<Dim, Idx>;
  using Vec = alpaka::vec::Vec<Dim, Idx>;

  Idx const elements(1);
  Vec const extent(elements);

  auto input_dBuf = alpaka::mem::buf::alloc<Input, Idx>(device, extent);
  Input* input_d = alpaka::mem::view::getPtrNative(input_dBuf);
  auto input_hBuf = alpaka::mem::buf::alloc<Input, Idx>(host, extent);
  alpaka::mem::buf::prepareForAsyncCopy(input_hBuf);
  Input* input_h = alpaka::mem::view::getPtrNative(input_hBuf);
  std::memcpy(input_h, &input, sizeof(Input));

  auto output_dBuf = alpaka::mem::buf::alloc<Output, Idx>(device, extent);
  Output* output_d = alpaka::mem::view::getPtrNative(output_dBuf);
  auto output_hBuf = alpaka::mem::buf::alloc<Output, Idx>(host, extent);
  alpaka::mem::buf::prepareForAsyncCopy(output_hBuf);

  alpaka::mem::view::copy(queue, input_dBuf, input_hBuf, extent);
  alpaka::mem::view::copy(queue, output_dBuf, output_hBuf, extent);

We had to change the ViewPlainPtr to a 'standard' buffer.
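
One thing worth checking when staging an object into a host buffer with std::memcpy, as in the snippet above: this is only valid if the type is trivially copyable. Below is a small sketch of a guarded staging helper. ExampleInput is a hypothetical stand-in for the real Input struct (whose definition is not shown in this thread), and stageForCopy is a name invented for this example.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for the real Input struct, only for illustration.
struct alignas(128) ExampleInput {
    int words[16];
    unsigned count;
};

// Copies `source` byte-wise into `staging` (for example the native pointer
// of a pinned host buffer). The static_assert rejects types for which a raw
// memcpy would not produce a valid object.
template <typename T>
void stageForCopy(T* staging, T const& source) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "std::memcpy staging requires a trivially copyable type");
    assert(staging != nullptr);
    std::memcpy(staging, &source, sizeof(T));
}
```

With a helper like this, the memcpy line above would become stageForCopy(input_h, input), and a non-trivially-copyable Input would fail at compile time instead of silently misbehaving.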

Thanks!

Glad it helped @waredjeb .
@BenjaminW3 @psychocoderHPC do you think we should add this info to the alpaka equivalent of cudaHostAlloc here and/or somewhere else in the docs?

Now that the docs are updated as well, closing the issue.
