Hi,
cudaHostAlloc and cudaMallocHost allow allocating write-combined memory:
cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
Is there equivalent functionality in Alpaka or Cupla?
@darcato FYI
Currently, we are not supporting the different memory flags for CUDA.
A not-so-nice workaround is to wrap it in a factory:
// pseudocode
void *ptr = nullptr;
size_t size = 1024u;
#if( ALPAKA_ACC_GPU_CUDA_ENABLED == 1 )
    CUDA_CHECK((cuplaError_t)cudaHostAlloc(
        &ptr, size * sizeof (Type),
        cudaHostAllocMapped));
    using ViewPlainPtr = alpaka::mem::view::ViewPlainPtr<Dev, Type, Dim, Idx>;
    ViewPlainPtr plainBuf(ptr, dev, vec::Vec<1u, size_t>(size));
#else
    // create normal alpaka buffer and expose this buffer as ViewPlainPtr
#endif
So, could we do something like this with Cupla?
#ifdef cudaHostAlloc
#undef cudaHostAlloc
#endif
CUPLA_HEADER_ONLY_FUNC_SPEC
cuplaError_t
cuplaHostAlloc(
    void **ptrptr,
    size_t size,
    unsigned int flags
)
{
#if ALPAKA_ACC_GPU_CUDA_ENABLED == 1
    // if compiling for CUDA, use the native allocation functions
    return (cuplaError_t) cudaHostAlloc(ptrptr, size, flags);
#else
    // otherwise, use cuplaMallocHost
    return cuplaMallocHost(ptrptr, size);
#endif
}
?
@fwyzard looks reasonable. Though I do not think the starting #ifdef block is correct: we do not define cudaHostAlloc in cupla and it is a function in CUDA API.
@fwyzard If you like, you can extend cupla. cudaHostAlloc must be undefined to get access to the native CUDA function, and afterwards we define it again.
// we will add the following line to the cuda_to_cupla.hpp
#define cudaHostAlloc(...) cuplaHostAlloc(__VA_ARGS__)
// this code will be added to cupla memory.hpp/cpp
CUPLA_HEADER_ONLY_FUNC_SPEC
cuplaError_t
cuplaHostAlloc(
    void **ptrptr,
    size_t size,
    unsigned int flags
)
{
#if ALPAKA_ACC_GPU_CUDA_ENABLED == 1
    #undef cudaHostAlloc
    // if compiling for CUDA, use the native allocation functions
    auto tmp = (cuplaError_t) cudaHostAlloc(ptrptr, size, flags);
    #define cudaHostAlloc(...) cuplaHostAlloc(__VA_ARGS__)
    return tmp;
#else
    // otherwise, use cuplaMallocHost
    return cuplaMallocHost(ptrptr, size);
#endif
}
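For readers unfamiliar with the redirect trick, here is a minimal, self-contained sketch of the same idea without CUDA or cupla. All names in it (vendorHostAlloc, wrapperHostAlloc, the call counters) are hypothetical stand-ins, not part of either library. It also uses a different way to reach the native function: instead of #undef followed by a second #define, the function name is put in parentheses, which keeps a function-like macro from expanding. Whether that variant is preferable inside cupla is an open question; this is only an illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a native API function such as cudaHostAlloc.
// It ignores the flags and simply allocates with malloc.
static int vendorCalls = 0;
int vendorHostAlloc(void** ptr, std::size_t size, unsigned int /*flags*/)
{
    ++vendorCalls;
    *ptr = std::malloc(size);
    return *ptr != nullptr ? 0 : 1;
}

// From here on, vendorHostAlloc(...) written in user code expands to the
// wrapper (this plays the role of the line added to cuda_to_cupla.hpp).
#define vendorHostAlloc(...) wrapperHostAlloc(__VA_ARGS__)

// Hypothetical wrapper, playing the role of cuplaHostAlloc.
static int wrapperCalls = 0;
int wrapperHostAlloc(void** ptr, std::size_t size, unsigned int flags)
{
    assert(ptr != nullptr);
    ++wrapperCalls;
    // The parenthesized name is not directly followed by '(', so the
    // function-like macro is not expanded and the real function is called.
    return (vendorHostAlloc)(ptr, size, flags);
}
```

With this in place, a call such as vendorHostAlloc(&p, 64, 0u) in user code goes through wrapperHostAlloc exactly once and reaches the underlying function exactly once, and the macro never has to be undefined and redefined inside the wrapper.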
FYI, I did some more benchmarks of our application, and I didn't find any benefits from write-combined memory, so we likely won't be using it any more.
Thanks for this information.
So the memory allocation methods provided by cupla/alpaka are enough for your use cases?
We're in the process of porting more code to Alpaka, but in principle I think so.
We are about to release a new version of alpaka: https://github.com/alpaka-group/alpaka/tree/release-0.6.0-rc3
In the new release, we refactored the namespaces to reduce their depth.