This issue is for tracking the initial design and prototyping of a cuBLAS interface for the LinearAlgebra module.
Draft implementation of this can be found here: https://github.com/slnguyen/chapel/tree/cublas_wrapper/gpu_wrapper/chpl_cublas
User-facing code for daxpy:
https://github.com/slnguyen/chapel/blob/cublas_wrapper/gpu_wrapper/chpl_cublas/chpl_cublas_daxpy_em.chpl
Currently a user creates a Chapel array, then explicitly moves it over to the GPU with cpu_to_gpu. cpu_to_gpu returns a pointer to an array on the GPU which can be used later (see the example below). Two disadvantages of this approach: the user has to manage the host-to-device data movement explicitly, and the returned GPU pointer is untyped, so it has to be cast back to c_ptr(real(64)) in the cu_daxpy function.
//allocate arrays
var X: [1..N] real(64);
var Y: [1..N] real(64);
//put values in arrays
X = 3.0: real(64);
Y = 5.0: real(64);
var gpu_ptr_X = cpu_to_gpu(c_ptrTo(X), c_sizeof(real(64))*N:size_t);
var gpu_ptr_Y = cpu_to_gpu(c_ptrTo(Y), c_sizeof(real(64))*N:size_t);
...
//use cublas daxpy
cu_daxpy(cublas_handle, N, gpu_ptr_X:c_ptr(real(64)), gpu_ptr_Y:c_ptr(real(64)), a);
gpu_to_cpu(c_ptrTo(Y), gpu_ptr_Y, c_sizeof(real(64))*N:size_t);
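For context, cpu_to_gpu / gpu_to_cpu presumably sit on top of the CUDA runtime's cudaMalloc / cudaMemcpy. Below is a minimal sketch of what such helpers could look like in Chapel; this is a hypothetical reconstruction (assuming the program is linked against the CUDA runtime), not the draft's actual code:

```chapel
// Hypothetical sketch of the copy helpers on top of the CUDA runtime API;
// the draft's real cpu_to_gpu/gpu_to_cpu may differ.
extern proc cudaMalloc(ref devPtr: c_void_ptr, size: size_t): c_int;
extern proc cudaMemcpy(dst: c_void_ptr, src: c_void_ptr,
                       count: size_t, kind: c_int): c_int;
extern proc cudaFree(devPtr: c_void_ptr): c_int;

// cudaMemcpyKind values from the CUDA runtime headers
const cudaMemcpyHostToDevice = 1: c_int,
      cudaMemcpyDeviceToHost = 2: c_int;

// allocate device memory and copy a host buffer into it,
// returning the (untyped) device pointer
proc cpu_to_gpu(hostPtr: c_void_ptr, numBytes: size_t): c_void_ptr {
  var devPtr: c_void_ptr;
  cudaMalloc(devPtr, numBytes);
  cudaMemcpy(devPtr, hostPtr, numBytes, cudaMemcpyHostToDevice);
  return devPtr;
}

// copy device memory back into a host buffer
proc gpu_to_cpu(hostPtr: c_void_ptr, devPtr: c_void_ptr, numBytes: size_t) {
  cudaMemcpy(hostPtr, devPtr, numBytes, cudaMemcpyDeviceToHost);
}
```

If the helpers look roughly like this, the untyped c_void_ptr return value is what forces the explicit c_ptr(real(64)) casts in the cu_daxpy call above.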
For reference, cupy creates an array accessible by the GPU with the following:
import numpy as np
import cupy as cp
x_cpu = np.array([1, 2, 3])
l2_cpu = np.linalg.norm(x_cpu)
x_gpu = cp.array([1, 2, 3])
l2_gpu = cp.linalg.norm(x_gpu)
A few high-level design questions:
1. Should we do something similar to cupy, i.e. abstracting away data movement from cpu to gpu (e.g. the cpu_to_gpu(...) function above)?
2. If (1), how should a Chapel GPU array be created? For example, should we have a new GPU domain map that manages pointers to GPU data?

@bradcray @mppf @e-kayrakli you might be interested. Let me know if you have any questions or feedback.
My two cents, motivated mostly from the interface standpoint and not so much by what's happening under the hood and/or what's better for performance:
> Should we do something similar to cupy, i.e. abstracting away data movement from cpu to gpu (e.g. the cpu_to_gpu(...) function above)?

I think so. However, we may consider ways to do that data movement at a low level, similar to other multiresolution features that we have. Ideally, such data movements should be hidden from the user and handled by the compiler/runtime/module support.
> If (1), how should a Chapel GPU array be created? For example, should we have a new GPU domain map that manages pointers to GPU data?

Quite a while ago, when I first thought about wrapping cuBLAS in Chapel, I thought that an initial implementation could be a simple wrapper type around Chapel arrays that also stores a pointer to GPU memory. In that scheme, the cuBLAS wrappers would take an instance of that wrapper type. I am not sure whether that's where we should be in the long term; however, I think it can be a good first step. As we discover difficulties, we can move toward something that is more of a first-class interface. That may look like cuBLAS functions taking Chapel arrays and creating the wrappers internally, and ultimately a domain map implementation for GPU arrays.
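As a rough illustration of that direction, a wrapper type could pair a host-side Chapel array with the device pointer and own the copies. The sketch below is hypothetical (names like GPUArray, toGPU, and copyToHost are made up here) and builds on the draft's cpu_to_gpu / gpu_to_cpu helpers:

```chapel
// Hypothetical GPU array wrapper: a host-side Chapel array plus a pointer
// to a device-side copy of its data.
record GPUArray {
  var dom: domain(1);
  var hostData: [dom] real(64);  // host-side copy of the data
  var devPtr: c_void_ptr;        // device-side copy, from cpu_to_gpu
}

// copy a Chapel array to the GPU and wrap both sides together
proc toGPU(A: [?D] real(64)) {
  var g: GPUArray;
  g.dom = D;
  g.hostData = A;
  const numBytes = c_sizeof(real(64)) * D.size: size_t;
  g.devPtr = cpu_to_gpu(c_ptrTo(g.hostData), numBytes);
  return g;
}

// copy the device-side data back into the wrapper's host array
proc copyToHost(ref g: GPUArray) {
  const numBytes = c_sizeof(real(64)) * g.dom.size: size_t;
  gpu_to_cpu(c_ptrTo(g.hostData), g.devPtr, numBytes);
}
```

A cuBLAS wrapper could then take GPUArray instances instead of raw pointers, e.g. cu_daxpy(cublas_handle, N, gX.devPtr: c_ptr(real(64)), gY.devPtr: c_ptr(real(64)), a).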
The way I look at a cuBLAS wrapper is that it is more self-contained than full GPU support for Chapel arrays. However, it is a very good way to increase our experience in using GPUs with Chapel arrays, so that we can move toward a world where GPUs are supported more natively.
> Should we do something similar to cupy, i.e. abstracting away data movement from cpu to gpu (e.g. the cpu_to_gpu(...) function above)?
>
> I think so. However, we may consider ways to do that data movement at a low level, similar to other multiresolution features that we have. Ideally, such data movements should be hidden from the user and handled by the compiler/runtime/module support.
Yeah, I'd really like to have a high-level interface like this at the end of the day. Having a second interface with more control over data movement, at the cost of verbosity/complexity, would be nice and follows Chapel's multiresolution philosophy. However, the BLAS interface is pretty big, and implementing two fully separate interfaces would be a lot of effort unless we invested in some Chapel code-generation tools to create the interfaces for us, similar to what was done for the LAPACK interface.
> Quite a while ago, when I first thought about wrapping cuBLAS in Chapel, I thought that an initial implementation could be a simple wrapper type around Chapel arrays that also stores a pointer to GPU memory. In that scheme, the cuBLAS wrappers would take an instance of that wrapper type. I am not sure whether that's where we should be in the long term.

I like this idea for an initial implementation. Creating a domain map can be a pretty big effort and comes with lots of complex design questions. Sticking to a simpler, possibly temporary GPU array wrapper type would let us move forward quickly and put those questions off until we are further along in the design of principled GPU support in the language.
I agree with the others.
I wonder if we could use something like chpl_external_array to make the low-level CUDA arrays. We could have a function that copies an array to the GPU and produces such a record, which is really just a super-low-level array. We could support copying it back with whole-array operations.
I've been quiet because I generally agree with the feedback given here: I think over time we'll want to build out additional abstractions over the pointer-like aspects of the draft. E.g., I'd ideally like to call a Chapeltastic cuDaxpy(), passing it X and Y, and have cuDaxpy() itself do the array unwrapping / copying / pointer manipulations before calling the real cu_daxpy(), and then reverse that on the return path, returning a normal array to Chapel.
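Concretely, such a "Chapeltastic" wrapper might look something like the sketch below. It is hypothetical: it reuses the draft's cpu_to_gpu / gpu_to_cpu and cu_daxpy, takes the cuBLAS handle as an argument, and leaves out freeing the device buffers since the draft's interface for that isn't shown here:

```chapel
// Hypothetical high-level wrapper: takes plain Chapel arrays, handles the
// host<->device copies internally, and leaves the result in Y (Y = alpha*X + Y).
proc cuDaxpy(handle, ref Y: [] real(64), ref X: [] real(64), alpha: real(64)) {
  // X is 'ref' only because c_ptrTo expects a modifiable lvalue
  const n = X.size;
  const numBytes = c_sizeof(real(64)) * n: size_t;

  // copy the operands to the device
  var dX = cpu_to_gpu(c_ptrTo(X), numBytes);
  var dY = cpu_to_gpu(c_ptrTo(Y), numBytes);

  // run daxpy on the device via the draft's low-level wrapper
  cu_daxpy(handle, n, dX: c_ptr(real(64)), dY: c_ptr(real(64)), alpha);

  // copy the result back into the normal Chapel array
  gpu_to_cpu(c_ptrTo(Y), dY, numBytes);

  // (the device buffers dX/dY would also need to be freed here)
}
```

With something like that in place, the user-facing code from the top of the issue collapses to a single call: cuDaxpy(cublas_handle, Y, X, a);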