This issue is for tracking the initial design and prototyping of a cuBLAS interface for the LinearAlgebra module.
Draft implementation of this can be found here: https://github.com/slnguyen/chapel/tree/cublas_wrapper/gpu_wrapper/chpl_cublas
User-facing code for daxpy:
https://github.com/slnguyen/chapel/blob/cublas_wrapper/gpu_wrapper/chpl_cublas/chpl_cublas_daxpy_em.chpl
Currently a user creates a Chapel array, then explicitly moves it over to the GPU with cpu_to_gpu. cpu_to_gpu returns a pointer to an array on the GPU which can be used later (see the example below). Two disadvantages of this approach: the user has to manage the host-to-device data movement explicitly, and the returned GPU pointer is untyped, so it has to be cast back to c_ptr(real(64)) in the cu_daxpy function.
//allocate arrays
var X: [1..N] real(64);
var Y: [1..N] real(64);
//put values in arrays
X = 3.0: real(64);
Y = 5.0: real(64);
var gpu_ptr_X = cpu_to_gpu(c_ptrTo(X), c_sizeof(real(64))*N:size_t);
var gpu_ptr_Y = cpu_to_gpu(c_ptrTo(Y), c_sizeof(real(64))*N:size_t);
...
//use cublas daxpy
cu_daxpy(cublas_handle, N, gpu_ptr_X:c_ptr(real(64)), gpu_ptr_Y:c_ptr(real(64)), a);
gpu_to_cpu(c_ptrTo(Y), gpu_ptr_Y, c_sizeof(real(64))*N:size_t);
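For context, cpu_to_gpu / gpu_to_cpu presumably sit on top of the CUDA runtime's cudaMalloc / cudaMemcpy. Below is a minimal sketch of what such helpers could look like in Chapel; this is a hypothetical reconstruction (assuming the program is linked against the CUDA runtime), not the draft's actual code:

```chapel
// Hypothetical sketch of the copy helpers on top of the CUDA runtime API;
// the draft's real cpu_to_gpu/gpu_to_cpu may differ.
extern proc cudaMalloc(ref devPtr: c_void_ptr, size: size_t): c_int;
extern proc cudaMemcpy(dst: c_void_ptr, src: c_void_ptr,
                       count: size_t, kind: c_int): c_int;
extern proc cudaFree(devPtr: c_void_ptr): c_int;

// cudaMemcpyKind values from the CUDA runtime headers
const cudaMemcpyHostToDevice = 1: c_int,
      cudaMemcpyDeviceToHost = 2: c_int;

// allocate device memory and copy a host buffer into it,
// returning the (untyped) device pointer
proc cpu_to_gpu(hostPtr: c_void_ptr, numBytes: size_t): c_void_ptr {
  var devPtr: c_void_ptr;
  cudaMalloc(devPtr, numBytes);
  cudaMemcpy(devPtr, hostPtr, numBytes, cudaMemcpyHostToDevice);
  return devPtr;
}

// copy device memory back into a host buffer
proc gpu_to_cpu(hostPtr: c_void_ptr, devPtr: c_void_ptr, numBytes: size_t) {
  cudaMemcpy(hostPtr, devPtr, numBytes, cudaMemcpyDeviceToHost);
}
```

If the helpers look roughly like this, the untyped c_void_ptr return value is what forces the explicit c_ptr(real(64)) casts in the cu_daxpy call above.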
For reference, cupy creates an array accessible by the GPU with the following:
import numpy as np
import cupy as cp
x_cpu = np.array([1, 2, 3])
l2_cpu = np.linalg.norm(x_cpu)
x_gpu = cp.array([1, 2, 3])
l2_gpu = cp.linalg.norm(x_gpu)
A few high-level design questions:
1. Should we do something similar to cupy, i.e. abstracting away data movement from cpu to gpu (e.g. the cpu_to_gpu(...) function above)?
2. If (1), how should a Chapel GPU array be created? For example, should we have a new GPU domain map that manages pointers to GPU data?

@bradcray @mppf @e-kayrakli you might be interested. Let me know if you have any questions or feedback.
My two cents, motivated mostly from the interface standpoint and not so much by what's happening under the hood and/or what's better for performance:
> Should we do something similar to cupy, i.e. abstracting away data movement from cpu to gpu (e.g. the cpu_to_gpu(...) function above)?

I think so. However, we may consider ways to do that data movement at a low level, similar to other multiresolution features that we have. Ideally, such data movements should be hidden from the user and handled by the compiler/runtime/module support.
> If (1), how should a Chapel GPU array be created? For example, should we have a new GPU domain map that manages pointers to GPU data?

Quite a while ago, when I first thought about wrapping cuBLAS in Chapel, I thought that an initial implementation could be a simple wrapper type around Chapel arrays that also stores a pointer to GPU memory. In that scheme, the cuBLAS wrappers would take an instance of that wrapper type. I am not sure whether that's where we should be in the long term; however, I think it can be a good first step. As we discover difficulties, we can move toward something that is more of a first-class interface. That may look like cuBLAS functions taking Chapel arrays and creating the wrappers internally, and ultimately a domain map implementation for GPU arrays.
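As a rough illustration of that direction, a wrapper type could pair a host-side Chapel array with the device pointer and own the copies. The sketch below is hypothetical (names like GPUArray, toGPU, and copyToHost are made up here) and builds on the draft's cpu_to_gpu / gpu_to_cpu helpers:

```chapel
// Hypothetical GPU array wrapper: a host-side Chapel array plus a pointer
// to a device-side copy of its data.
record GPUArray {
  var dom: domain(1);
  var hostData: [dom] real(64);  // host-side copy of the data
  var devPtr: c_void_ptr;        // device-side copy, from cpu_to_gpu
}

// copy a Chapel array to the GPU and wrap both sides together
proc toGPU(A: [?D] real(64)) {
  var g: GPUArray;
  g.dom = D;
  g.hostData = A;
  const numBytes = c_sizeof(real(64)) * D.size: size_t;
  g.devPtr = cpu_to_gpu(c_ptrTo(g.hostData), numBytes);
  return g;
}

// copy the device-side data back into the wrapper's host array
proc copyToHost(ref g: GPUArray) {
  const numBytes = c_sizeof(real(64)) * g.dom.size: size_t;
  gpu_to_cpu(c_ptrTo(g.hostData), g.devPtr, numBytes);
}
```

A cuBLAS wrapper could then take GPUArray instances instead of raw pointers, e.g. cu_daxpy(cublas_handle, N, gX.devPtr: c_ptr(real(64)), gY.devPtr: c_ptr(real(64)), a).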
The way I look at a cuBLAS wrapper is that it is more self-contained than full GPU support for Chapel arrays. However, it is a very good way to increase our experience in using GPUs with Chapel arrays, so that we can move toward a world where GPUs are supported more natively.
> Should we do something similar to cupy, i.e. abstracting away data movement from cpu to gpu (e.g. the cpu_to_gpu(...) function above)?
>
> I think so. However, we may consider ways to do that data movement at a low level, similar to other multiresolution features that we have. Ideally, such data movements should be hidden from the user and handled by the compiler/runtime/module support.
Yeah, I'd really like to have a high-level interface like this at the end of the day. Having a second interface with more control over data movement, at the cost of verbosity/complexity, would be nice and follows Chapel's multiresolution philosophy. However, the BLAS interface is pretty big, and implementing two fully separate interfaces would be a lot of effort unless we invested in some Chapel code-generation tools to create the interfaces for us, similar to what was done for the LAPACK interface.
> Quite a while ago, when I first thought about wrapping cuBLAS in Chapel, I thought that an initial implementation could be a simple wrapper type around Chapel arrays that also stores a pointer to GPU memory. In that scheme, the cuBLAS wrappers would take an instance of that wrapper type. I am not sure whether that's where we should be in the long term.

I like this idea for an initial implementation. Creating a domain map can be a pretty big effort and comes with lots of complex design questions. Sticking to a simpler, possibly temporary GPU array wrapper type would let us move forward quickly and put those questions off until we are further along in the design of principled GPU support in the language.
I agree with the others.
I wonder if we could use something like chpl_external_array to make the low-level CUDA arrays. We could have a function that copies an array to the GPU and produces such a record, which is really just a super-low-level array. We could support copying it back with whole-array operations.
I've been quiet because I generally agree with the feedback given here: I think over time we'll want to build out additional abstractions over the pointer-like aspects of the draft. E.g., I'd ideally like to call a Chapeltastic cuDaxpy(), passing it X and Y, and have cuDaxpy() itself do the array unwrapping / copying / pointer manipulations before calling the real cu_daxpy(), and then reverse that on the return path, returning a normal array to Chapel.
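Concretely, such a "Chapeltastic" wrapper might look something like the sketch below. It is hypothetical: it reuses the draft's cpu_to_gpu / gpu_to_cpu and cu_daxpy, takes the cuBLAS handle as an argument, and leaves out freeing the device buffers since the draft's interface for that isn't shown here:

```chapel
// Hypothetical high-level wrapper: takes plain Chapel arrays, handles the
// host<->device copies internally, and leaves the result in Y (Y = alpha*X + Y).
proc cuDaxpy(handle, ref Y: [] real(64), ref X: [] real(64), alpha: real(64)) {
  // X is 'ref' only because c_ptrTo expects a modifiable lvalue
  const n = X.size;
  const numBytes = c_sizeof(real(64)) * n: size_t;

  // copy the operands to the device
  var dX = cpu_to_gpu(c_ptrTo(X), numBytes);
  var dY = cpu_to_gpu(c_ptrTo(Y), numBytes);

  // run daxpy on the device via the draft's low-level wrapper
  cu_daxpy(handle, n, dX: c_ptr(real(64)), dY: c_ptr(real(64)), alpha);

  // copy the result back into the normal Chapel array
  gpu_to_cpu(c_ptrTo(Y), dY, numBytes);

  // (the device buffers dX/dY would also need to be freed here)
}
```

With something like that in place, the user-facing code from the top of the issue collapses to a single call: cuDaxpy(cublas_handle, Y, X, a);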