Cupy: Using multiple GPUs for storing cupy arrays

Created on 3 Dec 2020 · 3 comments · Source: cupy/cupy

I want to create and manipulate a CuPy array that doesn't fit in a single GPU's memory but would fit across two. When I create the array, I expected CuPy to automatically use the combined memory of my two GPUs to store it, but instead I get OutOfMemoryError.

Is there a way around it?

issue-checked

All 3 comments

I believe you're looking for something like Dask (or, more specifically, Dask-CUDA), which can automatically split work on multiple GPUs and multiple nodes. See an example here.

Disclaimer: I'm a maintainer of Dask-CUDA and core dev of Dask.

Let me show you a very hacky way to achieve simple operations using managed memory. As far as I know, CuPy was not intended to be used this way (at least I don't think we have tests like this), but with all the flexibility built up over the past few versions, the following should be doable on CuPy v8.x:

import cupy


cupy.show_config()
cupy.cuda.set_allocator(cupy.cuda.malloc_managed)  # use cudaMallocManaged instead of cudaMalloc for CuPy's memory pool

with cupy.cuda.Device(0):
    a = cupy.ones(20 * 1024**3, dtype=cupy.int8)  # too large to fit on a single RTX 2080 Ti (11 GiB)
    first_half = a[:10*1024**3]
    out_1 = first_half.sum()  # sum over the first half of the array on GPU 0

with cupy.cuda.Device(1):
    # Take the pointer to the second half of the managed memory, and wrap it
    # so that CuPy thinks a legitimate array is already allocated on device 1
    mem = cupy.cuda.MemoryPointer(
        cupy.cuda.UnownedMemory(a[10*1024**3:].data.ptr, 10*1024**3, a, device_id=1), 0)
    second_half = cupy.ndarray(10*1024**3, dtype=cupy.int8, memptr=mem)
    out_2 = second_half.sum()  # sum over the second half of the array on GPU 1

# Print at the end: we want both GPUs working in parallel, and printing
# forces data transfer and synchronization
print()
print(out_1)
print(out_2)

Output:

OS                           : Linux-5.4.0-42-generic-x86_64-with-glibc2.10
CuPy Version                 : 8.2.0
NumPy Version                : 1.19.4
SciPy Version                : None
Cython Build Version         : 0.29.21
CUDA Root                    : /usr/local/cuda
CUDA Build Version           : 10000
CUDA Driver Version          : 11000
CUDA Runtime Version         : 10000
cuBLAS Version               : 10000
cuFFT Version                : 10000
cuRAND Version               : 10000
cuSOLVER Version             : (10, 0, 0)
cuSPARSE Version             : 10000
NVRTC Version                : (10, 0)
Thrust Version               : 100903
CUB Build Version            : 100800
cuDNN Build Version          : 7605
cuDNN Version                : 7605
NCCL Build Version           : 2604
NCCL Runtime Version         : 2604
cuTENSOR Version             : None
Device 0 Name                : GeForce RTX 2080 Ti
Device 0 Compute Capability  : 75
Device 1 Name                : GeForce RTX 2080 Ti
Device 1 Compute Capability  : 75

10737418240
10737418240

A snapshot of nvidia-smi showing that both GPUs are kept busy (this should really be verified with nvvp/nsys, but I'm being sloppy):

| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:1A:00.0 Off |                  N/A |
| 18%   53C    P2    90W / 250W |  10653MiB / 11019MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:68:00.0 Off |                  N/A |
| 18%   44C    P2    91W / 250W |  10701MiB / 11016MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Again, the fact that this is possible does not mean it's guaranteed to work, or that it will be as performant as you'd imagine for every kind of task (in this case the majority of the data sits on the host most of the time). It's just to show some interesting possibilities 😄 Either @pentschev's suggestion of using Dask, or managing the array chunking yourself using, say, MPI/mpi4py, is more realistic for production work.

Closing as we so far don't have a concrete plan to implement this feature.
