I want to create and manipulate a CuPy array that doesn't fit in a single GPU's memory, but does fit across two. When I create the array, I expected CuPy to automatically use the combined memory of my two GPUs to store it, but instead I get an OutOfMemoryError.
Is there a way around it?
Let me show you a very hacky way to achieve simple operations using managed memory. CuPy was not intended to be used this way, AFAIK (at least I don't think we include tests like this), but with all the flexibility gradually built up over the past few versions, the following should be doable on CuPy v8.x:
import cupy

cupy.show_config()
cupy.cuda.set_allocator(cupy.cuda.malloc_managed)  # use cudaMallocManaged instead of cudaMalloc for CuPy's memory pool

with cupy.cuda.Device(0):
    a = cupy.ones(20 * 1024**3, dtype=cupy.int8)  # too large to fit on one RTX 2080 Ti
    first_half = a[:10 * 1024**3]
    out_1 = first_half.sum()  # sum over the first half of the array on GPU 0

with cupy.cuda.Device(1):
    # Take the pointer to the managed memory and wrap it so that CuPy thinks
    # a legitimate array is already allocated on device 1
    mem = cupy.cuda.MemoryPointer(
        cupy.cuda.UnownedMemory(a[10 * 1024**3:].data.ptr, 10 * 1024**3, a, device_id=1), 0)
    second_half = cupy.ndarray(10 * 1024**3, dtype=cupy.int8, memptr=mem)
    out_2 = second_half.sum()  # sum over the second half of the array on GPU 1

# we print at the end, as we want both GPUs to work in parallel;
# printing would incur a data transfer and synchronization
print()
print(out_1)
print(out_2)
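(An aside that's not part of the original snippet: managed memory migrates pages on demand, so if page-faulting overhead matters you could try prefetching each half to the GPU that will consume it. A hedged sketch using cupy.cuda.runtime.memPrefetchAsync, which wraps cudaMemPrefetchAsync; it assumes the array a from above, and whether it helps depends on the workload.)

# Hedged sketch: prefetch each half of the managed allocation to the GPU
# that will consume it; `a` is the managed array from the snippet above.
half = 10 * 1024**3  # bytes per half (int8, so bytes == elements)
null_stream = cupy.cuda.Stream.null.ptr
cupy.cuda.runtime.memPrefetchAsync(a.data.ptr, half, 0, null_stream)         # first half -> GPU 0
cupy.cuda.runtime.memPrefetchAsync(a.data.ptr + half, half, 1, null_stream)  # second half -> GPU 1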
Output:
OS : Linux-5.4.0-42-generic-x86_64-with-glibc2.10
CuPy Version : 8.2.0
NumPy Version : 1.19.4
SciPy Version : None
Cython Build Version : 0.29.21
CUDA Root : /usr/local/cuda
CUDA Build Version : 10000
CUDA Driver Version : 11000
CUDA Runtime Version : 10000
cuBLAS Version : 10000
cuFFT Version : 10000
cuRAND Version : 10000
cuSOLVER Version : (10, 0, 0)
cuSPARSE Version : 10000
NVRTC Version : (10, 0)
Thrust Version : 100903
CUB Build Version : 100800
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : 2604
NCCL Runtime Version : 2604
cuTENSOR Version : None
Device 0 Name : GeForce RTX 2080 Ti
Device 0 Compute Capability : 75
Device 1 Name : GeForce RTX 2080 Ti
Device 1 Compute Capability : 75
10737418240
10737418240
A snapshot of nvidia-smi showing both GPUs are kept busy (this should really be verified with nvvp/nsys, but I'm being sloppy):
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:1A:00.0 Off | N/A |
| 18% 53C P2 90W / 250W | 10653MiB / 11019MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:68:00.0 Off | N/A |
| 18% 44C P2 91W / 250W | 10701MiB / 11016MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Again, the fact that this is possible does not mean it's guaranteed to work, or to be as performant as you'd imagine, for every kind of task you need to do (in this case the majority of the data sits on the host most of the time). It's just to show some interesting possibilities 😄 Either @pentschev's suggestion of using Dask, or managing the array chunking yourself with, say, MPI/mpi4py, is more realistic for production work.
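For reference, a minimal sketch of the Dask route mentioned above. This is hedged: it assumes dask, distributed, and dask-cuda are installed, and the chunk size is purely illustrative:

import cupy
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# one worker per visible GPU
client = Client(LocalCUDACluster())

# 20 GiB int8 array split into 2 GiB chunks; each chunk is converted to a
# CuPy array so it lives on whichever GPU its worker owns
a = da.ones(20 * 1024**3, dtype=cupy.int8, chunks=2 * 1024**3).map_blocks(cupy.asarray)
print(a.sum().compute())

And a similarly hedged mpi4py sketch (launched with, e.g., mpirun -n 2; the rank-to-GPU mapping and sizes are illustrative):

from mpi4py import MPI
import cupy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
cupy.cuda.Device(rank).use()  # rank i owns GPU i

# each rank allocates and reduces its own 10 GiB chunk
chunk = cupy.ones(10 * 1024**3, dtype=cupy.int8)
local_sum = int(chunk.sum())  # bring the partial result back to the host

total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print(total)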
Closing, as we don't yet have a concrete plan to implement this feature.