Follow-up: #5798
Hi again,
It seems that for Dask-cuDF, CUDA_VISIBLE_DEVICES specifies which (multiple) GPUs are used for compute, but only one GPU's memory gets used. I confirmed on my machine that CUDA_VISIBLE_DEVICES works with the multi-GPU Dask-cuDF combination: it worked, but it seems to use only one GPU's memory even when multiple GPUs are visible.
My GPUs are GeForce RTX 2080 Ti; is this a supported GPU for distributing the memory? Thank you!
I doubt you are only using one GPU's memory. What symptoms are you seeing that make you think the other GPU's memory is not being used? It would probably be very slow if that were true.
GeForce does not support NVLink. Do you mean SLI?
I doubt you are only using one GPU's memory. What symptoms are you seeing that make you think the other GPU's memory is not being used? It would probably be very slow if that were true.
When one GPU's memory is full, a MemoryError is raised even though the other GPU's memory is still empty.
I tested with the following demo code:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import dask_cudf

ddf_ratings = {}
ddf_ratings_computed = {}
iter = 0
memory_full = False
while not memory_full:
    try:
        ddf_ratings[iter] = dask_cudf.read_csv("ml-25m/ratings.csv")
        ddf_ratings_computed[iter] = ddf_ratings[iter].compute()
        iter += 1
    except MemoryError:
        # Stop the creation when memory is full.
        memory_full = True

print(iter, "dataframes created.")
Full demo code: https://colab.research.google.com/drive/1h4NE3whF6bYHUFb1H_C0Uw4NjpR_dWT5?usp=sharing
The result looks like this:
$ nvidia-smi
Mon Aug 3 16:10:36 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:02:00.0 Off | N/A |
| 43% 42C P8 11W / 250W | 10329MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:81:00.0 Off | N/A |
| 37% 35C P8 12W / 250W | 12MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 52668 C ...dblab/anaconda3/envs/jupyter/bin/python 10317MiB |
+-----------------------------------------------------------------------------+
GeForce does not support NVLink. Do you mean SLI?
I am not sure whether the following results indicate that these GPUs are connected via NVLink or SLI...
$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity
GPU0 X NV2 0-7,16-23
GPU1 NV2 X 8-15,24-31
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
$ nvidia-smi nvlink -s
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-<snipped>)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-<snipped>)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
Thank you!
Thanks for sharing this example. Dask will only use a single GPU by default, so this is expected behavior.
If you want to use multiple GPUs, you'll want to either use the LocalCUDACluster API or launch two dask workers from the command line (one per GPU). Note that, for your example use case, you may want to use LocalCUDACluster.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
# Create a Dask Cluster with one worker per GPU
cluster = LocalCUDACluster()
client = Client(cluster)
With that setup, Dask will distribute work over all GPUs in your machine. For more information on how to use Dask with GPUs, please see the Dask-CUDA docs.
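For completeness, device selection can also be done through LocalCUDACluster itself rather than the environment variable. This is only a rough sketch; the CUDA_VISIBLE_DEVICES keyword argument is based on my reading of the Dask-CUDA docs, so please check the docs for your installed version:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Restrict the cluster to GPUs 0 and 1; one worker is started per listed device.
# Passing CUDA_VISIBLE_DEVICES as a keyword here is an assumption based on the Dask-CUDA docs.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)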
Also note that when you call compute you bring all of the result data for that task to the client process on a single GPU, which can cause memory problems. You may want to persist your data into distributed memory. Please see the Dask docs for more information.
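To make that concrete, here is a minimal sketch of the persist pattern applied to your example. The column name "rating" assumes the standard MovieLens ratings.csv layout, and the mean aggregation is purely illustrative:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()  # one worker per visible GPU
client = Client(cluster)

ddf = dask_cudf.read_csv("ml-25m/ratings.csv")

# persist() materializes the partitions in the workers' GPU memory,
# spread across all GPUs, instead of pulling everything back to the client.
ddf = ddf.persist()

# compute() is then only needed for small, reduced results.
print(ddf["rating"].mean().compute())  # 'rating' column assumed from the MovieLens data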