Your question:
I am trying to run a distributed learning algorithm with 8 workers on a 4-GPU machine. Is it possible to allocate two processes to a single GPU in Horovod?
Hey @guoyuanxiong, when using NCCL, this is not supported, as NCCL does not allow the same GPU to be assigned to multiple ranks.
Can you give some more details about what it is you're trying to achieve? There may be another way to do what you're trying to do.
Thanks for your reply. I am trying to test the performance of a distributed learning algorithm I have proposed, in which multiple workers solve optimization problems in parallel and exchange results iteratively. I want to simulate a non-trivial number of workers (e.g., 8), but my machine only has four GPUs. My initial idea was to place several workers on a single GPU, which should at least work better than running everything on one GPU. Any idea how I should proceed given this hardware restriction?
I thought it might be useful to chime in here. I was trying to do the same thing, multiple ranks per GPU. I hit the same issue.
Ultimately, I recompiled Horovod with NCCL disabled for communication. The scaling overhead was marginally higher, but the throughput gain more than made up for it, even for the same problem size.
I recompiled horovod with this command:
CC=mpicc CXX=mpicxx HOROVOD_CUDA_HOME=$CUDA_DIR HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_NCCL=0 python setup.py build
(and then subsequent install)
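If you go this route, it is worth double-checking the build that actually got installed. Horovod ships a check for this; running
horovodrun --check-build
prints the frameworks, controllers, and tensor operations Horovod was compiled with, and NCCL should no longer appear in that list.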
This was on Summit, so POWER9 CPUs + V100 GPUs. I was not using a traditional network but a sparse convolutional network: the model bounces back and forth between CPU and GPU, doing hash-map lookups on the CPU and matmuls on the GPU. As the batch size increases, the CPU and GPU work scale at different rates, and eventually the CPU bottlenecks the throughput while a lot of GPU memory and compute is left on the table, unused.
Without MPS, and for a fixed network and batch size (192, the peak-throughput batch size for 1 rank), this network achieved 5.9 Img/s. Disabling NCCL and running more ranks per GPU with Horovod, I saw increased throughput for the same problem:
2 ranks / GPU: 8.6 Img/s
4 ranks / GPU: 15.3 Img/s
8 ranks / GPU: 16.2 Img/s
All told, that's a 2.7x speedup on exactly the same hardware. It also scaled out over an entire Summit node at 4 ranks per GPU, 6 GPUs per node.
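(Again, those numbers were measured without MPS. If you do run several ranks per GPU, enabling CUDA MPS is worth trying as well, since it lets kernels from different processes share the GPU more efficiently. On a node you control it can be started with nvidia-cuda-mps-control -d and stopped with echo quit | nvidia-cuda-mps-control; on a managed cluster like Summit it is typically enabled through a job submission flag instead.)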
So, @guoyuanxiong, I think if you build Horovod without NCCL you should be able to run your workload as you describe above.
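For the 8-workers-on-4-GPUs case, the remaining piece is pinning each rank to a device yourself, since the usual hvd.local_rank() device assignment assumes one rank per GPU. A minimal sketch of how that could look with PyTorch (the round-robin mapping is just one possible assignment, and train.py is a placeholder for your own script):

import horovod.torch as hvd
import torch

hvd.init()

if torch.cuda.is_available():
    # e.g. 8 local ranks on a 4-GPU node: ranks 0 and 4 share GPU 0, ranks 1 and 5 share GPU 1, ...
    torch.cuda.set_device(hvd.local_rank() % torch.cuda.device_count())

# ... build the model and optimizer, wrap the optimizer in hvd.DistributedOptimizer,
# and broadcast initial state with hvd.broadcast_parameters as usual.

launched with something like horovodrun -np 8 python train.py (or the equivalent mpirun invocation).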
EDIT: I'm not proposing to reopen this issue. But I thought my experience would be useful for the next users who stumble on this page ...