Horovod: multiple GPUs on one local machine have different hvd.rank() values

Created on 26 Dec 2019 · 1 Comment · Source: horovod/horovod

Environment:

  1. Framework: Keras
  2. Framework version: 2.1.5
  3. Horovod version: 0.18.2
  4. MPI version: 4.0.1
  5. CUDA version: 10.0.130
  6. NCCL version: None
  7. Python version: 3.7.5
  8. OS and version: CentOS 7.7.1908
  9. GCC version: 7.3.1

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Your question:
I have read this issue.
My understanding is that multiple GPUs on the same node have the same rank() value. I have only one local machine, which has multiple GPUs, and I ran the following command:
horovodrun -np 3 -H localhost:3 python mytraincode.py
The values of hvd.size(), hvd.rank(), and hvd.local_rank() obtained in the 3 workers are:
3,0,0
3,1,1
3,2,2
Why do workers on the same machine have different rank () values?

Most helpful comment

Hey @nyaang, rank is kind of like a "process ID". Every worker process will have a unique rank. In Horovod, we run one worker process per GPU most of the time, so you would expect every GPU to have a different rank.

So your output looks correct to me. If you were to run on 2 machines with 4 GPUs each, like this:

horovodrun -np 8 -H server1:4,server2:4 python mytraincode.py

You would see output like this:

8,0,0
8,1,1
8,2,2
8,3,3
8,4,0
8,5,1
8,6,2
8,7,3
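
To make the pattern in these numbers concrete, here is a minimal pure-Python sketch that reproduces them from a host string. It is not Horovod code: `assign_ranks` is a hypothetical helper, and it assumes ranks are handed out host by host in the order the hosts are listed, which matches the output above but may differ with other launcher or MPI mapping settings.

```python
def assign_ranks(hosts):
    """Simulate (size, rank, local_rank) for every worker slot.

    `hosts` uses the same comma-separated "name:slots" form as the -H flag.
    Assumption: ranks are filled host by host, in the order listed.
    """
    slots = []
    for entry in hosts.split(","):
        name, count = entry.rsplit(":", 1)
        slots.append((name, int(count)))

    # size is the total number of worker processes across all hosts
    size = sum(count for _, count in slots)

    workers = []
    rank = 0  # global index, unique across all machines
    for _, count in slots:
        # local_rank restarts at 0 on every machine
        for local_rank in range(count):
            workers.append((size, rank, local_rank))
            rank += 1
    return workers


for size, rank, local_rank in assign_ranks("server1:4,server2:4"):
    print("{},{},{}".format(size, rank, local_rank))
```

With a single machine, as in `assign_ranks("localhost:3")`, the global and local counters never diverge, which is why rank and local_rank are equal in the original output.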

Hope that helps, let me know if you still have questions.
