Your question:
I have read this issue,
My understanding is that multiple GPUs on the same node have the same rank() value. I only have one local machine with multiple GPUs, and I tried the following command:
horovodrun -np 3 -H localhost:3 python mytraincode.py
The values of hvd.size(), hvd.rank(), and hvd.local_rank() obtained in the 3 workers are:
3,0,0
3,1,1
3,2,2
Why do workers on the same machine have different rank() values?
Hey @nyaang, rank is kind of like a "process ID". Every worker process has a unique rank across the whole job, while local_rank is only unique within one machine. In Horovod, we run one worker process per GPU most of the time, so you would expect every GPU to have a different rank.
So your output looks correct to me. If you were to run with 2 machines with 4 GPUs each, like this:
horovodrun -np 8 -H server1:4,server2:4 python mytraincode.py
You would see output like this:
8,0,0
8,1,1
8,2,2
8,3,3
8,4,0
8,5,1
8,6,2
8,7,3
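To make the numbering concrete, here is a minimal sketch that reproduces the size, rank, local_rank triples from a -H style host list. It assumes ranks are handed out host by host in the order the hosts are listed, which matches the outputs shown in this thread; `assign_ranks` is a hypothetical helper for illustration only and is not part of Horovod.

```python
def assign_ranks(hosts):
    """Return a list of (size, rank, local_rank) tuples, one per worker.

    `hosts` uses the same "name:slots" comma-separated form as the -H
    argument, e.g. "server1:4,server2:4". Assumption: ranks are assigned
    host by host, in the order the hosts are listed.
    """
    slots = []
    for entry in hosts.split(","):
        name, count = entry.rsplit(":", 1)
        slots.append((name, int(count)))

    # size is the total number of worker processes across all hosts.
    size = sum(count for _, count in slots)

    workers = []
    rank = 0  # global index, unique across the whole job
    for _, count in slots:
        # local_rank restarts at 0 on every host.
        for local_rank in range(count):
            workers.append((size, rank, local_rank))
            rank += 1
    return workers


if __name__ == "__main__":
    for size, rank, local_rank in assign_ranks("server1:4,server2:4"):
        print(f"{size},{rank},{local_rank}")
```

With "localhost:3" every local_rank happens to equal the rank because there is only one host; the two values diverge once a second host is added.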
Hope that helps, let me know if you still have questions.