Yes, I have tried to understand how Horovod works. However, in the init, I cannot understand the meaning of rank. Can someone give some hints?
# Initialize Horovod
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
Hey @OwenLiuzZ, the rank of a process is a unique ID that distinguishes it from the other processes running your Horovod job at the same time. The local rank is also a unique ID, but only among the processes running your Horovod job _on the same node_.
In the code example you gave, suppose you're running a Horovod job on 2 machines, each with 4 GPUs. In total, you run 8 processes (size = 8). Then the rank will be a number in [0, 7], and the local rank will be a number in [0, 3].
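To make the mapping concrete, here is a minimal sketch (plain Python, not Horovod itself) of how global ranks relate to nodes and local ranks, assuming the common block assignment where consecutive ranks fill one node before moving to the next. `NODES`, `GPUS_PER_NODE`, and the arithmetic below are illustrative, not Horovod API:

```python
# Hypothetical job: 2 nodes, 4 GPUs (one process each) per node.
NODES = 2
GPUS_PER_NODE = 4
size = NODES * GPUS_PER_NODE  # total number of processes: 8

for rank in range(size):
    node = rank // GPUS_PER_NODE       # which machine this process runs on
    local_rank = rank % GPUS_PER_NODE  # its ID among processes on that machine
    print(f"rank={rank} -> node={node}, local_rank={local_rank}")
```

So rank 5 would run on the second machine as local rank 1 under this assignment; the actual placement in a real job depends on how the launcher (e.g. `mpirun` or `horovodrun`) distributes processes.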
The line of code that seems to be throwing you off is:
config.gpu_options.visible_device_list = str(hvd.local_rank())
All that's doing is saying: "If I'm process X on this node, then use GPU X to run all my operations." That way, you have one process per GPU. This works because TensorFlow also identifies GPUs with a numerical ID (see the TensorFlow docs here).
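A small sketch of the effect of that line, with the local ranks simulated by plain integers (neither `hvd` nor `tf` is invoked here; `visible_devices_for` is a hypothetical helper, not a real API):

```python
def visible_devices_for(local_rank):
    # Mirrors config.gpu_options.visible_device_list = str(hvd.local_rank()):
    # the process with local rank N sees only the GPU with numerical ID N.
    return str(local_rank)

# What each of the 4 processes on one hypothetical node would set.
settings = {lr: visible_devices_for(lr) for lr in range(4)}
print(settings)  # {0: '0', 1: '1', 2: '2', 3: '3'}
```

Each process ends up with a distinct single-GPU view, which is why this gives you exactly one process per GPU.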
You can read more about the MPI concepts of rank, size, local rank, etc. in the docs here.
@tgaddair Thanks a lot for your great explanation. The concept of rank makes sense to me now. Following your explanation, one remaining point of confusion: when I run a Horovod job on 2 machines, each with 4 GPUs, the local rank on machine 1 will be a number in [0, 3]. So, in my opinion, the local rank on the other machine (machine 2) will be in the range [4, 7]? Or does machine 2 also use the range [0, 3] for its local ranks?
Sorry for the question, I have since found the answer in your concepts doc. Yes, the local ranks on each machine will be in the range [0, 3].
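That conclusion can be sketched directly: with 2 machines and 4 GPUs each, the local ranks repeat per machine while the global ranks stay unique. The constants below are illustrative, not Horovod API:

```python
GPUS = 4      # processes (GPUs) per machine
MACHINES = 2

# Local ranks grouped by machine, assuming consecutive global ranks per node.
by_machine = [[r % GPUS for r in range(m * GPUS, (m + 1) * GPUS)]
              for m in range(MACHINES)]
print(by_machine)  # [[0, 1, 2, 3], [0, 1, 2, 3]]
```

So local rank alone is not globally unique, but the pair (machine, local rank) is, and so is the global rank.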