Yes, I have tried to understand how Horovod works. However, in the init, I cannot understand the meaning of rank. Can someone give some hints?
# Initialize Horovod
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
Hey @OwenLiuzZ, the rank of a process is a unique ID that distinguishes it from the other processes running your Horovod job at the same time. The local rank is also a unique ID, but only among the processes running your Horovod job _on the same node_.
In the code example you gave, suppose you're running a Horovod job on 2 machines, each with 4 GPUs. In total, you run 8 processes (size = 8). Then the rank will be a number in [0, 7], and the local rank will be a number in [0, 3].
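To make the mapping concrete, here is a minimal sketch (plain Python, not Horovod itself) of how global ranks relate to nodes and local ranks, assuming the common block assignment where consecutive ranks fill one node before moving to the next. `NODES`, `GPUS_PER_NODE`, and the arithmetic below are illustrative, not Horovod API:

```python
# Hypothetical job: 2 nodes, 4 GPUs (one process each) per node.
NODES = 2
GPUS_PER_NODE = 4
size = NODES * GPUS_PER_NODE  # total number of processes: 8

for rank in range(size):
    node = rank // GPUS_PER_NODE       # which machine this process runs on
    local_rank = rank % GPUS_PER_NODE  # its ID among processes on that machine
    print(f"rank={rank} -> node={node}, local_rank={local_rank}")
```

So rank 5 would run on the second machine as local rank 1 under this assignment; the actual placement in a real job depends on how the launcher (e.g. `mpirun` or `horovodrun`) distributes processes.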
The line of code that seems to be throwing you off is:
config.gpu_options.visible_device_list = str(hvd.local_rank())
All that's doing is saying: "If I'm process X on this node, then use GPU X to run all my operations." That way, you have one process per GPU. This works because TensorFlow also identifies GPUs with a numerical ID (see the TensorFlow docs here).
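A small sketch of the effect of that line, with the local ranks simulated by plain integers (neither `hvd` nor `tf` is invoked here; `visible_devices_for` is a hypothetical helper, not a real API):

```python
def visible_devices_for(local_rank):
    # Mirrors config.gpu_options.visible_device_list = str(hvd.local_rank()):
    # the process with local rank N sees only the GPU with numerical ID N.
    return str(local_rank)

# What each of the 4 processes on one hypothetical node would set.
settings = {lr: visible_devices_for(lr) for lr in range(4)}
print(settings)  # {0: '0', 1: '1', 2: '2', 3: '3'}
```

Each process ends up with a distinct single-GPU view, which is why this gives you exactly one process per GPU.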
You can read more about the MPI concepts of rank, size, local rank, etc. in the docs here.
@tgaddair Thanks a lot for your great explanation. The concept of rank makes sense to me now. Following your explanation, one remaining point of confusion: when I run a Horovod job on 2 machines, each with 4 GPUs, the local rank on machine 1 will be a number in [0, 3]. So, in my opinion, the local rank on the other machine (machine 2) will be in the range [4, 7]? Or does machine 2 also use the range [0, 3] for its local ranks?
Sorry for the question, I have since found the answer in your concepts doc. Yes, the local ranks on each machine will be in the range [0, 3].
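That conclusion can be sketched directly: with 2 machines and 4 GPUs each, the local ranks repeat per machine while the global ranks stay unique. The constants below are illustrative, not Horovod API:

```python
GPUS = 4      # processes (GPUs) per machine
MACHINES = 2

# Local ranks grouped by machine, assuming consecutive global ranks per node.
by_machine = [[r % GPUS for r in range(m * GPUS, (m + 1) * GPUS)]
              for m in range(MACHINES)]
print(by_machine)  # [[0, 1, 2, 3], [0, 1, 2, 3]]
```

So local rank alone is not globally unique, but the pair (machine, local rank) is, and so is the global rank.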