Horovod: Question: Run model on specific GPU

Created on 19 Oct 2018  ·  3 comments  ·  Source: horovod/horovod

Does Horovod support running the program on specific GPUs?

I have two compute nodes each with 3 GPUs.

When I use the command mpirun -np 4 -H node1:2,node2:2 python train.py to start a training task with 4 GPUs, Horovod automatically selects GPU0 and GPU1 on node1 and GPU0 and GPU1 on node2 as the training devices. But if GPU0 on node2 is already in use, the task fails.

Is there any configuration I can set so that I can select the GPUs myself?

For example, I want to run the model on node1: GPU0, GPU2 and node2: GPU1, GPU2.

Thanks!

Labels: question, wontfix

All 3 comments

Hey @shaarawy18, yes, you can run on specific GPUs. This isn't actually controlled by mpirun, but within your code: the mpirun command just specifies how many "ranks" (processes) to run on each node, not which GPUs to use.

The GPU assignment typically happens in your training script train.py with lines like these:

# Pin this process to the GPU whose ID matches its local rank
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

hvd.local_rank() says "use the GPU with the same ID as the local rank", but you can set the visible device list to whatever you want. For example, to achieve what you're trying to do, you can explicitly map individual ranks to devices, like:

# Map each global rank to the GPU it should use. With
# mpirun -np 4 -H node1:2,node2:2, ranks 0-1 typically land on node1 and
# ranks 2-3 on node2, so this gives node1: GPU0, GPU2 and node2: GPU1, GPU2.
device_map = {
    0: 0,  # rank 0 (node1) -> GPU0
    1: 2,  # rank 1 (node1) -> GPU2
    2: 1,  # rank 2 (node2) -> GPU1
    3: 2,  # rank 3 (node2) -> GPU2
}
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(device_map[hvd.rank()])
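
For completeness, here's a minimal end-to-end sketch of how that mapping could sit in train.py. This is an illustration assuming the TensorFlow 1.x API and horovod.tensorflow; the device_map values are just the example above, and the broadcast hook and MonitoredTrainingSession are the usual Horovod boilerplate rather than anything specific to GPU selection:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # must run before hvd.rank() / hvd.local_rank() are queried

# Illustrative rank -> GPU mapping; adjust to your cluster layout
device_map = {0: 0, 1: 2, 2: 1, 3: 2}

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(device_map[hvd.rank()])

# ... build the model and wrap your optimizer with hvd.DistributedOptimizer ...

# Broadcast initial variable states from rank 0 and pass the config to the session
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    pass  # training loop goes here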

Does that make sense?

Thank you for your reply, @tgaddair. I will experiment according to your suggestion.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
