Horovod: Question: Run model on specific GPU

Created on 19 Oct 2018  ·  3 comments  ·  Source: horovod/horovod

Does Horovod support running the program on specific GPUs?

I have two compute nodes each with 3 GPUs.

When I use the command mpirun -np 4 -H node1:2,node2:2 python train.py to start a training task with 4 GPUs, Horovod automatically selects GPU0 and GPU1 on node1 and GPU0 and GPU1 on node2 as the training devices. But if GPU0 on node2 is already in use, the task fails.

Is there any configuration I can set so that I can select the GPUs myself?

For example, I want to run the model on node1: GPU0, GPU2 and node2: GPU1, GPU2.

Thanks!

Labels: question, wontfix

All 3 comments

Hey @shaarawy18, yes, you can run on specific GPUs. This isn't actually controlled by mpirun, but within your code: the mpirun command just specifies how many "ranks" (processes) to run on each node, not which GPUs to use.

The GPU assignment typically happens in your training script train.py with lines like these:

# Pin this process to the GPU whose ID matches its local rank
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

hvd.local_rank() says "use the GPU with the same ID as the local rank", but you can set the visible device list to whatever you want. For example, to achieve what you're trying to do, you can explicitly map individual ranks to devices, like:

# Map each global rank to the GPU it should use. With
# mpirun -np 4 -H node1:2,node2:2, ranks 0-1 typically land on node1 and
# ranks 2-3 on node2, so this gives node1: GPU0, GPU2 and node2: GPU1, GPU2.
device_map = {
    0: 0,  # rank 0 (node1) -> GPU0
    1: 2,  # rank 1 (node1) -> GPU2
    2: 1,  # rank 2 (node2) -> GPU1
    3: 2,  # rank 3 (node2) -> GPU2
}
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(device_map[hvd.rank()])
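
For completeness, here's a minimal end-to-end sketch of how that mapping could sit in train.py. This is an illustration assuming the TensorFlow 1.x API and horovod.tensorflow; the device_map values are just the example above, and the broadcast hook and MonitoredTrainingSession are the usual Horovod boilerplate rather than anything specific to GPU selection:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # must run before hvd.rank() / hvd.local_rank() are queried

# Illustrative rank -> GPU mapping; adjust to your cluster layout
device_map = {0: 0, 1: 2, 2: 1, 3: 2}

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(device_map[hvd.rank()])

# ... build the model and wrap your optimizer with hvd.DistributedOptimizer ...

# Broadcast initial variable states from rank 0 and pass the config to the session
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    pass  # training loop goes here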

Does that make sense?

Thank you for your reply, @tgaddair. I will experiment according to your suggestion.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
