Horovod: mpirun will hang, no error, no output

Created on 24 Nov 2017 · 3 Comments · Source: horovod/horovod

I tested tensorflow_word2vec.py from the examples with the following command:
mpirun -np 3 \
    -H ml-208,ml-209,ml-210 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=DEBUG -x LD_LIBRARY_PATH \
    --mca btl_tcp_if_exclude virbr0,virbr0-nic -d \
    python tensorflow_word2vec.py
The command hangs: there is no output and no error.

question

All 3 comments

Which version of Open MPI are you using? Do you have multiple versions of Open MPI installed? I have seen this behavior before when the Open MPI in PATH was different from the Open MPI the application was compiled with.

Another idea is to run strace mpirun ... and check where it gets stuck.
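The check for multiple Open MPI installations suggested above can be sketched in Python. This is a minimal sketch, not part of Horovod or Open MPI: the helper name executables_on_path is hypothetical, and it only finds copies of mpirun that are reachable through PATH, so an installation outside PATH (for example, the one the application was compiled against) would still need to be checked by hand.

```python
import os
import subprocess


def executables_on_path(name, path=None):
    """Return every executable file called `name` on PATH, in lookup order.

    Hypothetical helper for illustration. Copies that resolve to the same
    real file (for example via symlinks) are reported only once.
    """
    path = os.environ.get("PATH", "") if path is None else path
    found = []
    seen_real_paths = set()
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            real = os.path.realpath(candidate)
            if real not in seen_real_paths:
                seen_real_paths.add(real)
                found.append(candidate)
    return found


if __name__ == "__main__":
    copies = executables_on_path("mpirun")
    print("mpirun candidates:", copies)
    if len(copies) > 1:
        print("More than one mpirun is on PATH; the first one listed is the one that runs.")
    if copies:
        # `mpirun --version` reports the version of the copy that PATH resolves to.
        result = subprocess.run([copies[0], "--version"], capture_output=True, text=True)
        print(result.stdout.strip())
```

If the list contains more than one entry, comparing the reported version against the one the application was built with is a reasonable next step before reaching for strace.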

I am using Open MPI 3.0.0. I found the cause: my machines have many network devices, and MPI did not know which one to use. I then used the --mca btl_tcp_if_exclude parameter to exclude all irrelevant network devices, and now it works. Thank you all the same.
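The fix described above, passing every irrelevant interface to --mca btl_tcp_if_exclude, can be sketched as a small Python helper that assembles the mpirun command. This is only a sketch under stated assumptions: build_mpirun_command is a hypothetical helper, docker0 is a hypothetical interface name (the thread does not say which extra devices were excluded), and one process per host is assumed, as in the original command. The helper also adds lo to the list, on the assumption that setting btl_tcp_if_exclude replaces Open MPI's default value, which normally excludes the loopback interface.

```python
import shlex


def build_mpirun_command(hosts, script, exclude_ifaces, extra_env=()):
    """Assemble an mpirun argument list that excludes interfaces from the TCP BTL.

    Hypothetical helper for illustration; assumes one process per host.
    """
    # Assumption: keep loopback excluded, since an explicit exclude list
    # replaces Open MPI's default one. dict.fromkeys drops duplicates
    # while preserving order.
    excluded = list(dict.fromkeys(["lo", *exclude_ifaces]))
    cmd = [
        "mpirun", "-np", str(len(hosts)),
        "-H", ",".join(hosts),
        "-bind-to", "none", "-map-by", "slot",
    ]
    for var in extra_env:
        cmd += ["-x", var]
    cmd += ["--mca", "btl_tcp_if_exclude", ",".join(excluded)]
    cmd += ["python", script]
    return cmd


if __name__ == "__main__":
    cmd = build_mpirun_command(
        hosts=["ml-208", "ml-209", "ml-210"],
        script="tensorflow_word2vec.py",
        # virbr0 and virbr0-nic come from the original command;
        # docker0 is a hypothetical example of another irrelevant device.
        exclude_ifaces=["virbr0", "virbr0-nic", "docker0"],
        extra_env=["NCCL_DEBUG=DEBUG", "LD_LIBRARY_PATH"],
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The interface names present on a machine can be listed with socket.if_nameindex() from the Python standard library (Unix only), or with ip link. If only one interface should carry MPI traffic, Open MPI's btl_tcp_if_include parameter is the alternative to a long exclude list; the two parameters should not be set together.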

I'll close this issue as it appears to be resolved.
