I tested tensorflow_word2vec.py from the examples:
mpirun -np 3 \
-H ml-208,ml-209,ml-210 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=DEBUG -x LD_LIBRARY_PATH \
--mca btl_tcp_if_exclude virbr0,virbr0-nic -d \
python tensorflow_word2vec.py
I used the above command, but there was no output and no error.
Which version of Open MPI are you using? Do you have multiple versions of Open MPI installed? I have seen this behavior before when the Open MPI in PATH was different from the Open MPI the application was compiled with.
Another idea is to run strace mpirun ... and check where it gets stuck.
I am using Open MPI 3.0.0. I found the cause: my machines have many network devices, and MPI did not know which one to use. After I used the --mca btl_tcp_if_exclude parameter to exclude all irrelevant network devices, it works. Thank you all the same.
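For anyone hitting the same hang, here is a rough sketch of how the exclude list could be built. The helper name build_exclude_list and the choice of eth0 as the interface to keep are assumptions for illustration, not something from this thread; substitute the interface your nodes actually use to reach each other. Note that setting btl_tcp_if_exclude replaces Open MPI's default value, so the loopback device should stay in the list.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated list of every interface
# except the one to keep.
# Usage: build_exclude_list KEEP IFACE...
build_exclude_list() {
    keep="$1"
    shift
    list=""
    for iface in "$@"; do
        if [ "$iface" != "$keep" ]; then
            # Append with a comma only when the list is already non-empty.
            list="${list:+$list,}$iface"
        fi
    done
    printf '%s\n' "$list"
}

# Assumption: eth0 is the interface to keep. On Linux, /sys/class/net
# lists the interfaces present on this machine (lo included).
exclude=$(build_exclude_list eth0 $(ls /sys/class/net 2>/dev/null))
echo "--mca btl_tcp_if_exclude $exclude"
```

The printed flag can then be passed to mpirun in place of the hand-written list. If only one interface matters, --mca btl_tcp_if_include with that single name is an alternative, but the include and exclude parameters should not be combined.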
I'll close this issue as it appears to be resolved.