Thanks for your great work! However, I run into problems when the working directory and Anaconda environment differ between workers. When I run the following command on the host (worker0):
mpirun -np 8 \
-H worker0:4,worker1:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
it simply reports that train.py cannot be found on worker1. In addition, when I run which python, only worker0 returns the interpreter I expect, e.g., /home/tan/anaconda2/envs/pointnet2/bin/python. worker1 returns /home/tan/anaconda2/bin/python, the default Python environment, which is not what I want.
Everything works fine when tested on worker0 with the following command (of course the working directory and conda env have been switched to the correct ones with source activate):
mpirun -np 4 \
-H worker0:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python train.py
I know that putting the files in the same directory and giving the conda env the same name on every worker, then specifying them on the command line, may solve the problem, but is there a better way? Thank you very much!
UPDATE:
Inspired by this, I changed the command to:
conda_env_worker0="/home/tan/anaconda2/envs/<ENV_NAME@worker0>/bin/python"
program_dir_worker0="<FILE_PATH@worker0>/train.py"
conda_env_worker1="/home/tan/anaconda2/envs/<ENV_NAME@worker1>/bin/python"
program_dir_worker1="<FILE_PATH@worker1>/train.py"
# NOTE: The Anaconda home directory is the same on worker0 and worker1.
mpirun -np 4 \
-H worker0:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-mca pml ob1 -mca btl ^openib \
$conda_env_worker0 $program_dir_worker0 : \
-H worker1:4 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-mca pml ob1 -mca btl ^openib \
$conda_env_worker1 $program_dir_worker1
And now it finally works.
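With more than two workers, typing one app context per host by hand gets error-prone, so the same colon-separated pattern could be generated instead. Below is a minimal sketch in POSIX sh. The helper name build_mpmd_cmd and the "host|slots|python|script" argument format are my own illustration, not part of Open MPI or Horovod, and the paths in the usage line are placeholders. It only prints the command so it can be checked before running. Unlike the command above, it writes an explicit -np into every context, following the MPMD examples in the mpirun man page.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) an MPMD mpirun command.
# Each argument describes one worker as "host|slots|python|script".
# The body runs in a subshell, so its variables do not leak to the caller.
build_mpmd_cmd() (
  cmd="mpirun"
  sep=""
  for spec in "$@"; do
    # Split the spec on '|' using POSIX parameter expansion.
    host=${spec%%|*};   rest=${spec#*|}
    slots=${rest%%|*};  rest=${rest#*|}
    python=${rest%%|*}; script=${rest#*|}
    # One app context per worker; contexts are separated by " :".
    cmd="$cmd$sep -np $slots -H $host:$slots -bind-to none -map-by slot"
    cmd="$cmd -x NCCL_DEBUG=INFO -mca pml ob1 -mca btl ^openib"
    cmd="$cmd $python $script"
    sep=" :"
  done
  printf '%s\n' "$cmd"
)

# Usage with placeholder paths (replace with the real ones per worker):
build_mpmd_cmd \
  "worker0|4|/path/to/env0/bin/python|/path/on/worker0/train.py" \
  "worker1|4|/path/to/env1/bin/python|/path/on/worker1/train.py"
```

Once the printed line looks right, it can be pasted into the terminal. Paths containing spaces are not handled by this sketch.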